huaxingao commented on PR #14297:
URL: https://github.com/apache/iceberg/pull/14297#issuecomment-3428714379

   In Spark DSv2, planning/validation happens on the driver. 
`BatchWrite#createBatchWriterFactory` runs on the driver and returns a 
`DataWriterFactory` that is serialized to executors. That factory must already 
carry the write schema the executors will use when they create DataWriters.
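
   To illustrate that contract, here is a minimal sketch assuming Spark's 
`org.apache.spark.sql.connector.write` interfaces; the `MyBatchWrite` and 
`MyWriterFactory` names are hypothetical:

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

// Hypothetical names; the interfaces are Spark's DSv2 write API.
class MyBatchWrite implements BatchWrite {
  private final StructType writeSchema; // must be known at planning time

  MyBatchWrite(StructType writeSchema) {
    this.writeSchema = writeSchema;
  }

  @Override
  public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
    // Runs on the driver: anything the executors need must be captured
    // here, because the factory is serialized and shipped as-is.
    return new MyWriterFactory(writeSchema);
  }

  @Override
  public void commit(WriterCommitMessage[] messages) { }

  @Override
  public void abort(WriterCommitMessage[] messages) { }
}

class MyWriterFactory implements DataWriterFactory {
  private final StructType writeSchema; // StructType is Serializable

  MyWriterFactory(StructType writeSchema) {
    this.writeSchema = writeSchema;
  }

  @Override
  public DataWriter<InternalRow> createWriter(int partitionId, long taskId) {
    // Runs on the executor after deserialization; DataWriters are created
    // from the state the factory carried over from the driver.
    throw new UnsupportedOperationException(
        "placeholder; see the buffering writer sketch below");
  }
}
```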
   
   For shredded variants, we don’t know the shredded schema at planning time; we 
have to inspect some records to derive it. Doing a read on the driver during 
`createBatchWriterFactory` would mean launching a second Spark job inside 
planning, which is not how DSv2 is intended to work.
   
   Because of that, the currently proposed [Spark 
approach](https://github.com/apache/spark/pull/52406/) is: put the logical 
variant in the writer factory; on the executor, buffer the first N rows, infer 
the shredded schema from the data, then initialize the concrete writer and 
flush the buffer. I believe this PR follows the same approach, which seems like 
a practical solution to me given DSv2's constraints. A sketch of that 
executor-side flow is included below.
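
   A hedged sketch of that executor-side buffering, assuming Spark's DSv2 
`DataWriter` interface; `inferShreddedSchema`, `newConcreteWriter`, and 
`SAMPLE_SIZE` are hypothetical stand-ins for the actual schema-inference and 
file-writer plumbing:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

class BufferingVariantWriter implements DataWriter<InternalRow> {
  private static final int SAMPLE_SIZE = 1000; // hypothetical N

  private final StructType logicalSchema;      // carried in the factory
  private final List<InternalRow> buffer = new ArrayList<>();
  private DataWriter<InternalRow> delegate;    // concrete writer, created lazily

  BufferingVariantWriter(StructType logicalSchema) {
    this.logicalSchema = logicalSchema;
  }

  @Override
  public void write(InternalRow row) throws IOException {
    if (delegate != null) {
      delegate.write(row);
      return;
    }
    // Buffer rows until there are enough samples to derive the shredded schema.
    buffer.add(row.copy()); // copy: Spark reuses InternalRow instances
    if (buffer.size() >= SAMPLE_SIZE) {
      initializeAndFlush();
    }
  }

  private void initializeAndFlush() throws IOException {
    // Hypothetical: inspect buffered variant values to derive the shredded schema.
    StructType shreddedSchema = inferShreddedSchema(logicalSchema, buffer);
    delegate = newConcreteWriter(shreddedSchema);
    for (InternalRow buffered : buffer) {
      delegate.write(buffered);
    }
    buffer.clear();
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    if (delegate == null) {
      // Fewer than SAMPLE_SIZE rows seen: infer from whatever was buffered.
      initializeAndFlush();
    }
    return delegate.commit();
  }

  @Override
  public void abort() throws IOException {
    if (delegate != null) {
      delegate.abort();
    }
  }

  @Override
  public void close() throws IOException {
    if (delegate != null) {
      delegate.close();
    }
  }

  // --- hypothetical helpers, not part of Spark or Iceberg APIs ---
  private StructType inferShreddedSchema(StructType logical, List<InternalRow> sample) {
    throw new UnsupportedOperationException("placeholder for schema inference");
  }

  private DataWriter<InternalRow> newConcreteWriter(StructType shredded) {
    throw new UnsupportedOperationException("placeholder for concrete writer creation");
  }
}
```

   The `row.copy()` matters because Spark reuses `InternalRow` instances across 
`write` calls, and flushing again in `commit()` covers partitions that never 
reach N rows.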
   
   
   

