huaxingao commented on PR #14297: URL: https://github.com/apache/iceberg/pull/14297#issuecomment-3428714379
In Spark DSv2, planning and validation happen on the driver. `BatchWrite#createBatchWriterFactory` runs on the driver and returns a `DataWriterFactory` that is serialized to the executors, so the factory must already carry the write schema the executors will use when they create their `DataWriter`s. For a shredded variant, we don't know the shredded schema at planning time; we have to inspect some records to derive it. Running a read on the driver inside `createBatchWriterFactory` would mean launching a second job during planning, which is not how DSv2 is intended to work.

Because of that, the currently proposed [Spark approach](https://github.com/apache/spark/pull/52406/) is to put the logical variant type in the writer factory, then, on the executor, buffer the first N rows, infer the shredded schema from the data, initialize the concrete writer, and flush the buffer through it. I believe this PR follows the same approach, which seems like a practical solution to me given DSv2's constraints.
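For readers less familiar with DSv2, here is a minimal sketch of what that executor-side buffering could look like. This is not the actual code from either PR; the class name `BufferingVariantWriter` and the helpers `inferShreddedSchema` and `createDelegate` are hypothetical placeholders standing in for the real schema-inference and writer-construction steps:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

class BufferingVariantWriter implements DataWriter<InternalRow> {
  private final int sampleSize;                // N rows to buffer before inferring
  private final List<InternalRow> buffer = new ArrayList<>();
  private DataWriter<InternalRow> delegate;    // concrete writer, created lazily

  BufferingVariantWriter(int sampleSize) {
    this.sampleSize = sampleSize;
  }

  @Override
  public void write(InternalRow row) throws IOException {
    if (delegate != null) {
      delegate.write(row);
      return;
    }
    buffer.add(row.copy());                    // rows may be reused upstream; copy before buffering
    if (buffer.size() >= sampleSize) {
      initializeAndFlush();
    }
  }

  private void initializeAndFlush() throws IOException {
    // Hypothetical steps: derive the shredded schema from the sampled rows,
    // then build the real file writer and replay the buffer through it.
    StructType shredded = inferShreddedSchema(buffer);
    delegate = createDelegate(shredded);
    for (InternalRow buffered : buffer) {
      delegate.write(buffered);
    }
    buffer.clear();
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    if (delegate == null) {
      initializeAndFlush();  // fewer than N rows total: infer from what we have
    }
    return delegate.commit();
  }

  @Override
  public void abort() throws IOException {
    if (delegate != null) {
      delegate.abort();
    }
  }

  @Override
  public void close() throws IOException {
    if (delegate != null) {
      delegate.close();
    }
  }

  // Illustrative placeholders only; the real logic lives in the PRs above.
  private StructType inferShreddedSchema(List<InternalRow> rows) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }

  private DataWriter<InternalRow> createDelegate(StructType shreddedSchema) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
```

The key point is that the `DataWriterFactory` only needs the logical variant type to construct this wrapper; the shredded schema is derived per task from the rows it actually sees, which fits DSv2's driver/executor split without a second job at planning time.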
