aokolnychyi commented on PR #7430: URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526779646
Can we identify the exact scenarios in which the default split size performs poorly and check whether we can solve the underlying problem? For instance, if the scheduler is FIFO, can we use the default cluster parallelism and the size of the data to be processed to come up with an optimal split size? We first find matching files and then plan splits, so the split size can be dynamic; we just need a good way to estimate it correctly. I am not going to oppose a SQL config, but I don't think we should rely on an internal SQL property for built-in file sources. Thoughts, @puchengy @RussellSpitzer @szehon-ho @singhpk234 @rdblue?
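
To make the dynamic split size idea concrete, here is a rough sketch and not a proposed implementation. It assumes the total size of the matching files and the cluster's default parallelism are both known before splits are planned. The function name, the 16 MiB floor, and the clamping policy are all hypothetical; the 128 MiB ceiling mirrors the default `read.split.target-size`.

```python
MIB = 1024 * 1024


def estimate_split_size(total_bytes, parallelism,
                        min_split_bytes=16 * MIB,
                        max_split_bytes=128 * MIB):
    """Pick a split size that aims for about one split per available slot.

    Hypothetical helper, not an Iceberg or Spark API. The bounds are
    illustrative assumptions: the floor avoids a flood of tiny tasks and
    the ceiling avoids oversized tasks that may spill.
    """
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    if min_split_bytes > max_split_bytes:
        raise ValueError("min_split_bytes must not exceed max_split_bytes")

    # Ceiling division: bytes each slot would read if the data were spread evenly.
    per_slot_bytes = -(-total_bytes // parallelism)

    # Clamp the even share into the allowed range.
    return max(min_split_bytes, min(per_slot_bytes, max_split_bytes))


if __name__ == "__main__":
    # 1 GiB of matching data on a cluster with 32 slots.
    print(estimate_split_size(1024 * MIB, 32) // MIB, "MiB")
```

With these assumed bounds, a small scan is cut into finer splits so that every slot gets work, while a large scan falls back to the configured ceiling.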
