rdblue commented on pull request #3292: URL: https://github.com/apache/iceberg/pull/3292#issuecomment-962684130
@RussellSpitzer, if I understand what this is doing correctly, it looks like this is fixing two problems.

First, Iceberg can probably pack tasks more densely at read time. The 75/75/75/75 situation you describe also applies to regular task planning, and that's why you're changing the config there.

Second, the bin packing strategy needs to configure the scan it creates with the new config you introduced. However, it has to guess at how far it can split files.

I think we may be able to do this in a more straightforward way. Instead of using two configs, a size to split tasks down to and a size to combine up to, I think we can update just the split code to be a little smarter. We already have a `split` implementation of `FileScanTask` that uses row group offsets. A row group is the smallest chunk of a Parquet or ORC file that we can split down to. What if we produce one task per row group rather than trying to guess how far down to split? Then we could combine just like normal.

We could also avoid producing a ton of tiny splits by basing this on a heuristic and the original target split size: in `FileScanTask.split`, actually split down to `targetSize / 4`.

What do you think? That would simplify configuration and make it possible for us to pack splits more densely for normal scans as well as for compaction.

If you like it, I think we would also want to add a way to combine tasks for the same file back into larger splits. For example, `FileScanTask(a.parquet, 0, 50)` and `FileScanTask(a.parquet, 50, 100)` in the same combined task would be rewritten as `FileScanTask(a.parquet, 0, 100)`.
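
To make the proposal concrete, here is a rough sketch of the two steps: splitting at row group offsets down to roughly `targetSize / 4`, then merging adjacent ranges of the same file back together. Everything in it is hypothetical and is not Iceberg's API: `Split` is a stand-in for a file scan task, written as `(file, start, end)` to match the notation in the example above, and `splitByRowGroups` and `mergeAdjacent` are illustrative names. It assumes the row group offsets are sorted and that the splits passed to `mergeAdjacent` are ordered by file and start offset.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
  // Hypothetical stand-in for a file scan task: the byte range [start, end) of one file.
  public static class Split {
    public final String file;
    public final long start;
    public final long end;

    public Split(String file, long start, long end) {
      this.file = file;
      this.start = start;
      this.end = end;
    }
  }

  // Cut a file at row group boundaries, closing a split only once it has
  // reached targetSize / 4 bytes. The final split may be smaller than that.
  public static List<Split> splitByRowGroups(
      String file, long fileLength, long[] rowGroupOffsets, long targetSize) {
    long minSize = targetSize / 4; // heuristic from the comment, not a real config
    List<Split> splits = new ArrayList<>();
    long start = rowGroupOffsets[0];
    for (int i = 1; i < rowGroupOffsets.length; i++) {
      if (rowGroupOffsets[i] - start >= minSize) {
        splits.add(new Split(file, start, rowGroupOffsets[i]));
        start = rowGroupOffsets[i];
      }
    }
    splits.add(new Split(file, start, fileLength));
    return splits;
  }

  // Within one combined task, rewrite contiguous ranges of the same file as a
  // single range. Assumes the input is ordered by file and start offset.
  public static List<Split> mergeAdjacent(List<Split> splits) {
    List<Split> merged = new ArrayList<>();
    for (Split next : splits) {
      int last = merged.size() - 1;
      if (last >= 0
          && merged.get(last).file.equals(next.file)
          && merged.get(last).end == next.start) {
        Split prev = merged.get(last);
        merged.set(last, new Split(prev.file, prev.start, next.end));
      } else {
        merged.add(next);
      }
    }
    return merged;
  }
}
```

With this sketch, passing `(a.parquet, 0, 50)` and `(a.parquet, 50, 100)` to `mergeAdjacent` yields the single range `(a.parquet, 0, 100)`, while ranges from different files or with a gap between them are left as separate entries.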
