rdblue commented on pull request #3292:
URL: https://github.com/apache/iceberg/pull/3292#issuecomment-962684130


   @RussellSpitzer, if I understand what this is doing correctly, it looks like 
this is fixing two problems. First, Iceberg can probably pack tasks more 
densely at read time. The 75/75/75/75 situation you describe also applies to 
regular task planning and that's why you're changing the config there. Second, 
the bin packing strategy needs to configure the scan it creates with the new 
config you introduced. But that strategy has to guess at how far files can be 
split.
   
   I think that we may be able to do this in a more straightforward way. 
Instead of using two configs, a size to split tasks down to and a size to 
combine up to, I think we can update just the split code to be a little smarter.
   
   We already have a `split` implementation of `FileScanTask` that uses row 
group offsets. That's the smallest chunk of a Parquet or ORC file that we can 
split down to. What if we produce one task per row group rather than trying to 
guess how far down to split? Then we could combine just like normal. We could 
also avoid producing a ton of tiny splits by using a heuristic based on the 
original target split size: in `FileScanTask.split`, only split down to about 
`targetSize / 4` rather than all the way to individual row groups.
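
   As a rough sketch of that heuristic (not the actual `FileScanTask.split` 
code): the `Split` and `RowGroupSplitter` classes below are hypothetical 
stand-ins, and the sketch assumes the row group start offsets are sorted and 
the file length is known. It cuts only at row group boundaries and only once 
a chunk has reached `targetSize / 4`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one split of a data file; not Iceberg's FileScanTask.
final class Split {
  final String file;
  final long start;
  final long length;

  Split(String file, long start, long length) {
    this.file = file;
    this.start = start;
    this.length = length;
  }

  @Override
  public String toString() {
    return file + "[" + start + ", " + (start + length) + ")";
  }
}

// Hypothetical helper illustrating the "split down to targetSize / 4" heuristic.
final class RowGroupSplitter {
  private RowGroupSplitter() {}

  // offsets: sorted start offsets of the row groups; fileLength: total bytes.
  static List<Split> split(String file, long[] offsets, long fileLength, long targetSize) {
    long minSize = Math.max(1, targetSize / 4);
    List<Split> splits = new ArrayList<>();
    if (offsets.length == 0) {
      // No known split points: keep the file as a single split.
      splits.add(new Split(file, 0, fileLength));
      return splits;
    }

    long start = offsets[0];
    for (int i = 1; i < offsets.length; i++) {
      // Cut at a row group boundary only once the chunk is at least minSize.
      if (offsets[i] - start >= minSize) {
        splits.add(new Split(file, start, offsets[i] - start));
        start = offsets[i];
      }
    }

    // The trailing chunk may be smaller than minSize.
    splits.add(new Split(file, start, fileLength - start));
    return splits;
  }
}
```

   With eight 10-byte row groups and a target size of 100, the minimum chunk 
is 25 bytes, so this yields three splits instead of eight. The combine step 
can then pack those chunks as it normally does.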
   
   What do you think? That would simplify configuration and make it possible 
for us to more densely pack splits for normal scans as well as those for 
compaction.
   
   If you like it, I think we would also want to add a way to combine tasks for 
the same file back into larger splits. For example, `FileScanTask(a.parquet, 0, 
50)` and `FileScanTask(a.parquet, 50, 100)` in the same combined task would be 
rewritten as `FileScanTask(a.parquet, 0, 100)`.
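
   A minimal sketch of that merge step, assuming the numbers in the example 
are start and end offsets and that the tasks inside a combined task are 
already ordered by file and offset. `FileRange` and `AdjacentRangeMerger` are 
hypothetical names used for illustration, not existing Iceberg classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a task over a byte range of a file (end is exclusive).
final class FileRange {
  final String file;
  final long start;
  final long end;

  FileRange(String file, long start, long end) {
    this.file = file;
    this.start = start;
    this.end = end;
  }

  @Override
  public String toString() {
    return "FileRange(" + file + ", " + start + ", " + end + ")";
  }
}

// Hypothetical helper that rewrites adjacent ranges of the same file as one range.
final class AdjacentRangeMerger {
  private AdjacentRangeMerger() {}

  // Assumes ranges are ordered so that contiguous ranges of a file are neighbors.
  static List<FileRange> merge(List<FileRange> ranges) {
    List<FileRange> merged = new ArrayList<>();
    for (FileRange next : ranges) {
      int last = merged.size() - 1;
      if (last >= 0
          && merged.get(last).file.equals(next.file)
          && merged.get(last).end == next.start) {
        // Same file and contiguous offsets: extend the previous range.
        FileRange prev = merged.get(last);
        merged.set(last, new FileRange(prev.file, prev.start, next.end));
      } else {
        merged.add(next);
      }
    }
    return merged;
  }
}
```

   Ranges from different files, or ranges of the same file with a gap between 
them, are left as separate entries.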

