kbendick commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915492512


   > @WinkerDu I definitely agree that the v2 bin-pack algorithm should be improved to
consider the total size of insert & delete files. I think the `iterms-per-bin` option
proposed by your team is trying to resolve the unbalanced-bin issue, but I'm concerned
that it's hard to set the correct `iterms-per-bin` value for a given table in a real
production environment, because `iterms-per-bin` still only controls the data file
count. We don't really have a suitable way to evaluate the combined cost of a data
file's size and its delete records. I think we need a more accurate approach to decide
which scan tasks should be dispatched to which rewrite tasks.
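
   To make that concrete, here's a rough sketch (purely illustrative, not anything in Iceberg today) of weighting each scan task by its data bytes plus the bytes of its attached delete files, so packing balances total rewrite work rather than raw data-file size or file count:

```java
// Illustrative sketch only, not Iceberg's current implementation: weight each
// FileScanTask by its data file size plus the size of the delete files attached
// to it, so a packer can balance the total bytes a rewrite task has to process.
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.FileScanTask;

public class DeleteAwareWeights {
  private DeleteAwareWeights() {}

  /** Estimated rewrite cost of one task: data bytes + associated delete bytes. */
  public static long weight(FileScanTask task) {
    long deleteBytes = 0L;
    for (DeleteFile deleteFile : task.deletes()) {
      deleteBytes += deleteFile.fileSizeInBytes();
    }

    return task.file().fileSizeInBytes() + deleteBytes;
  }
}
```

   A weight like that could presumably be fed into whatever packing utility the bin-pack strategy already uses, instead of capping on the number of files per bin.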
   
   I need to spend some time looking more closely at the test cases (and probably try
this out on some V2 tables), but I share the concern that this config value might be
really hard to determine in a production environment. This is especially true for Flink
users with CDC streams: oftentimes the upstream database will experience a burst of
deletes / updates due to some cron schedule and will then have an outsized number of
delete files for a period of time (assuming the table is partitioned by time as well).
   
   I'm wondering how we would go about picking a good number (or how often one would
need to set a non-default value, or change the value for individual sections of the
table).
   
   That said, I'm also not averse to adding another argument that makes bin packing
more useful in the near term while we figure out the best way to build a more "V2
native" algorithm / parameter set.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


