openinx commented on pull request #3073: URL: https://github.com/apache/iceberg/pull/3073#issuecomment-914997015
@WinkerDu I definitely agree that the bin-pack algorithm should be improved for v2 to consider the total size of insert and delete files. I think the `iterms-per-bin` option proposed by your team is trying to resolve the imbalance issue, but I'm concerned that it is hard to set the correct `iterms-per-bin` value for a given table in a real production environment, because `iterms-per-bin` still controls only the number of data files. We don't actually have a suitable way to evaluate the cost of joining a data file with its delete records. I think we need a more accurate approach to decide how scan tasks should be dispatched to different tasks.
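
To make the "total size of insert & delete files" idea concrete, here is a rough sketch of packing scan tasks by combined byte size instead of by file count. It is not Iceberg's implementation: `ScanTask`, `weight`, `pack_by_total_size`, and `target_bytes` are hypothetical names, and using total bytes as the weight is only an assumed stand-in for the real cost of joining a data file with its delete records, which we still cannot evaluate accurately.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScanTask:
    """Hypothetical stand-in for a scan task: one data file plus its delete files."""
    data_file_bytes: int
    delete_file_bytes: List[int] = field(default_factory=list)

    def weight(self) -> int:
        # Assumption: cost is approximated by total bytes read (data + deletes).
        return self.data_file_bytes + sum(self.delete_file_bytes)


def pack_by_total_size(tasks: List[ScanTask], target_bytes: int) -> List[List[ScanTask]]:
    """Greedy first-fit packing: put each task in the first bin that has room for it."""
    bins: List[List[ScanTask]] = []
    bin_weights: List[int] = []
    for task in tasks:
        w = task.weight()
        for i, used in enumerate(bin_weights):
            if used + w <= target_bytes:
                bins[i].append(task)
                bin_weights[i] += w
                break
        else:
            # No existing bin has room (or the task alone exceeds the target),
            # so open a new bin for it.
            bins.append([task])
            bin_weights.append(w)
    return bins
```

With this weighting, a small data file that carries large delete files no longer looks cheap: it fills a bin the same way a large data file with no deletes would, which a pure file-count limit cannot express.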
