pan3793 commented on PR #6831: URL: https://github.com/apache/kyuubi/pull/6831#issuecomment-2523230494
> in the scenario of merging small files, we only need to consider the shuffle data size (this rule is only for shuffle data to file, doesn't matter what the data source is).

I overlooked this, you are right.

I read Iceberg's code and understand how it works, but I am a little pessimistic about adopting it, because the real compression ratio depends on the data itself. The assumption is not always true, and when the estimate deviates significantly from the real value, it is hard to explain to users why that happens.

Files written by a Spark job are likely read by other Spark jobs, and the data will be converted to Spark's InternalRow layout (same as shuffle) again. Has the compression ratio been considered on the read code path too?

I think it would be nice to add a section to the docs explaining how `advisoryPartitionSizeInBytes` affects the written file size, with some tuning guidance.
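To illustrate the concern about the estimate deviating from reality, here is a minimal sketch (not Kyuubi or Iceberg code; the function names and the 4x ratio are hypothetical) of the general compression-ratio heuristic: inflate the advisory shuffle partition size by an assumed in-memory-to-on-disk compression ratio so that each partition compresses down to roughly the target file size. If the real ratio differs, the written files miss the target proportionally.

```python
# Hypothetical sketch of a compression-ratio heuristic for sizing
# shuffle partitions so they compress to a target on-disk file size.

def advisory_partition_size(target_file_bytes: int, est_compression_ratio: float) -> int:
    """Inflate the desired on-disk file size by the estimated ratio of
    in-memory (InternalRow) bytes to compressed on-disk bytes."""
    return int(target_file_bytes * est_compression_ratio)

def written_file_size(shuffle_partition_bytes: int, real_compression_ratio: float) -> int:
    """What actually lands on disk once the writer compresses the partition."""
    return int(shuffle_partition_bytes / real_compression_ratio)

if __name__ == "__main__":
    target = 128 * 1024 * 1024  # aim for ~128 MiB files
    part = advisory_partition_size(target, est_compression_ratio=4.0)

    # If the estimate matches reality, files land on target: ~128 MiB.
    print(written_file_size(part, real_compression_ratio=4.0) // (1024 * 1024))

    # But if this dataset compresses 8x instead of the assumed 4x,
    # files come out at half the intended size: ~64 MiB.
    print(written_file_size(part, real_compression_ratio=8.0) // (1024 * 1024))
```

This is exactly the failure mode above: the estimate is fixed per table or job, while the real ratio varies with the data, so a doc section with tuning guidance would help users reason about the gap.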