pan3793 commented on PR #6831:
URL: https://github.com/apache/kyuubi/pull/6831#issuecomment-2523230494

   > in the scenario of merging small files, we only need to consider the 
shuffle data size (this rule is only for shuffle data to file, doesn't matter 
what the data source is).
   
   I overlooked this, you are right.
   
   I read Iceberg's code and understand how it works. I am a little bit 
pessimistic about adopting it, because the real compression ratio depends on 
the data itself, so the assumption does not always hold; when the estimate 
deviates significantly from the real value, it is hard to explain to users 
why that happened. Also, files written by a Spark job are likely read by 
other Spark jobs, and the data will be converted back to Spark's InternalRow 
layout (the same as Shuffle) again, so has the compression ratio been 
considered on the read code path too?
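   
   To illustrate my concern with some made-up numbers (this is only a rough 
sketch, not how Iceberg actually computes it):
   
   ```scala
   // Back-of-the-envelope sketch only; the ratio and sizes below are made up
   // to show how a wrong estimate surfaces in the written files.
   val targetFileSizeBytes     = 128L * 1024 * 1024 // desired on-disk file size: 128 MiB
   val assumedCompressionRatio = 4.0                // assumed shuffle-bytes / file-bytes
   
   // Advisory partition size derived from the assumed ratio: 512 MiB of
   // shuffle data per output task should compress down to ~128 MiB on disk.
   val advisorySizeBytes = (targetFileSizeBytes * assumedCompressionRatio).toLong
   
   // If the real ratio for this particular dataset is only 2x, each ~512 MiB
   // shuffle partition instead produces a ~256 MiB file, twice the target.
   val realCompressionRatio = 2.0
   val realFileSizeBytes    = (advisorySizeBytes / realCompressionRatio).toLong
   ```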
   
   I think it would be nice to add a section to the docs explaining how 
`advisoryPartitionSizeInBytes` affects the written file size, with some 
tuning guidance.
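   
   For example, the docs could include a hedged snippet along these lines 
(the value is illustrative, not a recommended default):
   
   ```scala
   // Illustrative only: a larger advisory size lets AQE coalesce more shuffle
   // data into each task, which generally produces larger (and fewer) files.
   // If the written files come out smaller than expected, raise this value;
   // if they come out too large, lower it.
   spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")
   ```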


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

