jackye1995 commented on PR #7194:
URL: https://github.com/apache/iceberg/pull/7194#issuecomment-1483284297

   Thanks for putting this up @rajarshisarkar. 
   
   For some background, Brooklyn Data's 
[blog](https://brooklyndata.co/blog/benchmarking-open-table-formats) reported that 
Iceberg read workloads were 7x-8x slower than Delta after an UPSERT command 
added 92,000 small files. 
   
   We reproduced the setup internally and saw a speed-up of up to 6.8x in 
Iceberg read queries after compacting the small files. The community also saw a 
6.25x improvement in read query performance after compaction on a 25 MB dataset 
of 100,000 records, as reported in https://github.com/apache/iceberg/issues/5997. 
   
   I understand the gap exists because we want to decouple optimization from 
reads and writes, but I am curious whether we could provide some out-of-the-box 
vendor optimization integrations this way, through the metrics reporter, for 
users who do not want to use any auto-optimization solution. 
   
   @nastra @Fokko @rdblue @danielcweeks please let us know if this is something 
that the community is interested in taking, or if not, how we could add some 
integrations in a community-friendly way to close the gap in table format 
comparisons like this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

