marton-bod opened a new pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752


   Hive serializes the Iceberg table object into each individual split so that 
the table object can be deserialized on the executor side for efficient 
reads/writes (also because executors might not be authorized to communicate 
with the metastore to load the table). 
   
   Since the FileIO is part of the table and carries its own Hadoop 
configuration, this configuration is the dominant factor in the size of the 
serialized split. In our tests, this serialized config made Iceberg splits 
15-20x larger than normal Hive splits (which led to OOM in some of our perf 
tests). 
   
   This PR proposes a config option that turns off this config serialization 
and lets the deserializer side fill in the config values instead (which works 
for Hive executors, since they already have all the config values at hand). 
Based on local tests, this can reduce the split size by ~20x.
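
   To illustrate the idea only, here is a minimal, self-contained sketch of 
the mechanism. It is not the code in this PR: `SketchFileIO`, the 
`serializeConf` switch and `fillConfIfMissing` are hypothetical names, and a 
plain `Map<String, String>` stands in for the Hadoop configuration. It shows a 
serializable object that either ships its config or leaves it out and has it 
filled in after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class SplitConfigSketch {

  // Hypothetical stand-in for a FileIO that carries its own configuration.
  static class SketchFileIO implements Serializable {
    private static final long serialVersionUID = 1L;

    private final boolean serializeConf;        // hypothetical on/off switch
    private transient Map<String, String> conf; // written manually below

    SketchFileIO(Map<String, String> conf, boolean serializeConf) {
      this.conf = new HashMap<>(conf);
      this.serializeConf = serializeConf;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
      out.defaultWriteObject();
      // Write the full config only when the switch is on; otherwise write null.
      out.writeObject(serializeConf ? conf : null);
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in)
        throws IOException, ClassNotFoundException {
      in.defaultReadObject();
      conf = (Map<String, String>) in.readObject();
    }

    // Called on the deserializer side to fill in a config that was not shipped.
    void fillConfIfMissing(Map<String, String> localConf) {
      if (conf == null) {
        conf = new HashMap<>(localConf);
      }
    }

    Map<String, String> conf() {
      return conf;
    }
  }

  static byte[] serialize(Serializable obj) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
    return bytes.toByteArray();
  }

  static SketchFileIO deserialize(byte[] data) {
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (SketchFileIO) in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Made-up properties standing in for a large job configuration.
    Map<String, String> jobConf = new HashMap<>();
    for (int i = 0; i < 1000; i++) {
      jobConf.put("example.property." + i, "value-" + i);
    }

    byte[] withConf = serialize(new SketchFileIO(jobConf, true));
    byte[] withoutConf = serialize(new SketchFileIO(jobConf, false));
    System.out.println("with config:    " + withConf.length + " bytes");
    System.out.println("without config: " + withoutConf.length + " bytes");

    // The "executor" already has the config locally and fills it back in.
    SketchFileIO restored = deserialize(withoutConf);
    restored.fillConfIfMissing(jobConf);
    System.out.println("entries after fill-in: " + restored.conf().size());
  }
}
```

   The size ratio printed by this sketch depends entirely on the made-up 
properties and says nothing about real splits. The 15-20x figure above comes 
from our tests.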


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


