marton-bod opened a new pull request #3752: URL: https://github.com/apache/iceberg/pull/3752
Hive serializes the Iceberg table object into every split so that executors can deserialize it and read or write efficiently without reloading the table. This also matters because executors might not be authorized to communicate with the metastore. The table's FileIO carries its own Hadoop configuration, and that configuration dominates the size of the serialized split. In our tests, the serialized config made Iceberg splits 15-20x larger than normal Hive splits, which led to OOM in some of our perf tests.

This PR proposes a config option that turns off serialization of the Hadoop configuration and lets the deserializing side fill in the config values instead. This works for Hive executors, since they already have all the config values at hand. Based on local tests, this can reduce the split size by ~20x.
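To make the idea concrete, here is a minimal sketch of the mechanism. It is not the code in this PR: it is written in Python rather than Iceberg's Java, JSON stands in for Java serialization, and every name in it (`SplitSketch`, `serialize_conf`, `make_conf`, the `hypothetical.key.*` entries) is an assumption made up for illustration. It only shows the shape of the trade-off: drop the config from the serialized split when the flag is off, and let the deserializing side supply it.

```python
import json


def make_conf(n=500):
    """Build a made-up config map standing in for a large Hadoop configuration."""
    return {f"hypothetical.key.{i}": f"value-{i}" for i in range(n)}


class SplitSketch:
    """Hypothetical split that carries a file path plus the FileIO's config."""

    def __init__(self, path, conf, serialize_conf=True):
        self.path = path
        self.conf = dict(conf)
        # Illustrative flag: when False, the config is left out of the payload.
        self.serialize_conf = serialize_conf

    def to_bytes(self):
        payload = {"path": self.path}
        if self.serialize_conf:
            payload["conf"] = self.conf
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def from_bytes(cls, data, executor_conf):
        payload = json.loads(data.decode("utf-8"))
        conf = payload.get("conf")
        if conf is None:
            # The config was not serialized, so the deserializing side
            # fills it in from the values it already holds.
            conf = executor_conf
        return cls(payload["path"], conf)


if __name__ == "__main__":
    conf = make_conf()
    full = SplitSketch("data/file.parquet", conf, serialize_conf=True)
    slim = SplitSketch("data/file.parquet", conf, serialize_conf=False)
    print("with config:   ", len(full.to_bytes()), "bytes")
    print("without config:", len(slim.to_bytes()), "bytes")
```

The size ratio printed by this sketch depends entirely on the made-up config and says nothing about the 15-20x figure measured above. The approach is only safe when the deserializing side really does hold an equivalent configuration, which is why it is proposed as an opt-in setting.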
