HotSushi opened a new pull request #1564: URL: https://github.com/apache/iceberg/pull/1564
When Hive spawns MapReduce jobs, it needs to initialize `HiveIcebergSerDe` on the mappers. When `HiveIcebergSerDe` is initialized, [a catalog instance](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L41) is created and queried for the table schema. This is not good behavior: there can be thousands of mappers, and together they can overload the catalog, for example a Hive metastore.

I came across this behavior by accident while running on a YARN cluster where the mappers didn't have access to the HiveMetaStore classes ([see the error](https://gist.github.com/HotSushi/fe86bdfe576138aa53d1b6cf4b12a24b)). It is not reproducible in the HiveRunner unit tests because their classpath already contains the HiveMetaStore classes.

With this PR, the jobconf is checked first for a serialized schema; if one is present, it is used instead of querying the catalog. Note that we can't get rid of the catalog call completely, because `HiveIcebergSerDe` is also initialized at query analysis time, when the jobconf is not set (see the stacktrace we get if the call is removed [here](https://gist.github.com/HotSushi/33d8c7bdd59e7dbda202c3172a0be186)).

I didn't write any tests with this PR because I'm not sure how classpath changes can be tested in HiveRunner, but any ideas are welcome.
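For illustration, the lookup order described above (jobconf first, catalog only as a fallback) could look roughly like the sketch below. This is not the code from the PR: a plain `Map` stands in for the Hadoop job configuration, a `Supplier` stands in for the catalog query, and the names `SchemaLookupSketch`, `resolveSchema`, and `SCHEMA_KEY` (including the key string) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class SchemaLookupSketch {
  // Hypothetical configuration key; the PR description does not name the real one.
  static final String SCHEMA_KEY = "example.iceberg.serialized.schema";

  private SchemaLookupSketch() {
  }

  // Returns the serialized schema from the job configuration when it is present,
  // and only falls back to the (potentially expensive) catalog lookup otherwise.
  static String resolveSchema(Map<String, String> jobConf, Supplier<String> catalogLookup) {
    String serialized = jobConf == null ? null : jobConf.get(SCHEMA_KEY);
    if (serialized != null && !serialized.isEmpty()) {
      // Mapper side: the schema travelled with the job, so the catalog is not contacted.
      return serialized;
    }
    // Query analysis time: nothing is set in the configuration yet, so ask the catalog.
    return catalogLookup.get();
  }

  public static void main(String[] args) {
    // Stand-in for a catalog query; a real lookup would load the table and read its schema.
    Supplier<String> catalog = () -> "schema-from-catalog";

    Map<String, String> conf = new HashMap<>();
    System.out.println(resolveSchema(conf, catalog)); // configuration empty: falls back to the catalog

    conf.put(SCHEMA_KEY, "schema-from-jobconf");
    System.out.println(resolveSchema(conf, catalog)); // configuration set: catalog is skipped
  }
}
```

Passing the catalog lookup as a `Supplier` keeps it lazy, so on the mapper path no catalog client is created at all, which is the property that matters when the HiveMetaStore classes are missing from the mapper classpath.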
