HotSushi opened a new pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564


   When Hive spawns MapReduce jobs, it needs to initialize `HiveIcebergSerDe` 
on the mappers. When `HiveIcebergSerDe` is initialized, [a catalog 
instance](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L41)
  is created and queried for the table schema. This is undesirable because 
there can be thousands of mappers, and together they can overload the catalog, 
for example a Hive metastore. 
   
   I came across this behavior by accident while trying to run this on a YARN 
cluster, where the mappers didn't have access to the HiveMetaStore classes ([see the 
error](https://gist.github.com/HotSushi/fe86bdfe576138aa53d1b6cf4b12a24b)). This 
is not reproducible in the HiveRunner unit tests because the classpath there 
already contains the HiveMetaStore classes.
   
   With this PR, the jobconf is checked first for a serialized schema. If one 
is present, it is used instead of querying the catalog. 
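
   The lookup order described above can be sketched roughly as follows. This is 
an illustration only, not the code in this PR: `SchemaResolver`, 
`resolveSchemaJson`, and the `example.iceberg.table.schema` key are hypothetical 
names, a plain `Map` stands in for the jobconf, and a `Supplier` stands in for 
the catalog query.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of Iceberg or of this PR.
final class SchemaResolver {
  // Assumed property name; the real key used by the PR may differ.
  static final String SCHEMA_KEY = "example.iceberg.table.schema";

  private SchemaResolver() {
  }

  // Returns the serialized schema from the job configuration when present,
  // and only falls back to the (potentially expensive) catalog lookup otherwise.
  static String resolveSchemaJson(Map<String, String> jobConf, Supplier<String> catalogLookup) {
    String serialized = jobConf.get(SCHEMA_KEY);
    if (serialized != null && !serialized.isEmpty()) {
      // Mapper path: the schema was already serialized into the configuration,
      // so the catalog is never contacted.
      return serialized;
    }
    // Path where the configuration carries no schema: query the catalog.
    return catalogLookup.get();
  }
}
```

   Because the catalog lookup is passed in lazily, it is not evaluated at all 
when the configuration already carries the schema, which is the property that 
matters when thousands of mappers initialize the SerDe.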
   
   Note that we can't get rid of the catalog call completely, because 
`HiveIcebergSerDe` is also initialized at query analysis time, when the jobconf 
is not set (see the stack trace produced when the call is removed 
[here](https://gist.github.com/HotSushi/33d8c7bdd59e7dbda202c3172a0be186)). 
   
   I didn't write any tests for this PR because I'm not sure how classpath 
changes can be tested in HiveRunner, but any ideas are welcome. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


