[GitHub] [iceberg] HotSushi opened a new pull request #1564: Hive: Don't use catalog to initialize serde on mappers

GitBox Thu, 08 Oct 2020 14:54:24 -0700


HotSushi opened a new pull request #1564:
URL: https://github.com/apache/iceberg/pull/1564

When Hive spawns MapReduce jobs, it needs to initialize `HiveIcebergSerDe`
on the mappers. When `HiveIcebergSerDe` is initialized, [a catalog
instance](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java#L41)
is created and queried for the table schema. This is undesirable: a job can
have thousands of mappers, and having each one query the catalog can overload
it (for example, a Hive metastore).

I came across this behavior by accident while running on a YARN cluster,
where the mappers didn't have access to the HiveMetaStore classes ([see the
error](https://gist.github.com/HotSushi/fe86bdfe576138aa53d1b6cf4b12a24b)).
This is not reproducible in HiveRunner unit tests because the classpath there
already contains the HiveMetaStore classes.

With this PR, the jobconf is checked first for an already-serialized schema.
If one is present, it is used instead of querying the catalog.
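
The lookup order described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the class `SchemaResolver`, the property key `example.table.schema`, and the use of `java.util.Properties` and a `Supplier` in place of the real jobconf and catalog are all assumptions made for the example.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not part of Iceberg.
final class SchemaResolver {
  // Assumed property key; the key actually used by the PR may differ.
  static final String SCHEMA_KEY = "example.table.schema";

  private SchemaResolver() {
  }

  // Returns the serialized schema from the configuration when it is present,
  // and only falls back to the (potentially expensive) catalog lookup otherwise.
  static String resolveSchema(Properties conf, Supplier<String> catalogLookup) {
    String serialized = conf.getProperty(SCHEMA_KEY);
    if (serialized != null && !serialized.isEmpty()) {
      return serialized;
    }
    return catalogLookup.get();
  }

  public static void main(String[] args) {
    AtomicInteger catalogCalls = new AtomicInteger();
    // Stand-in for a catalog query; counts how often it is invoked.
    Supplier<String> catalog = () -> {
      catalogCalls.incrementAndGet();
      return "schema-from-catalog";
    };

    // Query analysis time: nothing in the configuration, so the catalog is used.
    Properties analysisConf = new Properties();
    System.out.println(resolveSchema(analysisConf, catalog));

    // Mapper side: the schema was serialized into the configuration up front.
    Properties mapperConf = new Properties();
    mapperConf.setProperty(SCHEMA_KEY, "schema-from-jobconf");
    System.out.println(resolveSchema(mapperConf, catalog));

    System.out.println("catalog calls: " + catalogCalls.get());
  }
}
```

In this sketch the catalog supplier runs only for the configuration that lacks the key, which is the property that keeps mappers from contacting the catalog.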

Note that we can't completely get rid of the catalog call, because
`HiveIcebergSerDe` is also initialized at query analysis time, when the jobconf
is not yet set
([here](https://gist.github.com/HotSushi/33d8c7bdd59e7dbda202c3172a0be186) is
the stack trace we get if the call is removed).

I didn't write any tests for this PR because I'm not sure how classpath
changes can be tested in HiveRunner, but any ideas are welcome.
