For those of you not familiar with AWS Glue
Catalog<https://aws.amazon.com/glue/>, it’s a Hive Metastore implemented as a
web service. The Glue service is composed of different components, but the one
I’m interested in is the Catalog. Today, there’s a Hive metastore
implementation and you can plug the catalog into Spark as instructed
Basically, the Hive metastore Java class is swapped with an implementation that
calls into Glue’s web service.
I don’t like this implementation because:
* It puts Hive as a middleman between Spark and Glue
* It prevents Glue-specific implementations
As an example of the second issue, the Hive version embedded in Spark today
does not support partition pruning for columns of fractional or timestamp
types. I have a pull request to fix
this<https://github.com/apache/spark/pull/20100>, but as rxin correctly pointed
out, I would have to fake a new Hive version called Glue or something and put
the fix under the Hive shim for it.
I have locally implemented a version of
on top of Glue and would like to productionize it and submit it as a pull
request. You can set the spark.sql.catalogImplementation config to “glue” and
then it will use Glue instead of either the in-memory catalog or Hive.
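To make the selection mechanism concrete, here is a minimal sketch of config-driven catalog dispatch. It is an illustration only, not Spark's actual code and not the implementation proposed here. It assumes the key is Spark's existing spark.sql.catalogImplementation setting, which accepts "in-memory" and "hive" today. The "glue" value and the GlueExternalCatalog class name are hypothetical placeholders.

```python
# Illustrative sketch: map a catalog-implementation setting to a class name.
# The two existing entries are real Spark classes; the "glue" entry is a
# HYPOTHETICAL placeholder for the proposed implementation.
CATALOG_CONF_KEY = "spark.sql.catalogImplementation"

CATALOG_CLASSES = {
    "in-memory": "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog",
    "hive": "org.apache.spark.sql.hive.HiveExternalCatalog",
    "glue": "org.apache.spark.sql.glue.GlueExternalCatalog",  # hypothetical
}


def external_catalog_class_name(conf):
    """Return the catalog class name selected by the given config dict."""
    # Fall back to the in-memory catalog when the key is not set.
    impl = conf.get(CATALOG_CONF_KEY, "in-memory")
    if impl not in CATALOG_CLASSES:
        raise ValueError(
            "Unknown catalog implementation %r; expected one of %s"
            % (impl, sorted(CATALOG_CLASSES))
        )
    return CATALOG_CLASSES[impl]


if __name__ == "__main__":
    print(external_catalog_class_name({CATALOG_CONF_KEY: "glue"}))
```

The point of the sketch is that a third value would sit next to the two existing ones, so a Glue-backed catalog would be chosen directly rather than reached through the Hive client.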
Rudimentary tests are promising and I can hook up Parquet tables directly
without going through Hive at all. I really need this because I need to fix a
data consistency issue with the InsertIntoHiveTable command when data is backed
by S3.
The biggest challenge is that I had to upgrade the AWS SDK to a newer version
so that it includes the Glue client since Glue is a new service. So far, I
haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve
made sure the version is in sync with the Kinesis client used by
Are there any objections to this? Any guidance around upgrading the AWS client?
Who would be a good person to review this pull request?