A new external catalog

Tayyebi, Ameen Tue, 13 Feb 2018 11:50:56 -0800

Hello everyone,

For those of you not familiar with AWS Glue 
Catalog<https://aws.amazon.com/glue/>, it’s a Hive Metastore implemented as a 
web service. The Glue service is composed of different components, but the one 
I’m interested in is the Catalog. Today, there’s a Hive metastore 
implementation, and you can plug the catalog into Spark as described 
here<https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html>. 
Basically, the Hive metastore Java class is swapped with an implementation that 
calls into Glue’s web service.

I don’t like this implementation because:

* It puts Hive as a middle-man between Spark and Glue
* It prevents Glue-specific implementations

As an example of the second issue, the Hive version embedded in Spark today
does not support partition pruning for column types that are fractionals or
timestamps. I have a pull request to fix
this<https://github.com/apache/spark/pull/20100>, but as rxin correctly pointed
out, I have to fake a new Hive version called Glue or something and put this
under the Hive shim for it.
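
To make the pruning limitation concrete, here is a toy model in Python. It is
not Spark's Hive shim code: the Predicate class, the PUSHABLE_TYPES set, and
the filter syntax are all assumptions made for illustration. It only shows the
general shape of the problem: predicates on unsupported column types are
dropped from the metastore filter, so the metastore cannot prune on them.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the column types this toy metastore can filter on server-side.
PUSHABLE_TYPES = {"int", "bigint", "string"}


@dataclass(frozen=True)
class Predicate:
    """A simple comparison on a partition column, e.g. year = 2018."""
    column: str
    column_type: str
    op: str
    value: object


def to_metastore_filter(pred: Predicate) -> Optional[str]:
    """Render one predicate as filter text, or None if its type is unsupported."""
    if pred.column_type not in PUSHABLE_TYPES:
        return None
    literal = f'"{pred.value}"' if pred.column_type == "string" else str(pred.value)
    return f"{pred.column} {pred.op} {literal}"


def build_filter(preds: List[Predicate]) -> str:
    """Join the pushable predicates; an empty result means 'list every partition'."""
    rendered = [to_metastore_filter(p) for p in preds]
    return " and ".join(f for f in rendered if f is not None)


preds = [
    Predicate("year", "int", "=", 2018),
    Predicate("ts", "timestamp", ">", "2018-02-13 00:00:00"),
]
print(build_filter(preds))  # prints: year = 2018
```

In this sketch the timestamp predicate never reaches the metastore, so it would
have to be applied after the partitions are listed.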

I have locally implemented a version of
ExternalCatalog<https://github.com/apache/spark/blob/2fd12af4372a1e2c3faf0eb5d0a1cf530abc0016/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala>
on top of Glue and would like to productionize it and submit it as a pull
request. You can set the spark.sql.catalogImplementation config to “glue” and
then it will use Glue instead of either the in-memory catalog or Hive.
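
The selection mechanism can be sketched as a lookup from a config value to a
catalog class. This is a minimal Python model, not Spark code. Spark's existing
key for this choice is spark.sql.catalogImplementation, which accepts
"in-memory" and "hive"; the "glue" value and the GlueCatalog class below are
hypothetical and stand in for the proposed implementation.

```python
from typing import Callable, Dict, Mapping

CATALOG_KEY = "spark.sql.catalogImplementation"


class InMemoryCatalog:
    """Stand-in for the in-memory catalog."""


class HiveCatalog:
    """Stand-in for the Hive-backed catalog."""


class GlueCatalog:
    """Hypothetical: stand-in for the proposed Glue-backed catalog."""


# Maps each accepted config value to a factory for the matching catalog.
CATALOG_FACTORIES: Dict[str, Callable[[], object]] = {
    "in-memory": InMemoryCatalog,
    "hive": HiveCatalog,
    "glue": GlueCatalog,  # assumption: the new value proposed in this thread
}


def make_catalog(conf: Mapping[str, str]) -> object:
    """Instantiate the catalog named in conf, defaulting to in-memory."""
    impl = conf.get(CATALOG_KEY, "in-memory")
    try:
        factory = CATALOG_FACTORIES[impl]
    except KeyError:
        raise ValueError(f"unknown catalog implementation: {impl!r}") from None
    return factory()


catalog = make_catalog({CATALOG_KEY: "glue"})
print(type(catalog).__name__)  # prints: GlueCatalog
```

With this shape, adding a catalog means registering one more entry, and Hive is
no longer on the path between Spark and Glue.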

Rudimentary tests are promising, and I can hook up Parquet tables directly
without going through Hive at all. I really need this because I need to fix a
data consistency issue with the InsertIntoHiveTable command when the data is
backed by S3, but that is a different topic.

The biggest challenge is that I had to upgrade the AWS SDK to a newer version
so that it includes the Glue client since Glue is a new service. So far, I
haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve
made sure the version is in sync with the Kinesis client used by the
spark-streaming module.

Are there any objections to this? Any guidance around upgrading the AWS client?
Who would be a good person to review this pull request?

Thanks,
-Ameen
