Hello everyone,

For those of you not familiar with AWS Glue 
Catalog<https://aws.amazon.com/glue/>, it’s a Hive Metastore implemented as a 
web service. The Glue service is composed of different components, but the one 
I’m interested in is the Catalog. Today, there’s a Hive metastore 
implementation, and you can plug the catalog into Spark as instructed. 
Basically, the Hive metastore Java class is swapped with an implementation that 
calls into Glue’s web service.

I don’t like this implementation because:

  *   It puts Hive as a middleman between Spark and Glue
  *   It prevents Glue-specific implementations

As an example of the second issue, the Hive version embedded in Spark today 
does not support partition pruning for column types that are fractionals or 
timestamps. I have a pull request to fix 
this<https://github.com/apache/spark/pull/20100>, but as rxin correctly pointed 
out, I have to fake a new Hive version called Glue or something and put this 
under the Hive shim for it.
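
As a rough illustration of this limitation, the Python sketch below shows a filter converter that only pushes down predicates on string and integral partition columns. It is not Spark's actual shim code: the function name, the set of supported types, and the filter syntax are assumptions made for the example. A predicate on a double or timestamp column is simply dropped, so the metastore cannot prune on it and the caller has to filter those partitions itself.

```python
# Hypothetical sketch: which partition predicates can be pushed to the metastore.
# The supported-type set and the filter syntax are assumptions for illustration.
PUSHABLE_TYPES = {"string", "tinyint", "smallint", "int", "bigint"}


def to_metastore_filter(predicates, partition_types):
    """Build a filter string from (column, op, value) predicates.

    Predicates on columns whose type is not pushable are skipped, which
    means the caller has to prune those partitions on its own.
    """
    parts = []
    for column, op, value in predicates:
        col_type = partition_types.get(column)
        if col_type not in PUSHABLE_TYPES:
            continue  # e.g. double or timestamp: no pruning in the metastore
        literal = f'"{value}"' if col_type == "string" else str(value)
        parts.append(f"{column} {op} {literal}")
    return " and ".join(parts)


types = {"country": "string", "year": "int", "price": "double", "ts": "timestamp"}
preds = [("country", "=", "US"), ("year", ">=", 2017), ("price", ">", 9.5)]
pushed = to_metastore_filter(preds, types)
```

Here the predicate on price never reaches the metastore. A catalog that talks to Glue directly could decide for itself which column types to push down.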

I have locally implemented a version of 
 on top of Glue and would like to productionize it and submit it as a pull 
request. Setting the spark.sql.catalogImplementation config to “glue” makes Spark 
use Glue instead of either the in-memory catalog or Hive.
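
The selection described above could be sketched roughly as follows. This is only an illustration in Python with hypothetical class and function names, not Spark's real API and not the local implementation mentioned in this email. It shows a config value picking one of three catalog implementations, with "glue" sitting next to the existing choices rather than behind the Hive one.

```python
# Hypothetical sketch of choosing a catalog from a config value.
# All class and function names here are illustrative assumptions.
class InMemoryCatalog:
    name = "in-memory"


class HiveCatalog:
    name = "hive"


class GlueCatalog:
    # Would call the Glue web service directly, with no Hive in between.
    name = "glue"


CATALOGS = {c.name: c for c in (InMemoryCatalog, HiveCatalog, GlueCatalog)}


def create_catalog(impl="in-memory"):
    """Instantiate the catalog registered under the given config value."""
    if impl not in CATALOGS:
        raise ValueError(f"Unknown catalog implementation: {impl}")
    return CATALOGS[impl]()


catalog = create_catalog("glue")
```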

Rudimentary tests are promising and I can hook up Parquet tables directly 
without going through Hive at all. I really need this because I need to fix a data 
consistency issue with the InsertIntoHiveTable command when the data is backed by S3, 
but that is a different topic.

The biggest challenge is that I had to upgrade the AWS SDK to a newer version 
so that it includes the Glue client since Glue is a new service. So far, I 
haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve 
made sure the version is in sync with the Kinesis client used by the 
spark-streaming module.

Are there any objections to this? Any guidance around upgrading the AWS client? 
Who would be a good person to review this pull request?

