Hudi on EMR syncing GLUE catalog issue

Igor Basko Tue, 18 Feb 2020 02:29:13 -0800

Dear list,
I'm trying to register Hudi tables in the GLUE catalog using the Hive sync
tool, invoked through the Spark save function (not the standalone version).


I've created an EMR cluster with only the Spark application (without Hive),
and added the following Hive metastore client factory class configuration:
"hive.metastore.client.factory.class":
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"

I started spark-shell first with the Hudi jars provided by EMR and then
with the 0.5.1 version. Both gave me the "Cannot create hive
connection ..." error when running the following code:
https://gist.github.com/igorbasko01/05d81fef8f39e305527fd24b946fdb9a

Looking at buildSyncConfig in HoodieSparkSqlWriter.scala, it seems that
there is no way to override the HiveSyncConfig.useJdbc variable to be
false:
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L232

This means that the HoodieHiveClient constructor will always try to
createHiveConnection() instead of creating a hive client from the
configuration:
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L111
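
The flow described above, and the kind of override it lacks, can be
modelled in a few lines. This is only a self-contained toy sketch, not Hudi
code: SyncFlowSketch, SyncConfig, connectionPath and the option key
"sketch.hive_sync.use_jdbc" are made-up stand-ins, and the two-way branch
is a simplification of what the constructor does.

```java
import java.util.HashMap;
import java.util.Map;

public class SyncFlowSketch {
    // Stand-in for HiveSyncConfig: the flag defaults to true.
    static class SyncConfig {
        boolean useJdbc = true;
    }

    // Hypothetical option key, used only in this sketch.
    static final String USE_JDBC_KEY = "sketch.hive_sync.use_jdbc";

    // Stand-in for buildSyncConfig. Without the "if" below, nothing the
    // caller passes in the options can ever change useJdbc.
    static SyncConfig buildSyncConfig(Map<String, String> options) {
        SyncConfig cfg = new SyncConfig();
        if (options.containsKey(USE_JDBC_KEY)) {
            cfg.useJdbc = Boolean.parseBoolean(options.get(USE_JDBC_KEY));
        }
        return cfg;
    }

    // Stand-in for the branch taken in the client constructor.
    static String connectionPath(SyncConfig cfg) {
        return cfg.useJdbc ? "jdbc" : "metastore-client";
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        // No override given: the default sends us down the JDBC path.
        System.out.println(connectionPath(buildSyncConfig(options)));
        // Override given: the metastore client path is taken instead.
        options.put(USE_JDBC_KEY, "false");
        System.out.println(connectionPath(buildSyncConfig(options)));
    }
}
```

With the override absent the sketch prints "jdbc"; with it set to "false"
it prints "metastore-client".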

The next thing I did was to add a parameter that allows overriding the
useJdbc variable.
I used the custom Hudi jar on the EMR cluster and was able to progress
further, but got a different error down the line.
I was happy to see that it was apparently using the AWSGlueClientFactory:
20/02/17 13:55:17 INFO AWSGlueClientFactory: Using region from ec2 metadata : eu-west-1

It was also able to detect that the table doesn't exist in GLUE:
20/02/17 13:55:18 INFO HiveSyncTool: Hive table drivers is not found. Creating it

But then I got the following exception:
java.lang.NoClassDefFoundError: org/json/JSONException
  at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
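
A NoClassDefFoundError like this usually means the class could not be
loaded at run time. A quick way to check whether org.json.JSONException is
visible on a given classpath is a probe along these lines. It is a minimal
sketch: ClasspathCheck and isOnClasspath are names made up here, and it
only inspects the class loader that loaded the sketch itself, which may
differ from the one Hive uses.

```java
public class ClasspathCheck {
    // Returns true if the named class can be located by this class's
    // loader. The class is not initialized, so no static code runs.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the exception above.
        String name = "org.json.JSONException";
        System.out.println(name + " loadable: " + isOnClasspath(name));
    }
}
```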

A partial log can be found here:
https://gist.github.com/igorbasko01/612f773632cb8382014166e0ed2a06d3

As far as I can tell, when checking whether a table exists, HoodieHiveClient
uses its client variable, which is of the interface type IMetaStoreClient
that AWSCatalogMetastoreClient implements, and that works fine.

https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L469

https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-spark-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java

But createTable in HoodieHiveClient eventually creates a hive.ql.Driver
instead of using the AWS client, and that is what ends in the exception.

So what I would like to know is: am I doing something wrong when trying to
sync to GLUE?
Or does Hudi currently not support updating the GLUE catalog without some
code changes?

Best Regards,
Igor
