Re: Hudi on EMR syncing GLUE catalog issue

Igor Basko Wed, 19 Feb 2020 12:01:32 -0800

Thanks a lot for the suggestion, will try it out.
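
For reference, a minimal sketch of what the suggested setup could look like, assuming the standard EMR configuration classifications `hive-site` and `spark-hive-site` and the factory class quoted below. The helper name `glue_catalog_configurations` is made up for illustration; the release label, instance settings, and the rest of the cluster definition are left out.

```python
import json

# Factory class quoted in the original question below.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Build EMR configuration objects that point both Hive and Spark
    at the Glue Data Catalog (hypothetical helper, for illustration)."""
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        # Hive itself (needed because Hudi talks to the Hive server).
        {"Classification": "hive-site", "Properties": dict(props)},
        # Spark SQL's Hive integration.
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]


if __name__ == "__main__":
    # Print JSON that can be saved to a file and supplied as the
    # cluster's software configuration.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

The printed JSON would typically be supplied when creating the cluster (for example through the `--configurations` option of `aws emr create-cluster`), with both Spark and Hive selected as applications.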

On Wed, 19 Feb 2020 at 00:36, Mehrotra, Udit <[email protected]>
wrote:


> Hi Igor,
>
> As of the current implementation, Hudi submits queries such as creating
> tables and syncing partitions directly to the Hive server instead of
> communicating with the metastore directly. So when launching the EMR
> cluster, you should install Hive on the cluster as well. Also enable the
> Glue catalog for both Spark and Hive, and you should be fine.
>
> Thanks,
> Udit Mehrotra
> AWS | EMR
>
> On 2/18/20, 2:29 AM, "Igor Basko" <[email protected]> wrote:
>
>     Hi Dear List,
>     I'm trying to catalog Hudi files in the Glue catalog using the Hive sync
>     tool, while using the Spark save function (and not the standalone version).
>
>     I've created an EMR cluster with the Spark application only (without
>     Hive). I also added the following Hive metastore client factory class
>     configuration:
>     "hive.metastore.client.factory.class":
>     "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
>
>     I started the spark-shell using the Hudi jars provided by EMR, and also
>     using the 0.5.1 version, and both gave me the "Cannot create hive
>     connection ..." error when running the following code:
>     https://gist.github.com/igorbasko01/05d81fef8f39e305527fd24b946fdb9a
>
>     After looking at buildSyncConfig in HoodieSparkSqlWriter.scala, it seems
>     that there is no way to override the HiveSyncConfig.useJdbc variable to
>     be false
>     (https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L232),
>     which means that the HoodieHiveClient constructor will always try to
>     call createHiveConnection()
>     (https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L111)
>     instead of creating a Hive client from the configuration.
>
>     The next thing I did was to add a parameter that would enable overriding
>     the useJdbc variable.
>     I used the custom Hudi jar on the EMR cluster and was able to progress
>     further, but got a different error down the line.
>     I was happy to see that it was apparently using the
>     AWSGlueClientFactory:
>     20/02/17 13:55:17 INFO AWSGlueClientFactory: Using region from ec2 metadata : eu-west-1
>
>     It was also able to detect that the table doesn't exist in Glue:
>     20/02/17 13:55:18 INFO HiveSyncTool: Hive table drivers is not found. Creating it
>
>     But I got the following exception:
>     java.lang.NoClassDefFoundError: org/json/JSONException
>       at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
>
>     A partial log can be found here:
>     https://gist.github.com/igorbasko01/612f773632cb8382014166e0ed2a06d3
>
>     As far as I can tell, when checking whether a table exists, the
>     HoodieHiveClient uses the client variable, which is typed as the
>     IMetaStoreClient interface that AWSCatalogMetastoreClient implements,
>     and it works fine.
>
>     https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L469
>
>     https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-spark-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java
>
>     But createTable in HoodieHiveClient eventually creates a
>     hive.ql.Driver and does not use the AWS client, which eventually results
>     in an exception.
>
>     So what I would like to know is: am I doing something wrong when trying
>     to sync to Glue?
>     Or maybe Hudi currently doesn't support updating the Glue catalog
>     without some code changes?
>
>     Best Regards,
>     Igor
>
>
>
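
For reference, the Hive sync settings discussed in this thread could be assembled roughly as follows. This is only a sketch under stated assumptions: the option keys are the `hoodie.datasource.hive_sync.*` datasource options, the helper name `hudi_hive_sync_options` is hypothetical, the table name `drivers` is taken from the log line above, and the database and partition field are placeholders. The JDBC URL assumes a HiveServer2 listening on the node where the Spark driver runs, which is why Hive has to be installed on the cluster.

```python
def hudi_hive_sync_options(table, database="default", partition_field=None,
                           jdbc_url="jdbc:hive2://localhost:10000"):
    """Return Hudi datasource options that enable Hive sync
    (hypothetical helper; values are placeholders, not a tested setup)."""
    opts = {
        "hoodie.table.name": table,
        # Turn on syncing of the table definition after each write.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        # Hudi connects to the Hive server over JDBC, not to the metastore.
        "hoodie.datasource.hive_sync.jdbcurl": jdbc_url,
    }
    if partition_field:
        opts["hoodie.datasource.hive_sync.partition_fields"] = partition_field
    return opts


if __name__ == "__main__":
    # "drivers" is the table name from the log; "city" is a made-up column.
    for key, value in sorted(
        hudi_hive_sync_options("drivers", partition_field="city").items()
    ):
        print(f"{key}={value}")
```

These options would be passed to the Spark writer together with the usual Hudi write options (record key, precombine field, and so on), which are omitted here.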
