Thanks a lot for the suggestion, will try it out.

On Wed, 19 Feb 2020 at 00:36, Mehrotra, Udit <[email protected]> wrote:
> Hi Igor,
>
> As of current implementation, Hudi submits queries like creating table,
> syncing partitions etc directly to the hive server instead of directly
> communicating with the metastore. Thus while launching the EMR cluster, you
> should install Hive on the cluster as well. Also enable glue catalog for
> both spark and hive and you should be fine.
>
> Thanks,
> Udit Mehrotra
> AWS | EMR
>
> On 2/18/20, 2:29 AM, "Igor Basko" <[email protected]> wrote:
>
>     Hi Dear List,
>     I'm trying to catalog Hudi files in GLUE catalog using the sync hive tool,
>     while using the spark save function (and not the standalone version).
>
>     I've created an EMR with Spark application only (without Hive). Also added
>     the following hive metastore client factory class configuration:
>     "hive.metastore.client.factory.class":
>     "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
>
>     I've started the spark-shell using the provided by EMR hudi jars, and also
>     using the 0.5.1 version and they both gave me the "Cannot create hive
>     connection ..." error when running the following code
>     (https://gist.github.com/igorbasko01/05d81fef8f39e305527fd24b946fdb9a).
>
>     After looking inside HoodieSparkSqlWriter.scala in buildSyncConfig it seems
>     that there is no way to override the HiveSyncConfig.useJdbc variable to be
>     false,
>     (https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L232)
>     which means that in HoodieHiveClient constructor it will always try to
>     createHiveConnection()
>     (https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L111)
>     Instead of creating a hive client from the configuration.
>
>     The next thing I did was to add a parameter that would enable overriding
>     the useJdbc variable.
>     Used the custom hudi jar in the EMR, and was able to progress further. But
>     got a different error down the line.
>     What I was happy to see that apparently it was using the
>     AWSGlueClientFactory:
>     20/02/17 13:55:17 INFO AWSGlueClientFactory: Using region from ec2 metadata : eu-west-1
>
>     And was able to detect that the table doesn't exists in GLUE:
>     20/02/17 13:55:18 INFO HiveSyncTool: Hive table drivers is not found. Creating it
>
>     But I got the following exception:
>     java.lang.NoClassDefFoundError: org/json/JSONException
>         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
>
>     A partial log could be found here
>     (https://gist.github.com/igorbasko01/612f773632cb8382014166e0ed2a06d3)
>
>     As it seems to me, in the case of checking if a table exists, the
>     HoodieHiveClient uses the client variable which is an interface
>     IMetaStoreClient, that the AWSCatalogMetastoreClient implements.
>     And it works fine.
>
>     https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L469
>
>     https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-spark-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java
>
>     But the createTable of HoodieHiveClient, eventually creates a
>     hive.ql.Driver and not uses the AWS client, which eventually gets an
>     exception.
>
>     So what I would like to know, is am I doing it wrong when trying to sync to
>     GLUE?
>     Or maybe currently Hudi doesn't support updating GLUE catalog without some
>     code changes?
>
>     Best Regards,
>     Igor
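
For reference, a minimal sketch of the cluster setup suggested in the quoted reply (Hive installed alongside Spark, Glue catalog enabled for both), written as Python that prints the EMR configuration JSON. The factory class is the one quoted in the thread, and `hive-site` and `spark-hive-site` are the EMR classifications for Hive and Spark. The function names are hypothetical and only for illustration, and the `Name` entries assume the cluster is created through an API or CLI that accepts an application list in that shape.

```python
import json

# Factory class quoted in the thread; it points a Hive metastore client
# at the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Build EMR configuration entries that enable Glue for Hive and Spark.

    Hypothetical helper: one entry per classification, each carrying the
    same metastore client factory property.
    """
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]


def cluster_applications():
    """Applications to install: Hive is included because Hudi's sync
    submits its queries to the Hive server (per the reply above)."""
    return [{"Name": "Spark"}, {"Name": "Hive"}]


if __name__ == "__main__":
    print(json.dumps(glue_catalog_configurations(), indent=2))
    print(json.dumps(cluster_applications(), indent=2))
```

The printed configuration list could be supplied when launching the cluster. This has not been verified against Hudi 0.5.1 here, so treat it as a starting point and not a confirmed fix.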
