l-jhon opened a new issue, #5254: URL: https://github.com/apache/hudi/issues/5254
**Describe the problem you faced**

We use Hudi DeltaStreamer in our data ingestion pipeline. Since upgrading from 0.7.0 to 0.10.0, we have had a problem syncing Hudi with the Glue metastore. Stranger still, when we submit the spark-submit job with `--deploy-mode cluster`, the table is not created in the Glue metastore, but with `--deploy-mode client` it is created successfully. The problem with client deploy mode is that, although the table is created in the Glue metastore and on S3, the job keeps running forever.

**To Reproduce**

Steps to reproduce the behavior:

1. Execute spark-submit with Hudi DeltaStreamer

```
spark-submit \
  --deploy-mode cluster \
  --jars s3://bucket/jars/hudi-spark-bundle_2.12-0.10.0.jar,s3://bucket/jars/spark-avro_2.12-3.0.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer s3://bucket/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
  --op BULK_INSERT \
  --filter-dupes \
  --checkpoint 0 \
  --source-ordering-field updated_at \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --table-type COPY_ON_WRITE \
  --target-base-path s3://bucket/silver_layer/table_name/ \
  --target-table datalake_silver.table_name \
  --enable-sync \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.precombine.field=updated_at \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date_partition \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
  --hoodie-conf hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
  --hoodie-conf hoodie.combine.before.insert=True \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/bronze_layer/table_name/ \
  --hoodie-conf hoodie.datasource.hive_sync.enable=True \
  --hoodie-conf hoodie.datasource.hive_sync.database=datalake_silver \
  --hoodie-conf hoodie.datasource.hive_sync.table=table_name \
  --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=True \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields=date_partition \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.datasource.hive_sync.support_timestamp=True \
  --hoodie-conf hoodie.consistency.check.enabled=True \
  --hoodie-conf hoodie.upsert.shuffle.parallelism=10 \
  --hoodie-conf hoodie.insert.shuffle.parallelism=10 \
  --hoodie-conf hoodie.bulkinsert.shuffle.parallelism=10 \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.archive.automatic=True \
  --hoodie-conf hoodie.archive.merge.enable=True \
  --hoodie-conf hoodie.cleaner.commits.retained=2 \
  --hoodie-conf hoodie.clean.automatic=True \
  --hoodie-conf hoodie.clean.async=True \
  --hoodie-conf hoodie.parquet.max.file.size=1073741824 \
  --hoodie-conf hoodie.parquet.small.file.limit=0 \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf hoodie.copyonwrite.insert.auto.split=True \
  --hoodie-conf hoodie.clustering.async.enabled=True \
  --hoodie-conf hoodie.clustering.async.max.commits=4 \
  --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824 \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=629145600 \
  --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=date_partition \
  --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id, code, cast(created_at as timestamp) as created_at,cast(updated_at as timestamp) as updated_at,date_partition from <SRC>"
```

**Expected behavior**

The job runs successfully and the table is created in the Glue metastore and on S3.
Hudi version 0.10.0 should be able to sync tables with Glue: both new tables and incremental tables that need new partitions synced.

**Environment Description**

* Hudi version : 0.10.0
* Spark version : 3.1.1
* Hive version : 3.1.2
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

We are using AWS EMR version 6.3.0

**Stacktrace**

Unfortunately we have no errors to show, because the job ends normally. The only problem is that with `--deploy-mode cluster` the table is not created in Glue, while with `--deploy-mode client` the table is created in Glue but the job never finishes.
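
Since the job reports no error in either mode, one way to confirm whether a given run actually registered the table is to query the Glue Data Catalog directly after it finishes. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the same account and region as the EMR cluster. The function names `glue_table_exists` and `report_glue_table` are hypothetical helpers introduced here for illustration; they are not part of Hudi or the AWS CLI.

```shell
#!/usr/bin/env bash
# Sketch: check whether a table is registered in the Glue Data Catalog.
# Assumption: the AWS CLI is on PATH and has credentials for the right account/region.

# Hypothetical helper: returns 0 if `aws glue get-table` succeeds, non-zero otherwise.
glue_table_exists() {
  local database="$1" table="$2"
  aws glue get-table --database-name "$database" --name "$table" >/dev/null 2>&1
}

# Hypothetical helper: prints a one-line status for the given database and table.
report_glue_table() {
  local database="$1" table="$2"
  if glue_table_exists "$database" "$table"; then
    echo "FOUND ${database}.${table}"
  else
    echo "MISSING ${database}.${table}"
  fi
}
```

For the job above, this could be run as `report_glue_table datalake_silver table_name` once after a cluster-mode run and once after a client-mode run, to compare the two outcomes. Note that a non-zero exit from `aws glue get-table` can also mean missing permissions or a wrong region, so a `MISSING` result is only a hint, not proof that the sync was skipped.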
