l-jhon opened a new issue, #5254:
URL: https://github.com/apache/hudi/issues/5254

   **Describe the problem you faced**
   
   We are using the Hudi DeltaStreamer in our data ingestion pipeline, but we have a problem syncing Hudi with the Glue metastore, and this started after upgrading from version 0.7.0 to 0.10.0. Another strange thing is that when we submit the job with ```deploy-mode cluster```, the table is not created in the Glue metastore, but with ```deploy-mode client``` it is created successfully. The problem with ```deploy-mode client``` is that, although the table is created in the Glue metastore and in S3, the job keeps running forever.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Executing spark-submit with Hudi DeltaStreamer
   
   ```
   spark-submit \
   --deploy-mode cluster \
   --jars s3://bucket/jars/hudi-spark-bundle_2.12-0.10.0.jar,s3://bucket/jars/spark-avro_2.12-3.0.1.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer s3://bucket/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
   --op BULK_INSERT \
   --filter-dupes \
   --checkpoint 0 \
   --source-ordering-field updated_at \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --table-type COPY_ON_WRITE \
   --target-base-path s3://bucket/silver_layer/table_name/ \
   --target-table datalake_silver.table_name \
   --enable-sync \
   --hoodie-conf hoodie.datasource.write.recordkey.field=id \
   --hoodie-conf hoodie.datasource.write.precombine.field=updated_at \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=date_partition \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
   --hoodie-conf hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
   --hoodie-conf hoodie.combine.before.insert=True \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/bronze_layer/table_name/ \
   --hoodie-conf hoodie.datasource.hive_sync.enable=True \
   --hoodie-conf hoodie.datasource.hive_sync.database=datalake_silver \
   --hoodie-conf hoodie.datasource.hive_sync.table=table_name \
   --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=True \
   --hoodie-conf hoodie.datasource.hive_sync.partition_fields=date_partition \
   --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
   --hoodie-conf hoodie.datasource.hive_sync.support_timestamp=True \
   --hoodie-conf hoodie.consistency.check.enabled=True \
   --hoodie-conf hoodie.upsert.shuffle.parallelism=10 \
   --hoodie-conf hoodie.insert.shuffle.parallelism=10 \
   --hoodie-conf hoodie.bulkinsert.shuffle.parallelism=10 \
   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
   --hoodie-conf hoodie.archive.automatic=True \
   --hoodie-conf hoodie.archive.merge.enable=True \
   --hoodie-conf hoodie.cleaner.commits.retained=2 \
   --hoodie-conf hoodie.clean.automatic=True \
   --hoodie-conf hoodie.clean.async=True \
   --hoodie-conf hoodie.parquet.max.file.size=1073741824 \
   --hoodie-conf hoodie.parquet.small.file.limit=0 \
   --hoodie-conf hoodie.parquet.compression.codec=snappy \
   --hoodie-conf hoodie.copyonwrite.insert.auto.split=True \
   --hoodie-conf hoodie.clustering.async.enabled=True \
   --hoodie-conf hoodie.clustering.async.max.commits=4 \
   --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824 \
   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=629145600 \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=date_partition \
   --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id, code, cast(created_at as timestamp) as created_at, cast(updated_at as timestamp) as updated_at, date_partition from <SRC>"
   ```
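   
   In case it is relevant: we do not set a hive sync mode explicitly. The sketch below shows the extra flags we would try in order to route the sync through the Hive metastore client (which EMR backs with the Glue Data Catalog) instead of JDBC; this is an assumption on our side, not something we have confirmed to change the cluster-mode behaviour.
   
   ```
   # Sketch only (assumption): route Hudi's hive sync through the HMS client,
   # which EMR backs with the Glue Data Catalog, instead of a Hive JDBC URL.
   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
   --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
   ```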
   
   **Expected behavior**
   
   The job executes successfully and the table is created in the Glue metastore and in S3. Hudi 0.10.0 should be able to sync tables with Glue: both creating new tables and, for incremental loads, syncing new partitions of existing tables.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.1.1
   
   * Hive version : 3.1.2
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   We are using AWS EMR version 6.3.0
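   
   As we understand it, Glue is wired in on EMR through the Hive metastore client factory. A minimal sketch of that setting is below (written as a spark-submit conf for clarity); it reflects our assumed cluster configuration rather than being part of the failing command.
   
   ```
   # Sketch (assumption about the EMR setup): use the Glue Data Catalog as the
   # Hive metastore for Spark by overriding the metastore client factory.
   --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
   ```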
   
   **Stacktrace**
   
   Unfortunately we don't have any errors to show, because the job ends normally without any problem. The issue is only that with ```deploy-mode cluster``` the table is not created in Glue, while with ```deploy-mode client``` the table is created in Glue but the job never finishes.
   
   

