[
https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun closed SPARK-47842.
---------------------------------
> Spark job relying over Hudi are blocked after one or zero commit
> ----------------------------------------------------------------
>
> Key: SPARK-47842
> URL: https://issues.apache.org/jira/browse/SPARK-47842
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Structured Streaming
> Affects Versions: 3.3.0
> Environment: Hudi version : 0.12.1-amzn-0
> Spark version : 3.3.0
> Hive version : 3.1.3
> Hadoop version : 3.3.3 amz
> Storage (HDFS/S3/GCS..) : S3
> Running on Docker? (yes/no) : no (EMR 6.9.0)
> Additional context
> Reporter: alessandro pontis
> Priority: Major
> Labels: pull-request-available
> Attachments: console_spark.png
>
>
> Hello, we are facing the fact that some pyspark job that rely on Hudi seems
> to be blocked, in fact if we go over the spark console we can see the
> situation in the attachment
> we can see that we have 71 completed jobs but those are CDC process that
> should read from Kafka topic continuously. We verified yet that there are
> messages queued over the kafka topic. If you kill the application and then
> restart in some cases the job will act normally and other times the job still
> remain stacked.
> Our deploy condition are the following:
> We read INSERT, UPDATE and DELETE operation from a Kafka topic and we
> replicate them in a target hudi table stored on Hive via a pyspark job
> running 24/7
>
> PYSPARK WRITE
> df_source.writeStream.foreachBatch(foreach_batch_write_function)
> {{ FOR EACH BATCH FUNCTION:
> #management of delete messages
> batchDF_deletes.write.format('hudi') \
> .option('hoodie.datasource.write.operation', 'delete') \
> .options(**hudiOptions_table) \
> .mode('append') \
> .save(S3_OUTPUT_PATH)
> #management of update and insert messages
> batchDF_upserts.write.format('org.apache.hudi') \
> .option('hoodie.datasource.write.operation', 'upsert') \
> .options(**hudiOptions_table) \
> .mode('append') \
> .save(S3_OUTPUT_PATH)}}
>
> SPARK SUBMIT
> spark-submit --master yarn --deploy-mode cluster --num-executors 1
> --executor-memory 1G --executor-cores 2 --conf
> spark.dynamicAllocation.enabled=false --packages
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
> spark.sql.hive.convertMetastoreParquet=false --jars
> /usr/lib/hudi/hudi-spark-bundle.jar <path_to_script>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]