annilort opened a new issue, #10035:
URL: https://github.com/apache/hudi/issues/10035
Maybe you have some insight into whether this is a 0.14.0 / EMR compatibility issue, or whether something else is going wrong with updates during cleaning? I do not see the same behavior when using 0.13.1.
I am running multiple consecutive Spark jobs on EMR Serverless; each iteration writes (inserts) data to the same Hudi tables. I have not set any cleaning configurations. With the defaults, the clean runs on the 12th job (after the job with the 11th commit), where one parquet file is cleaned and the job succeeds. The subsequent (13th) job stalls after completing the first table (let's say table1): a parquet file is written to S3 and the /.hoodie folder has the corresponding commit, but the job never moves on to the remaining tables. It appears to stop while trying to read table1's .clean file.
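For context, since no cleaning configurations are set, the tables should be running with the Hudi cleaner defaults. As I understand them (please correct me if any of these defaults changed in 0.14.0), they are:

```python
# Cleaner-related settings as I believe they default in Hudi 0.14.0
# (none of these are set explicitly in my jobs).
cleaner_defaults = {
    "hoodie.clean.automatic": "true",               # clean runs inline after commits
    "hoodie.clean.async": "false",                  # cleaning is synchronous
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",        # would explain the clean firing around the 11th/12th commit
}
```

The default of 10 retained commits lines up with the clean first firing on the 12th job in my runs.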
**To Reproduce**
Steps to reproduce the behavior:
1. Submit n jobs to EMR Serverless that write data to the Hudi tables until a clean happens (with `spark.jars=hudi-spark3.4-bundle_2.12-0.14.0.jar`)
2. Submit another job to write to the Hudi tables
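Each job writes roughly like this (the record key and precombine fields below are placeholders, not my real schema):

```python
# Sketch of the per-job Hudi write options; no hoodie.clean.* or
# hoodie.cleaner.* options are set, so cleaning runs with defaults.
hudi_options = {
    "hoodie.table.name": "table1",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "id",   # placeholder field name
    "hoodie.datasource.write.precombine.field": "ts",  # placeholder field name
}

# In the actual job (requires pyspark and the Hudi bundle on the classpath):
#   df.write.format("hudi").options(**hudi_options).mode("append") \
#     .save("s3://path/table1")
```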
**Expected behavior**
All Hudi table writes complete and the job does not stall.
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 3.4.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Logs**
The last Spark driver log entries before the stall:
```
23/11/09 11:12:11 INFO TaskSchedulerImpl: Removed TaskSet 36.0, whose tasks have all completed, from pool
23/11/09 11:12:11 INFO DAGScheduler: ResultStage 36 (collectAsMap at HoodieSparkEngineContext.java:164) finished in 0.174 s
23/11/09 11:12:11 INFO DAGScheduler: Job 16 is finished. Cancelling potential speculative or zombie tasks for this job
23/11/09 11:12:11 INFO TaskSchedulerImpl: Killing all running tasks in stage 36: Stage finished
23/11/09 11:12:11 INFO DAGScheduler: Job 16 finished: collectAsMap at HoodieSparkEngineContext.java:164, took 0.176789 s
23/11/09 11:12:11 INFO MapPartitionsRDD: Removing RDD 28 from persistence list
23/11/09 11:12:11 INFO BlockManager: Removing RDD 28
23/11/09 11:12:11 INFO MapPartitionsRDD: Removing RDD 38 from persistence list
23/11/09 11:12:11 INFO BlockManager: Removing RDD 38
23/11/09 11:12:11 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/metadata/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/metadata/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/20231109104515015.clean' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/20231109104515015.clean' for reading
23/11/09 11:13:09 INFO EmrServerlessClusterSchedulerBackend: Requesting to kill executor(s) 2
23/11/09 11:13:09 INFO EmrServerlessClusterSchedulerBackend: Actual list of executor(s) to be killed is 2
23/11/09 11:13:09 INFO ExecutorContainerAllocator: Set total expected execs to {0=0}
23/11/09 11:13:09 INFO ExecutorAllocationManager: Executors 2 removed due to idle timeout.
23/11/09 11:13:09 INFO TaskSchedulerImpl: Executor 2 on [2600:1f10:4da5:c701:aee8:3b04:2e94:3eab] killed by driver.
23/11/09 11:13:09 INFO DAGScheduler: Executor lost: 2 (epoch 8)
23/11/09 11:13:09 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 1, unexpectedly exited: 0).
23/11/09 11:13:09 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
23/11/09 11:13:09 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, [2600:1f10:4da5:c701:aee8:3b04:2e94:3eab], 37605, None)
23/11/09 11:13:09 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
23/11/09 11:13:09 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 8)
```