annilort opened a new issue, #10035:
URL: https://github.com/apache/hudi/issues/10035
Maybe you have some insight into whether this is a 0.14.0 / EMR compatibility issue, or whether something else is going wrong with updates during cleaning? I do not see the same behavior when using 0.13.1.
I am running multiple consecutive Spark jobs on EMR Serverless; each iteration writes (inserts) data to the same Hudi tables. I have not set any cleaning configurations. With the defaults, the clean runs on the 12th job (after the job with the 11th commit), where one parquet file is cleaned and the job succeeds. The subsequent (13th) job stalls after completing the first table (let's say table1): a parquet file is written to S3 and the /.hoodie folder has the corresponding commit, but the job never moves on to the remaining tables. It appears to stop while trying to read table1's .clean file.
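For context, since no cleaning configurations are set, the tables should be running with the Hudi cleaner defaults. As I understand them (please correct me if any of these defaults changed in 0.14.0), they are:

```python
# Cleaner-related settings as I believe they default in Hudi 0.14.0
# (none of these are set explicitly in my jobs).
cleaner_defaults = {
    "hoodie.clean.automatic": "true",               # clean runs inline after commits
    "hoodie.clean.async": "false",                  # cleaning is synchronous
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",        # would explain the clean firing around the 11th/12th commit
}
```

The default of 10 retained commits lines up with the clean first firing on the 12th job in my runs.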
**To Reproduce**
Steps to reproduce the behavior:
1. Submit n jobs to EMR Serverless that write data to the Hudi tables until a clean happens (with `spark.jars=hudi-spark3.4-bundle_2.12-0.14.0.jar`)
2. Submit another job to write to the Hudi tables
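Each job writes roughly like this (the record key and precombine fields below are placeholders, not my real schema):

```python
# Sketch of the per-job Hudi write options; no hoodie.clean.* or
# hoodie.cleaner.* options are set, so cleaning runs with defaults.
hudi_options = {
    "hoodie.table.name": "table1",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "id",   # placeholder field name
    "hoodie.datasource.write.precombine.field": "ts",  # placeholder field name
}

# In the actual job (requires pyspark and the Hudi bundle on the classpath):
#   df.write.format("hudi").options(**hudi_options).mode("append") \
#     .save("s3://path/table1")
```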
**Expected behavior**
All Hudi table writes complete and the job does not stall.
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 3.4.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Logs**
The last Spark driver log entries before the stall:
```
23/11/09 11:12:11 INFO TaskSchedulerImpl: Removed TaskSet 36.0, whose tasks have all completed, from pool
23/11/09 11:12:11 INFO DAGScheduler: ResultStage 36 (collectAsMap at HoodieSparkEngineContext.java:164) finished in 0.174 s
23/11/09 11:12:11 INFO DAGScheduler: Job 16 is finished. Cancelling potential speculative or zombie tasks for this job
23/11/09 11:12:11 INFO TaskSchedulerImpl: Killing all running tasks in stage 36: Stage finished
23/11/09 11:12:11 INFO DAGScheduler: Job 16 finished: collectAsMap at HoodieSparkEngineContext.java:164, took 0.176789 s
23/11/09 11:12:11 INFO MapPartitionsRDD: Removing RDD 28 from persistence list
23/11/09 11:12:11 INFO BlockManager: Removing RDD 28
23/11/09 11:12:11 INFO MapPartitionsRDD: Removing RDD 38 from persistence list
23/11/09 11:12:11 INFO BlockManager: Removing RDD 38
23/11/09 11:12:11 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/metadata/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/metadata/.hoodie/hoodie.properties' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/20231109104515015.clean' for reading
23/11/09 11:12:12 INFO S3NativeFileSystem: Opening 's3://path/table1/.hoodie/20231109104515015.clean' for reading
23/11/09 11:13:09 INFO EmrServerlessClusterSchedulerBackend: Requesting to kill executor(s) 2
23/11/09 11:13:09 INFO EmrServerlessClusterSchedulerBackend: Actual list of executor(s) to be killed is 2
23/11/09 11:13:09 INFO ExecutorContainerAllocator: Set total expected execs to {0=0}
23/11/09 11:13:09 INFO ExecutorAllocationManager: Executors 2 removed due to idle timeout.
23/11/09 11:13:09 INFO TaskSchedulerImpl: Executor 2 on [2600:1f10:4da5:c701:aee8:3b04:2e94:3eab] killed by driver.
23/11/09 11:13:09 INFO DAGScheduler: Executor lost: 2 (epoch 8)
23/11/09 11:13:09 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 1, unexpectedly exited: 0).
23/11/09 11:13:09 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
23/11/09 11:13:09 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, [2600:1f10:4da5:c701:aee8:3b04:2e94:3eab], 37605, None)
23/11/09 11:13:09 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
23/11/09 11:13:09 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 8)
```