rhyphenkumar opened a new issue, #7742:
URL: https://github.com/apache/hudi/issues/7742

   I have recently started using Hudi. I am trying to run a Hudi job with around 50 GB of base data on Spark in upsert mode.
   
   The job fails with two types of exceptions across different runs. Sometimes the job fails with
   
   ```
   org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 19), which maintains the block data to fetch is dead.
   ```
   
   and at other times it fails with
   
   ```
   ERROR server.ChunkFetchRequestHandler: Error sending result ChunkFetchSuccess
   ```
   
   **To Reproduce**
   
   The PySpark code used is as follows:
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession \
       .builder \
       .appName("hudi_test") \
       .enableHiveSupport() \
       .getOrCreate()
   
   tableName = "hudi_test12"
   basePath = "/tmp/rahul/hudi_test12"
   
   df = spark.read.parquet("/user/data/input")
   df = df.repartition(500)
   df.show()
   
   hudi_options = {
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.recordkey.field': 'row_id',
       'hoodie.datasource.write.partitionpath.field': 'rpt_partition_id',
       'hoodie.datasource.write.table.name': tableName,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'created_time',
       'hoodie.upsert.shuffle.parallelism': 500,
       'hoodie.insert.shuffle.parallelism': 500
   }
   
   df.write.format("hudi") \
       .options(**hudi_options) \
       .mode("overwrite") \
       .save(basePath)
   ```
   
   
   **Environment Description**
   
   * Hudi version : hudi-spark3.2-bundle_2.12-0.13.0
   
   * Spark version : 3.2.0
   
   * Hadoop version : 3.1
   
   * Storage : HDFS
   
   * Running on Docker?  : No 
   
   
   The spark-submit command used is as follows:
   
   ```
   spark3-submit --master yarn \
     --jars avro-1.10.0.jar,hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar \
     --deploy-mode cluster \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --driver-memory 5g --executor-memory 22g \
     --num-executors 20 --executor-cores 12 \
     test.py
   ```
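   For context, a rough back-of-the-envelope check of the shuffle sizing implied by the settings above (the 50 GB base-data size, 500 shuffle partitions, and 20 executors with 12 cores each are taken from this report; the arithmetic below is only an illustrative sketch, not part of the original job):
   
   ```python
   # Rough shuffle-sizing sketch based on the figures reported above.
   # Assumption: the ~50 GB of base data is spread evenly across partitions.
   base_data_mb = 50 * 1024        # ~50 GB of base data
   shuffle_partitions = 500        # repartition(500) / hoodie.upsert.shuffle.parallelism
   executors, cores = 20, 12       # from the spark-submit command
   
   mb_per_partition = base_data_mb / shuffle_partitions
   concurrent_tasks = executors * cores
   
   print(f"~{mb_per_partition:.0f} MB per shuffle partition")  # ~102 MB
   print(f"{concurrent_tasks} tasks running concurrently")     # 240
   ```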
   
   **Stacktrace for error 1**
   
   ```23/01/24 12:37:55 ERROR client.TransportResponseHandler: Still have 1 
requests outstanding when connection from 
mnplld-shddn02.india.airtel.itm/10.240.8.108:42916 is closed
   23/01/24 12:37:55 INFO shuffle.RetryingBlockTransferor: Retrying fetch (1/3) 
for 1 outstanding blocks after 5000 ms
   23/01/24 12:38:01 INFO client.TransportClientFactory: Found inactive 
connection to mnplld-shddn02.india.airtel.itm/10.240.8.108:42916, creating a 
new one.
   23/01/24 12:38:01 ERROR shuffle.RetryingBlockTransferor: Exception while 
beginning fetch of 1 outstanding blocks (after 1 retries)
   org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 
19), which maintains the block data to fetch is dead.
       at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:136)
       at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
       at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
       at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       at java.base/java.lang.Thread.run(Thread.java:834)
   23/01/24 12:38:01 ERROR storage.ShuffleBlockFetcherIterator: Failed to get 
block(s) from mnplld-shddn02.india.airtel.itm:42916
   org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 
19), which maintains the block data to fetch is dead.
       at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:136)
       at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
       at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
       at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   
   **Stacktrace for error 2**
   
   ```
   ERROR server.ChunkFetchRequestHandler: Error sending result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=386695752133,chunkIndex=0],buffer=FileSegmentManagedBuffer[file=/data13/yarn/nm/usercache/ocpdev_user/appcache/application_1672133878607_1225/blockmgr-d01002ec-b16b-4a53-8287-775371c52c67/17/shuffle_5_3308_0.data,offset=0,length=185864203]]
 to /10.240.8.93:34882; closing connection
   io.netty.channel.StacklessClosedChannelException
        at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown 
Source)
   23/01/24 13:20:34 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
SIGNAL TERM
   23/01/24 13:20:34 ERROR server.ChunkFetchRequestHandler: Error sending 
result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=386695752119,chunkIndex=0],buffer=FileSegmentManagedBuffer[file=/data02/yarn/nm/usercache/ocpdev_user/appcache/application_1672133878607_1225/blockmgr-67e726ca-319f-4412-a3cb-291ad05851ec/04/shuffle_5_3286_0.data,offset=0,length=185736475]]
 to /10.240.8.116:58848; closing connection
   io.netty.channel.StacklessClosedChannelException
        at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown 
Source)
   23/01/24 13:20:34 ERROR server.ChunkFetchRequestHandler: Error sending 
result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=386695752120,chunkIndex=0],buffer=FileSegmentManagedBuffer[file=/data15/yarn/nm/usercache/ocpdev_user/appcache/application_1672133878607_1225/blockmgr-63c89aed-a7a8-4dbc-8451-6132f8fc89a4/1f/shuffle_5_2236_0.data,offset=0,length=185942031]]
 to /10.240.8.116:58848; closing connection
   io.netty.channel.StacklessClosedChannelException
        at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown 
Source)
   23/01/24 13:20:34 ERROR client.TransportResponseHandler: Still have 1 
requests outstanding when connection from 
mnplld-shddn10.india.airtel.itm/10.240.8.116:34840 is closed
   ```
   
   Kindly support in fixing these issues.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
