rhyphenkumar opened a new issue, #7742:
URL: https://github.com/apache/hudi/issues/7742
I have recently started using Hudi. I am trying to run a Hudi job with
around 50 GB of base data on Spark in upsert mode.
The job fails with two types of exceptions on different runs. Sometimes the job
fails with the exception
org.apache.spark.ExecutorDeadException: The relative remote executor(Id:
19), which maintains the block data to fetch is dead.
and at other times it fails with
ERROR server.ChunkFetchRequestHandler: Error sending result ChunkFetchSuccess
**To Reproduce**
The PySpark code used is as follows:
```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("hudi_test") \
    .enableHiveSupport() \
    .getOrCreate()

tableName = "hudi_test12"
basePath = "/tmp/rahul/hudi_test12"

df = spark.read.parquet("/user/data/input")
df = df.repartition(500)
df.show()

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'row_id',
    'hoodie.datasource.write.partitionpath.field': 'rpt_partition_id',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'created_time',
    'hoodie.upsert.shuffle.parallelism': 500,
    'hoodie.insert.shuffle.parallelism': 500
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save(basePath)
```
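One detail worth noting (an observation about the snippet, not taken from the stack traces): the write combines `'hoodie.datasource.write.operation': 'upsert'` with `mode("overwrite")`, and overwrite mode recreates the table on every run. For incremental upserts into an existing table, Hudi examples typically use `mode("append")`. A minimal sketch of that variant, reusing the same option values as above:

```python
# Hedged sketch: the same Hudi options as in the snippet above, but with
# the write issued in append mode so repeated runs upsert into the
# existing table instead of recreating it.
tableName = "hudi_test12"

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'row_id',
    'hoodie.datasource.write.partitionpath.field': 'rpt_partition_id',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'created_time',
    'hoodie.upsert.shuffle.parallelism': 500,
    'hoodie.insert.shuffle.parallelism': 500,
}

# With a SparkSession and DataFrame `df` in scope, the write becomes:
# df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
```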
**Environment Description**
* Hudi version : hudi-spark3.2-bundle_2.12-0.13.0
* Spark version : 3.2.0
* Hadoop version : 3.1
* Storage : HDFS
* Running on Docker? : No
The spark-submit command used is as follows:
```
spark3-submit --master yarn --deploy-mode cluster \
  --jars avro-1.10.0.jar,hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --driver-memory 5g --executor-memory 22g \
  --num-executors 20 --executor-cores 12 \
  test.py
```
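Both failure modes below point at executors dying mid-shuffle, which on YARN is commonly caused by containers exceeding their memory limits or by shuffle fetches timing out. A hedged sketch of extra configuration that is often tried for this class of failure (the specific values here are assumptions, not a confirmed fix):

```shell
# Assumed mitigation sketch, not a confirmed fix: reserve explicit
# off-heap overhead per executor and allow shuffle fetches more time
# and retries before a remote executor is declared dead.
spark3-submit --master yarn --deploy-mode cluster \
  --jars avro-1.10.0.jar,hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.executor.memoryOverhead=4g' \
  --conf 'spark.network.timeout=600s' \
  --conf 'spark.shuffle.io.maxRetries=10' \
  --conf 'spark.shuffle.io.retryWait=30s' \
  --driver-memory 5g --executor-memory 22g \
  --num-executors 20 --executor-cores 12 \
  test.py
```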
**Stacktrace for error 1**
```23/01/24 12:37:55 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection from
mnplld-shddn02.india.airtel.itm/10.240.8.108:42916 is closed
23/01/24 12:37:55 INFO shuffle.RetryingBlockTransferor: Retrying fetch (1/3)
for 1 outstanding blocks after 5000 ms
23/01/24 12:38:01 INFO client.TransportClientFactory: Found inactive
connection to mnplld-shddn02.india.airtel.itm/10.240.8.108:42916, creating a
new one.
23/01/24 12:38:01 ERROR shuffle.RetryingBlockTransferor: Exception while
beginning fetch of 1 outstanding blocks (after 1 retries)
org.apache.spark.ExecutorDeadException: The relative remote executor(Id:
19), which maintains the block data to fetch is dead.
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:136)
at
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
at
org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
23/01/24 12:38:01 ERROR storage.ShuffleBlockFetcherIterator: Failed to get
block(s) from mnplld-shddn02.india.airtel.itm:42916
org.apache.spark.ExecutorDeadException: The relative remote executor(Id:
19), which maintains the block data to fetch is dead.
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:136)
at
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
at
org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)
```
**Stacktrace for error 2**
```
ERROR server.ChunkFetchRequestHandler: Error sending result
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=386695752133,chunkIndex=0],buffer=FileSegmentManagedBuffer[file=/data13/yarn/nm/usercache/ocpdev_user/appcache/application_1672133878607_1225/blockmgr-d01002ec-b16b-4a53-8287-775371c52c67/17/shuffle_5_3308_0.data,offset=0,length=185864203]]
to /10.240.8.93:34882; closing connection
io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown
Source)
23/01/24 13:20:34 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL TERM
23/01/24 13:20:34 ERROR server.ChunkFetchRequestHandler: Error sending
result
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=386695752119,chunkIndex=0],buffer=FileSegmentManagedBuffer[file=/data02/yarn/nm/usercache/ocpdev_user/appcache/application_1672133878607_1225/blockmgr-67e726ca-319f-4412-a3cb-291ad05851ec/04/shuffle_5_3286_0.data,offset=0,length=185736475]]
to /10.240.8.116:58848; closing connection
io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown
Source)
23/01/24 13:20:34 ERROR server.ChunkFetchRequestHandler: Error sending
result
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=386695752120,chunkIndex=0],buffer=FileSegmentManagedBuffer[file=/data15/yarn/nm/usercache/ocpdev_user/appcache/application_1672133878607_1225/blockmgr-63c89aed-a7a8-4dbc-8451-6132f8fc89a4/1f/shuffle_5_2236_0.data,offset=0,length=185942031]]
to /10.240.8.116:58848; closing connection
io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown
Source)
23/01/24 13:20:34 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection from
mnplld-shddn10.india.airtel.itm/10.240.8.116:34840 is closed
```
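When executors die like this, the aggregated YARN container logs usually record why, e.g. whether YARN killed the container for running beyond its memory limits. A hedged diagnostic step (the grep pattern is an assumption; the application id is taken from the shuffle file paths in the stack trace above):

```shell
# Pull the aggregated YARN logs for the application and search for
# container-kill / memory messages. The application id comes from the
# shuffle file paths in the stack trace above.
yarn logs -applicationId application_1672133878607_1225 \
  | grep -i -E 'killed|exceed|OutOfMemory'
```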
Kindly help in resolving these issues.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]