hgudladona opened a new issue, #13356:
URL: https://github.com/apache/hudi/issues/13356
**Describe the problem you faced**
Hello, we are seeing intermittent timeouts with timeline-server-based write
markers; the stack trace is below. All marker configurations are at their
defaults, except that we set the timeout to 30s (the default is 5m). Our
workloads run on an EKS cluster with OSS Spark and Hudi. The issue does not
show up consistently and we cannot provide reproducible steps, but it tends
to appear under higher load. As a workaround we have switched the marker type
back to DIRECT, at the expense of additional S3 API costs. We need help
identifying the problem and fixing it.
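For reference, here is a sketch of the marker-related settings in play. The keys are from Hudi's write configuration reference as we understand it; values are as described above, with everything else at its default:

```properties
# Timeline-server-based markers (executors send marker requests to the
# embedded timeline server on the driver instead of writing files directly)
hoodie.write.markers.type=TIMELINE_SERVER_BASED
# Read timeout for requests to the embedded timeline server.
# We lowered this to 30s; the default is 300s (5m).
hoodie.filesystem.view.remote.timeout.secs=30
# Workaround we fell back to, at the cost of extra S3 API calls:
# hoodie.write.markers.type=DIRECT
```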
**To Reproduce**
Steps to reproduce the behavior:
We cannot guarantee consistent reproduction, but here is our setup:
1. Run DeltaStreamer reading from Kafka in continuous mode, with timeline-server-based
markers
2. Each commit writes to ~1200 partitions (and therefore roughly as many
parquet files and marker requests)
3. Run 150 executors with 2 cores each and sufficient memory
4. Wait for this to fail :)
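Roughly, the job launch looks like the sketch below. The entry-point class is the Hudi 0.14.x DeltaStreamer; the jar path, master URL, topic/source settings, and table names are placeholders, and the MERGE_ON_READ table type is inferred from the deltacommit executor in the stack trace:

```shell
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --master k8s://... \
  --num-executors 150 --executor-cores 2 \
  hudi-utilities-bundle_2.12-0.14.1.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --target-base-path s3://<bucket>/<path> \
  --target-table <table> \
  --continuous \
  --hoodie-conf hoodie.write.markers.type=TIMELINE_SERVER_BASED \
  --hoodie-conf hoodie.filesystem.view.remote.timeout.secs=30
```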
**Expected behavior**
Timeline server based markers consistently succeed.
**Environment Description**
* Hudi version : 0.14.1
* Spark version : 3.4.x
* Hadoop version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes
* Driver resources:
```
Limits:
cpu: 8
memory: 6758Mi
Requests:
cpu: 8
memory: 6758Mi
```
**Additional context**
We suspect a bug in the locking behavior of MarkerDirState that shows up
only under higher load. As stated above, the driver itself is sufficiently
resourced. We profiled the driver with JProfiler while the problem was
occurring but could not find anything obviously wrong.
Please let us know what additional information you need here.
**Stacktrace**
```
2025-05-24T16:24:36,310 WARN [task-result-getter-0]
org.apache.spark.internal.Logging:Lost task 648.1 in stage 22.0 (TID 13747)
(10.41.173.157 executor 171): org.apache.hudi.exception.HoodieUpsertException:
Error upserting bucketType UPDATE for partition :648
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:342)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:348)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:259)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:905)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:905)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:377)
at
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552)
at
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:326)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.hudi.exception.HoodieRemoteException: Failed to create
marker file
tenant=xxxxxx/date=20250524/3427d9c9-ea16-4af1-89f0-9d6fc1045cdf-0_648-22-13747_20250524161054306.parquet.marker.MERGE
Read timed out
at
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:187)
at
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.create(TimelineServerBasedWriteMarkers.java:143)
at
org.apache.hudi.table.marker.WriteMarkers.create(WriteMarkers.java:95)
at
org.apache.hudi.io.HoodieWriteHandle.createMarkerFile(HoodieWriteHandle.java:144)
at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:198)
at
org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:134)
at
org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:125)
at
org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:68)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:400)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:368)
at
org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:79)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335)
... 30 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
at java.base/java.net.SocketInputStream.read(Unknown Source)
at java.base/java.net.SocketInputStream.read(Unknown Source)
at
org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
at
org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
at
org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
at
org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
at
org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at
org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at
org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
at
org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
at
org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at
org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at
org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at
org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at
org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at
org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at
org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at
org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151)
at
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:233)
at
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:184)
	... 41 more
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]