hgudladona opened a new issue, #13356:
URL: https://github.com/apache/hudi/issues/13356

   **Describe the problem you faced**
   
   Hello, we are seeing intermittent timeouts with timeline-server-based write 
markers. The stack trace is below. All marker configurations are at their 
defaults; however, we have set the timeout value to 30s, while the default is 
5m. Our workloads run on an EKS cluster with OSS Spark and Hudi. The issue does 
not show up consistently, so we cannot provide reproducible steps, but it tends 
to appear under higher load. Because of this behavior we have reverted to 
DIRECT markers at the expense of S3 API costs. We need help identifying and 
fixing the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   We cannot guarantee the behavior can be reproduced consistently, but here is 
our setup:
   
   1. Run Delta Streamer from Kafka in continuous mode with timeline-server-based 
markers
   2. In each commit, write ~1200 partitions (and, implicitly, Parquet files and 
marker requests)
   3. Run 150 executors with 2 cores each and sufficient memory
   4. Wait for this to fail :)
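   For reference, our invocation looks roughly like the sketch below (the jar 
name, paths, table name, source class, and property file are placeholders, not 
our actual values; flags are per the Hudi 0.14 utilities bundle):

   ```sh
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     hudi-utilities-bundle_2.12-0.14.1.jar \
     --table-type COPY_ON_WRITE \
     --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
     --target-base-path s3://our-bucket/our-table \
     --target-table our_table \
     --props kafka-source.properties \
     --continuous \
     --hoodie-conf hoodie.write.markers.type=TIMELINE_SERVER_BASED
   ```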
   
   **Expected behavior**
   
   Timeline-server-based marker creation consistently succeeds.
   
   **Environment Description**
   
   * Hudi version :  0.14.1
   
   * Spark version : 3.4.x
   
   * Hadoop version : 3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
    
   * Driver resources: 
    ```
       Limits:
         cpu:     8
         memory:  6758Mi
       Requests:
         cpu:     8
         memory:  6758Mi
   ```
   
   **Additional context**
   
   We suspect a bug in the locking behavior of MarkerDirState that shows up only 
under higher load. As stated above, the driver itself is sufficiently 
resourced. We profiled the driver with JProfiler while the problem was 
occurring but could not find anything obvious. Kindly let me know what 
additional information you need.
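   For completeness, the marker-related settings in play look roughly like the 
sketch below (the batching values shown are what we believe to be the 0.14 
defaults; the timeout is the one we lowered; please correct me if any key is 
wrong):

   ```properties
   # Marker mechanism (we have since reverted this to DIRECT as a workaround)
   hoodie.write.markers.type=TIMELINE_SERVER_BASED
   # Batching of marker creation requests on the timeline server (defaults, not overridden by us)
   hoodie.markers.timeline_server_based.batch.num_threads=20
   hoodie.markers.timeline_server_based.batch.interval_ms=50
   # Remote timeline server request timeout: we set 30s (default is 300s / 5m)
   hoodie.filesystem.view.remote.timeout.secs=30
   ```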
   
   **Stacktrace**
   
   ```
   2025-05-24T16:24:36,310 WARN [task-result-getter-0] 
org.apache.spark.internal.Logging: Lost task 648.1 in stage 22.0 (TID 13747) 
(10.41.173.157 executor 171): org.apache.hudi.exception.HoodieUpsertException: 
Error upserting bucketType UPDATE for partition :648
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:342)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:348)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:259)
        at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
        at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:905)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:905)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:377)
        at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552)
        at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462)
        at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
        at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:326)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
        at java.base/java.lang.Thread.run(Unknown Source)
   Caused by: org.apache.hudi.exception.HoodieRemoteException: Failed to create 
marker file 
tenant=xxxxxx/date=20250524/3427d9c9-ea16-4af1-89f0-9d6fc1045cdf-0_648-22-13747_20250524161054306.parquet.marker.MERGE
   Read timed out
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:187)
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.create(TimelineServerBasedWriteMarkers.java:143)
        at 
org.apache.hudi.table.marker.WriteMarkers.create(WriteMarkers.java:95)
        at 
org.apache.hudi.io.HoodieWriteHandle.createMarkerFile(HoodieWriteHandle.java:144)
        at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:198)
        at 
org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:134)
        at 
org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:125)
        at 
org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:68)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:400)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:368)
        at 
org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:79)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335)
        ... 30 more
   Caused by: java.net.SocketTimeoutException: Read timed out
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
        at java.base/java.net.SocketInputStream.read(Unknown Source)
        at java.base/java.net.SocketInputStream.read(Unknown Source)
        at 
org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
        at 
org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
        at 
org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
        at 
org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at 
org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at 
org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at 
org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
        at 
org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
        at 
org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at 
org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at 
org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
        at 
org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at 
org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
        at 
org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at 
org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at 
org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151)
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:233)
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:184)
        ... 41 more
   ```
   
   

