gudladona opened a new issue, #11203:
URL: https://github.com/apache/hudi/issues/11203

   Hello,
   
   We have a problem that occurs intermittently in our environment and causes 
an S3 upload (an HTTP PUT) to stall for 17 to 19 minutes. Let me describe it 
in detail.
    
   First off, the environment details. We are running OSS Spark and Hadoop on 
EKS with Karpenter.
   
   JDK version : 11.0.19
   Spark Version: 3.4.1
   Hadoop Version: 3.3.4
   EKS Version: 1.26
   Hudi Version: 0.14.x
   OS: Verified on both Bottlerocket & AL2
    
   Issue Details:
    
   Occasionally, we notice that a Spark stage and a few of its tasks stall for 
about 17 minutes; the delay is consistent whenever it happens. We have traced 
this to a stalled socket write during a close() inside the AWS SDK, which uses 
the Apache HTTP Client. When a TLS connection goes bad, we expect the 
underlying socket to be terminated eagerly so the request can be retried, but 
we do not see that happening. Instead, the socket is left open until the OS 
terminates it. This seems to be due to the socket linger 
[option](https://docs.oracle.com/javase%2F8%2Fdocs%2Fapi%2F%2F/java/net/SocketOptions.html#SO_LINGER)
 which is set to -1 by default in the JDK. Setting linger to 0 would remove 
bad connections immediately, but neither the AWS SDK nor the Apache HTTP 
Client sets this option to change the JDK's default linger behavior.
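
   For reference, here is a minimal sketch of the linger settings described 
above, using only `java.net.Socket`. The class and method names 
(`LingerDemo`, `defaultLinger`, `zeroLinger`) are made up for illustration, 
and the sockets are never connected, so this only shows how the option is read 
and set, not the stall itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

// Illustrative only: LingerDemo is a hypothetical name, not SDK or Hudi code.
final class LingerDemo {

    /** Reads SO_LINGER on a fresh, unconnected socket; -1 means the option is disabled. */
    static int defaultLinger() {
        try (Socket socket = new Socket()) {
            return socket.getSoLinger();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Enables SO_LINGER with a zero timeout and reads the value back. */
    static int zeroLinger() {
        try (Socket socket = new Socket()) {
            // With linger enabled and a timeout of 0, close() on a connected
            // socket discards unsent data and resets the connection.
            socket.setSoLinger(true, 0);
            return socket.getSoLinger();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default SO_LINGER: " + defaultLinger());
        System.out.println("after setSoLinger(true, 0): " + zeroLinger());
    }
}
```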
    
   Attached are DEBUG-level logs for the AWS SDK, Hadoop S3A, and the Apache 
HTTP Client, captured when the issue occurred. They show slightly different 
errors.
    
   After further investigation, we found this JDK bug: 
https://bugs.openjdk.org/browse/JDK-8241239. It describes and reproduces the 
issue we are having exactly.
   
   We tried forking the AWS SDK, adding a linger option that defaults to 0 
[here](https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/ClientConfiguration.java)
 and applying it to the SSL socket options 
[here](https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/conn/ssl/SdkTLSSocketFactory.java#L141).
 That did not fix the issue, possibly because of how this JDK version treats 
the socket options.
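
   To illustrate the general idea of applying linger at socket creation time, 
here is a sketch in plain JDK terms. `ZeroLingerSocketFactory` is a 
hypothetical wrapper, not the code in our fork and not a drop-in for the SDK 
classes linked above; it only shows a `javax.net.SocketFactory` decorator that 
sets linger to 0 on every socket it hands out. Since `SSLSocketFactory` 
extends `SocketFactory`, the same wrapper could in principle decorate one. 
Given the result above, this alone may not avoid the stall on an affected JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.Socket;
import javax.net.SocketFactory;

// Hypothetical decorator: not part of the AWS SDK or the Apache HTTP Client.
final class ZeroLingerSocketFactory extends SocketFactory {
    private final SocketFactory delegate;

    ZeroLingerSocketFactory(SocketFactory delegate) {
        this.delegate = delegate;
    }

    // Enable SO_LINGER with a zero timeout on every socket the delegate creates.
    private static Socket configure(Socket socket) throws IOException {
        socket.setSoLinger(true, 0);
        return socket;
    }

    @Override
    public Socket createSocket() throws IOException {
        return configure(delegate.createSocket());
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return configure(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
        return configure(delegate.createSocket(host, port, localHost, localPort));
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return configure(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort)
            throws IOException {
        return configure(delegate.createSocket(address, port, localAddress, localPort));
    }

    /** Creates an unconnected socket through the wrapper and reports its SO_LINGER value. */
    static int lingerOfNewSocket() {
        SocketFactory factory = new ZeroLingerSocketFactory(SocketFactory.getDefault());
        try (Socket socket = factory.createSocket()) {
            return socket.getSoLinger();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_LINGER on wrapped socket: " + lingerOfNewSocket());
    }
}
```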
   
   ### Expected Behavior
   
   The socket file descriptor should close non-gracefully/"prematurely", 
forcing the write to terminate immediately.
   
   ### Current Behavior
   
   close() blocks until the OS forces the socket closed at the transport layer, 
causing the socket write to fail
   
   ### Reproduction Steps
   
   As mentioned in the [JDK Bug 
Report](https://bugs.openjdk.org/browse/JDK-8241239)
   
   1. establish a connection between two hosts/VMs, have the client side 
perform sizable writes (enough to fill up socket buffers etc.), the server just 
reads and discards.
   2. introduce a null route on either side (or otherwise prevent transmission 
of TCP acks from the server to the client) to force the client to attempt 
retransmits
   3. wait until you're stuck in a write() (check stack dumps), then call 
close() on the client-side socket.
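
   For the "check stack dumps" part of step 3, a small helper along these 
lines can list the threads currently inside a socket write. This is a sketch 
under assumptions: `StalledWriteFinder` is a hypothetical name, it inspects 
only the JVM it runs in, and it matches the `java.net.SocketOutputStream` 
`socketWrite` frames that we assume appear on JDK 11; newer JDKs use a 
different socket implementation and would need a different match. For an 
external process, a `jstack` dump serves the same purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for spotting threads that are inside a socket write.
final class StalledWriteFinder {

    // Assumption: on JDK 11 a blocking write shows up as
    // java.net.SocketOutputStream.socketWrite / socketWrite0.
    static boolean isSocketWrite(StackTraceElement frame) {
        return frame.getClassName().equals("java.net.SocketOutputStream")
                && frame.getMethodName().startsWith("socketWrite");
    }

    /** Returns the names of all threads whose stack contains a socket-write frame. */
    static List<String> threadsInSocketWrite(Map<Thread, StackTraceElement[]> dump) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (isSocketWrite(frame)) {
                    names.add(entry.getKey().getName());
                    break; // one match per thread is enough
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Inspect the current JVM; an empty list means no thread is in a socket write.
        System.out.println(threadsInSocketWrite(Thread.getAllStackTraces()));
    }
}
```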
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.4.1
   
   * Hive version : NA
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   
![spark-task-ui](https://github.com/apache/hudi/assets/7864088/f6f3d7c7-696a-4ace-b02a-1ec03b30e6a8)
   [debug.log](https://github.com/apache/hudi/files/15296862/debug.log)
   
   
   

