[
https://issues.apache.org/jira/browse/FLINK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180517#comment-16180517
]
Vishnu Viswanath commented on FLINK-4660:
-----------------------------------------
In which version is this fixed? I am using 1.3.1 and getting a similar exception
when reading an input split from S3.
{code}
2017-09-26 08:47:27,220 INFO org.apache.flink.api.common.io.LocatableInputSplitAssigner - Assigning remote split to host ip-10-150-98-185
2017-09-26 08:47:27,344 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN DataSource (at .......Job$$anonfun$main$4$$anonfun$apply$3.apply(Job.scala:138) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at ......sources.SourceSelector$.selectSource(SourceSelector.scala:17)) -> Map (from: ....) (6/8) (df8e44219270f80170e6d027b77b246f) switched from RUNNING to FAILED.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:972)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:601)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4137)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1346)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.openIfNeeded(S3AInputStream.java:43)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:137)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:72)
    at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:669)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:48)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy16.get(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1115)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
    ... 19 more
2017-09-26 08:47:27,345 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Job_at_09/26/2017_08:44:08 (74a0b9f0eab746705ad88817849e5c4b) switched from state RUNNING to FAILING.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:972)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:601)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4137)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1346)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.openIfNeeded(S3AInputStream.java:43)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:137)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:72)
    at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:669)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:48)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy16.get(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1115)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
    ... 19 more
{code}
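A mitigation that is sometimes applied while the underlying leak remains unfixed (an assumption on my part, not a confirmed fix for this ticket, and it assumes the stock S3A configuration keys apply to the custom S3A build in the trace above) is to raise the S3A client's HTTP connection pool ceiling in core-site.xml:
{code:xml}
<!-- core-site.xml: raises the S3A HTTP connection pool limit
     (the default is small in older Hadoop releases, e.g. 15).
     Note this only delays pool exhaustion if connections leak. -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>128</value>
</property>
{code}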
> HadoopFileSystem (with S3A) may leak connections, causing the job to get stuck
> in a restarting loop
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-4660
> URL: https://issues.apache.org/jira/browse/FLINK-4660
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Reporter: Zhenzhong Xu
> Priority: Critical
> Attachments: Screen Shot 2016-09-20 at 2.49.14 PM.png, Screen Shot
> 2016-09-20 at 2.49.32 PM.png
>
>
> A Flink job with checkpoints enabled and configured to use the S3A file system
> backend sometimes experiences checkpointing failures due to S3 consistency
> issues. This behavior has also been reported by others and is documented in
> https://issues.apache.org/jira/browse/FLINK-4218.
> The problem is magnified by the current HadoopFileSystem implementation, which
> can leak S3 client connections and eventually drive the job into a restarting
> loop with a "Timeout waiting for connection from pool" exception thrown by the
> AWS client.
> Looking at the code, HadoopFileSystem.java never invokes close() on the fs
> object upon failure, yet the FileSystem may be re-initialized every time the
> job is restarted.
> Some evidence I observed:
> 1. When I set the connection pool limit to 128, the command output below shows
> 128 connections stuck in the CLOSE_WAIT state.
> !Screen Shot 2016-09-20 at 2.49.14 PM.png|align=left, vspace=5!
> 2. Task manager logs indicate that the state backend file system is
> consistently re-initialized upon each job restart.
> !Screen Shot 2016-09-20 at 2.49.32 PM.png!
> 3. The log shows an NPE during cleanup of a stream task, caused by the
> "Timeout waiting for connection from pool" exception when trying to create a
> directory in the S3 bucket:
> 2016-09-02 08:17:50,886 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during cleanup of stream task
> java.lang.NullPointerException
>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.cleanup(OneInputStreamTask.java:73)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:323)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:589)
>     at java.lang.Thread.run(Thread.java:745)
> 4. It appears that, from invoking the checkpointing operation through handling
> its failure, StreamTask has no logic for closing the Hadoop FileSystem object
> (which internally holds the S3 AWS client) that resides in HadoopFileSystem.java.
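> The gap described in point 4 amounts to a close() that is not guaranteed to run
> on the failure path. A minimal sketch of the fix pattern, using a hypothetical
> stand-in Closeable rather than the real HadoopFileSystem wiring (which is not
> shown in this ticket): try-with-resources releases the pooled connections even
> when the read throws.
> {code:java}
> import java.io.Closeable;
> import java.io.IOException;
>
> public class CloseOnFailure {
>     // Hypothetical stand-in for the wrapped org.apache.hadoop.fs.FileSystem.
>     static class PooledFs implements Closeable {
>         boolean closed = false;
>         void read() throws IOException {
>             throw new IOException("simulated S3 read failure");
>         }
>         @Override
>         public void close() {
>             closed = true; // would release pooled HTTP connections
>         }
>     }
>
>     public static void main(String[] args) {
>         PooledFs fs = new PooledFs();
>         // try-with-resources guarantees close() even when read() throws,
>         // which is exactly what the leaking failure path omits.
>         try (PooledFs f = fs) {
>             f.read();
>         } catch (IOException e) {
>             System.out.println("task failed: " + e.getMessage());
>         }
>         System.out.println("fs closed: " + fs.closed);
>     }
> }
> {code}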
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)