[
https://issues.apache.org/jira/browse/FLINK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180517#comment-16180517
]
Vishnu Viswanath commented on FLINK-4660:
-----------------------------------------
In which version is this fixed? I am using 1.3.1 and getting a similar exception
when reading an input split from S3.
{code}
2017-09-26 08:47:27,220 INFO org.apache.flink.api.common.io.LocatableInputSplitAssigner - Assigning remote split to host ip-10-150-98-185
2017-09-26 08:47:27,344 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN DataSource (at .......Job$$anonfun$main$4$$anonfun$apply$3.apply(Job.scala:138) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at ......sources.SourceSelector$.selectSource(SourceSelector.scala:17)) -> Map (from: ....) (6/8) (df8e44219270f80170e6d027b77b246f) switched from RUNNING to FAILED.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:972)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:601)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4137)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1346)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.openIfNeeded(S3AInputStream.java:43)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:137)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:72)
    at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:669)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:48)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy16.get(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1115)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
    ... 19 more
2017-09-26 08:47:27,345 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Job_at_09/26/2017_08:44:08 (74a0b9f0eab746705ad88817849e5c4b) switched from state RUNNING to FAILING.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:972)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:601)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4137)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1346)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.openIfNeeded(S3AInputStream.java:43)
    at io.grhodes.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:137)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:72)
    at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:669)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
    at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:48)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy16.get(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1115)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
    ... 19 more
{code}
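A mitigation that is sometimes applied while the underlying leak remains unfixed (an assumption on my part, not a confirmed fix for this ticket, and it assumes the stock S3A configuration keys apply to the custom S3A build in the trace above) is to raise the S3A client's HTTP connection pool ceiling in core-site.xml:
{code:xml}
<!-- core-site.xml: raises the S3A HTTP connection pool limit
     (the default is small in older Hadoop releases, e.g. 15).
     Note this only delays pool exhaustion if connections leak. -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>128</value>
</property>
{code}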
> HadoopFileSystem (with S3A) may leak connections, causing the job to get stuck
> in a restarting loop
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-4660
> URL: https://issues.apache.org/jira/browse/FLINK-4660
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Reporter: Zhenzhong Xu
> Priority: Critical
> Attachments: Screen Shot 2016-09-20 at 2.49.14 PM.png, Screen Shot
> 2016-09-20 at 2.49.32 PM.png
>
>
> A Flink job with checkpoints enabled and configured to use the S3A file system
> backend sometimes experiences checkpointing failures due to S3 consistency
> issues. This behavior has also been reported by others and is documented in
> https://issues.apache.org/jira/browse/FLINK-4218.
> The problem is magnified by the current HadoopFileSystem implementation, which
> can leak S3 client connections and eventually drive the job into a restarting
> loop with a "Timeout waiting for connection from pool" exception thrown by the
> AWS client.
> Looking at the code, HadoopFileSystem.java never invokes close() on the fs
> object upon failure, yet the FileSystem may be re-initialized every time the
> job is restarted.
> Some evidence I observed:
> 1. When I set the connection pool limit to 128, the command output below shows
> 128 connections stuck in the CLOSE_WAIT state.
> !Screen Shot 2016-09-20 at 2.49.14 PM.png|align=left, vspace=5!
> 2. Task manager logs indicate that the state backend file system is
> consistently re-initialized upon each job restart.
> !Screen Shot 2016-09-20 at 2.49.32 PM.png!
> 3. The log shows an NPE during cleanup of a stream task, caused by the
> "Timeout waiting for connection from pool" exception when trying to create a
> directory in the S3 bucket:
> 2016-09-02 08:17:50,886 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during cleanup of stream task
> java.lang.NullPointerException
>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.cleanup(OneInputStreamTask.java:73)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:323)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:589)
>     at java.lang.Thread.run(Thread.java:745)
> 4. It appears that, from invoking the checkpointing operation through handling
> its failure, StreamTask has no logic for closing the Hadoop FileSystem object
> (which internally holds the S3 AWS client) that resides in HadoopFileSystem.java.
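> The gap described in point 4 amounts to a close() that is not guaranteed to run
> on the failure path. A minimal sketch of the fix pattern, using a hypothetical
> stand-in Closeable rather than the real HadoopFileSystem wiring (which is not
> shown in this ticket): try-with-resources releases the pooled connections even
> when the read throws.
> {code:java}
> import java.io.Closeable;
> import java.io.IOException;
>
> public class CloseOnFailure {
>     // Hypothetical stand-in for the wrapped org.apache.hadoop.fs.FileSystem.
>     static class PooledFs implements Closeable {
>         boolean closed = false;
>         void read() throws IOException {
>             throw new IOException("simulated S3 read failure");
>         }
>         @Override
>         public void close() {
>             closed = true; // would release pooled HTTP connections
>         }
>     }
>
>     public static void main(String[] args) {
>         PooledFs fs = new PooledFs();
>         // try-with-resources guarantees close() even when read() throws,
>         // which is exactly what the leaking failure path omits.
>         try (PooledFs f = fs) {
>             f.read();
>         } catch (IOException e) {
>             System.out.println("task failed: " + e.getMessage());
>         }
>         System.out.println("fs closed: " + fs.closed);
>     }
> }
> {code}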
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)