xccui opened a new issue, #8554:
URL: https://github.com/apache/hudi/issues/8554
We hit S3 HTTP connection pool issues while running a Flink writer job,
which caused the connection pool in `StreamWriteOperatorCoordinator` to shut down.
After failure recovery, however, the connection pool is not reset. I think
we should reset the connection pool, along with some other resources,
during a failover so that the job does not get trapped in an unhealthy loop.
Our job kept restarting and throwing the following exception:
```
2023-04-24 03:42:43 [pool-25-thread-1] ERROR org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Executor executes action [initialize instant ] error
java.lang.IllegalStateException: Connection pool shut down
    at com.amazonaws.thirdparty.apache.http.util.Asserts.check(Asserts.java:34) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:269) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at jdk.internal.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.conn.$Proxy52.requestConnection(Unknown Source) ~[?:?]
    at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2545) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2533) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2513) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3776) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3554) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:137) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.table.HoodieFlinkTable.create(HoodieFlinkTable.java:62) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.client.HoodieFlinkTableServiceClient.createTable(HoodieFlinkTableServiceClient.java:172) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommit$afea71c0$1(BaseHoodieWriteClient.java:810) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:156) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommit(BaseHoodieWriteClient.java:809) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.sink.StreamWriteOperatorCoordinator.startInstant(StreamWriteOperatorCoordinator.java:399) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$initInstant$6(StreamWriteOperatorCoordinator.java:426) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130) ~[blob_p-abdf98cc6fdb80521c5886e97d0250884f55321b-5fd12d7a052c31efa7e4c3e5be67b915:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```
As a workaround, we force-killed the JobManager, after which the job
recovered successfully.
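To illustrate the kind of reset I have in mind, here is a minimal sketch. It is not Hudi code: `FakePooledClient`, `ResettableResource`, and `FailoverDemo` are hypothetical names made up for this example, and where such a reset would hook into the coordinator's failover path is an assumption on my part. The idea is only that a holder closes the broken instance and rebuilds it from a factory, instead of reusing a client whose pool has already been shut down.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a client backed by a connection pool. */
final class FakePooledClient implements AutoCloseable {
    private boolean closed = false;

    String request(String key) {
        if (closed) {
            // Mirrors the message seen in the stack trace above.
            throw new IllegalStateException("Connection pool shut down");
        }
        return "ok:" + key;
    }

    @Override
    public void close() {
        closed = true;
    }
}

/** Holds a closeable resource and rebuilds it from a factory on reset(). */
final class ResettableResource<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T current;

    ResettableResource(Supplier<T> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    synchronized T get() {
        return current;
    }

    /** Closes the old instance (best effort) and replaces it with a fresh one. */
    synchronized void reset() {
        try {
            current.close();
        } catch (Exception e) {
            // The old instance is being discarded anyway, so a failed close is ignored.
        }
        current = factory.get();
    }
}

final class FailoverDemo {
    public static void main(String[] args) {
        ResettableResource<FakePooledClient> client =
            new ResettableResource<>(FakePooledClient::new);

        client.get().close(); // simulate the pool being shut down by an earlier failure
        try {
            client.get().request("a");
        } catch (IllegalStateException e) {
            System.out.println("before reset: " + e.getMessage());
        }

        client.reset(); // what a failover hook could call
        System.out.println("after reset: " + client.get().request("a"));
    }
}
```

Running `FailoverDemo` prints `before reset: Connection pool shut down` followed by `after reset: ok:a`. Force-killing the JobManager has the same effect in a much heavier way, since every resource is rebuilt in a new process.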
**Environment Description**
* Hudi version : bdb50ddccc9631317dfb06a06abc38cbd3714ce8
* Flink version : 1.16.1
* Storage (HDFS/S3/GCS..) : S3