huaxiangsun opened a new issue, #8218:
URL: https://github.com/apache/iceberg/issues/8218
### Apache Iceberg version
1.0.0
### Query engine
Spark
### Please describe the bug 🐞
We are running into AWS S3 exceptions ("Unable to execute HTTP request: The
target server failed to respond") during table merge, which cause a large
number of task failures.
Here is our case in detail. We have one table with a certain number of
partitions. Initially, each partition has 1 DataFile of ~2 GB.
The table uses format v2, and we use merge-on-read for the daily table
update (table merge), with Spark's shuffled hash join and storage-partitioned
join enabled. All data is in AWS S3.
Each daily update contains some updates and some new inserts, but no deletes.
Each partition starts with a single DataFile0.
After day 1's table update (using table merge), it creates 1 DeleteFile-d1,
1 DataFile-d1-update (for updates) and 1 DataFile-d1-insert (for new inserts).
After day 2's table update (using table merge), it creates 1 DeleteFile-d2,
1 DataFile-d2-update (for updates) and 1 DataFile-d2-insert (for new inserts).
After day 3's table update (using table merge), it creates 1 DeleteFile-d3,
1 DataFile-d3-update (for updates) and 1 DataFile-d3-insert (for new inserts).
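The file growth described above can be sketched as simple arithmetic. This is only an illustration of the counts, not Iceberg code: the names `PartitionFiles`, `apply_daily_merge` and `files_after` are hypothetical, and the sketch assumes that nothing is compacted, rewritten or expired between the daily runs. It does not model which delete files a reader actually has to open for a given DataFile.

```python
from dataclasses import dataclass


@dataclass
class PartitionFiles:
    """File counts for one partition (hypothetical model, not an Iceberg type)."""
    data_files: int = 1    # the partition starts with DataFile0
    delete_files: int = 0  # no delete files before the first merge


def apply_daily_merge(state: PartitionFiles) -> PartitionFiles:
    """One daily merge adds 1 delete file, 1 update data file and 1 insert data file."""
    return PartitionFiles(
        data_files=state.data_files + 2,
        delete_files=state.delete_files + 1,
    )


def files_after(days: int) -> PartitionFiles:
    """Accumulated file counts after the given number of daily merges."""
    state = PartitionFiles()
    for _ in range(days):
        state = apply_daily_merge(state)
    return state
```

Under these assumptions, `files_after(3)` gives 7 data files per partition going into day 4's merge, so the number of files each task has to open grows linearly with the number of daily merges.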
All of these runs went fine, although we saw a very small number of the AWS
S3 exceptions mentioned above.
During day 4's table update (using table merge), after each executor finished
its first set of tasks and picked up a new set, the new tasks hit a huge
number of the following AWS S3 exceptions, which caused the tasks to fail. We
suspect this is related to the number of DataFiles and DeleteFiles, as there
are 7 DataFiles and 1 DeleteFile for one partition. But we could not figure
out the root cause, and we are not sure whether it is caused by the S3 backend
or by the AWS SDK client side.
The exception stack trace is below.
```
23/07/27 17:30:05 ERROR BaseReader: Error reading file(s): s3a://****
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute
HTTP request: The target server failed to respond
at
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:205)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:66)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:34)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:133)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:159)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:112)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:167)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:94)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55)
~[aws-core-2.15.40.jar:?]
at
software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5011)
~[s3-2.15.40.jar:?]
at
org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:75)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.parquet.ParquetIO$ParquetInputFile.getLength(ParquetIO.java:179)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:534)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:245)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:71)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:91)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:37)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:34)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:72)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.io.CloseableIterable$7$1.<init>(CloseableIterable.java:188)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:187)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.hasNext(CloseableIterable.java:257)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at java.lang.Iterable.forEach(Iterable.java:74) ~[?:?]
at
org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:138)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:132)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.data.DeleteFilter.applyPosDeletes(DeleteFilter.java:250)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.data.DeleteFilter.filter(DeleteFilter.java:154)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:92)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:42)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:135)
~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at
org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
~[spark-sql_2.12-3.3.0.48.jar]
at scala.Option.exists(Option.scala:376)
~[scala-library-2.12.15.jar:?]
at
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
~[spark-core_2.12-3.3.0.48.jar]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
~[scala-library-2.12.15.jar:?]
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) ~[?:?]
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
~[spark-sql_2.12-3.3.0.48.jar]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
~[scala-library-2.12.15.jar:?]
at
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
~[spark-sql_2.12-3.3.0.48.jar]
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
~[spark-core_2.12-3.3.0.48.jar]
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
~[spark-core_2.12-3.3.0.48.jar]
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
~[spark-core_2.12-3.3.0.48.jar]
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.scheduler.Task.run(Task.scala:136)
~[spark-core_2.12-3.3.0.48.jar]
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1513)
~[spark-core_2.12-3.3.0.48.jar]
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
~[spark-core_2.12-3.3.0.48.jar]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.http.NoHttpResponseException: The target server failed
to respond
at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
~[httpcore-4.4.14.jar:4.4.14]
at
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
~[httpcore-4.4.14.jar:4.4.14]
at
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
~[httpcore-4.4.14.jar:4.4.14]
at
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
~[httpcore-4.4.14.jar:4.4.14]
at
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
~[httpclient-4.5.13.jar:4.5.13]
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
~[httpclient-4.5.13.jar:4.5.13]
at
software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72)
~[apache-client-2.15.40.jar:?]
at
software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:253)
~[apache-client-2.15.40.jar:?]
at
software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:106)
~[apache-client-2.15.40.jar:?]
at
software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:232)
~[apache-client-2.15.40.jar:?]
at
software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:229)
~[apache-client-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:64)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:76)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:55)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:39)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:77)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:39)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
~[sdk-core-2.15.40.jar:?]
at
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:64)
~[sdk-core-2.15.40.jar:?]
... 78 more
```
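The trace above was hard-wrapped by the mail gateway, so most frames are split over two or three lines. The following is a minimal sketch for rejoining them so the call path is easier to follow; `unwrap_frames` and `method_names` are hypothetical helper names, and the sketch assumes the wrapping pattern seen here (a frame starts with `at`, and continuation lines never do).

```python
import re


def unwrap_frames(lines):
    """Rejoin stack frames that were wrapped across several lines.

    A frame starts at a line that is "at" or begins with "at "; following
    lines are treated as its continuation until the next frame, a
    "Caused by:" line, or a "... N more" line.
    """
    frames = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "at" or line.startswith("at "):
            if current is not None:
                frames.append(current)
            current = line
        elif line.startswith("Caused by:") or line.startswith("..."):
            if current is not None:
                frames.append(current)
            current = None
        elif current is not None and line:
            current = current + " " + line
    if current is not None:
        frames.append(current)
    return frames


def method_names(frames, prefix=""):
    """Extract fully qualified method names, optionally filtered by package prefix."""
    names = []
    for frame in frames:
        match = re.match(r"at\s+([^\s(]+)\(", frame)
        if match and match.group(1).startswith(prefix):
            names.append(match.group(1))
    return names
```

Applied to the trace above with `prefix="org.apache.iceberg."`, the rejoined frames show that the failing request is the `headObject` call issued from `S3InputFile.getLength` while the Parquet footer is read, and that this read is reached through `DeleteFilter.applyPosDeletes`. That appears to place the failure at the point where a position delete file is opened, on a pooled HTTP connection (`CPoolProxy` in the `Caused by` section).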