[
https://issues.apache.org/jira/browse/HADOOP-19347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901225#comment-17901225
]
Steve Loughran commented on HADOOP-19347:
-----------------------------------------
Full (sanitized) stack trace
{code}
ERROR : Job Commit failed with exception
'org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.hadoop.fs.s3a.AWSS3IOException:
rename
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000
to
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000.moved
on
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000:
org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteException:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)]: InternalError:
table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/: We
encountered an internal error. Please try again.
:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)])'
org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.hadoop.fs.s3a.AWSS3IOException: rename
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000
to
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000.moved
on
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000:
org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteException:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)]: InternalError:
table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/: We
encountered an internal error. Please try again.
:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)]
at
org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1632)
at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:797)
at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:802)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.close(TezTask.java:703)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:371)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:356)
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:329)
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:107)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:809)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:546)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:540)
at
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:190)
at
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:235)
at
org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:92)
at
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:340)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:360)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: rename
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000
to
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000.moved
on
s3a://bucket/hive/table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000:
org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteException:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)]: InternalError:
table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/: We
encountered an internal error. Please try again.
:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)]
at
org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteException.translateException(MultiObjectDeleteException.java:101)
at
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:347)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:124)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:163)
at
org.apache.hadoop.fs.s3a.impl.RenameOperation.removeSourceObjects(RenameOperation.java:623)
at
org.apache.hadoop.fs.s3a.impl.RenameOperation.completeActiveCopiesAndDeleteSources(RenameOperation.java:266)
at
org.apache.hadoop.fs.s3a.impl.RenameOperation.recursiveDirectoryRename(RenameOperation.java:456)
at
org.apache.hadoop.fs.s3a.impl.RenameOperation.execute(RenameOperation.java:291)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:2456)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$rename$6(S3AFileSystem.java:2305)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:445)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2776)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:2303)
at org.apache.hadoop.hive.ql.exec.Utilities.rename(Utilities.java:1174)
at
org.apache.hadoop.hive.ql.exec.Utilities.mvFileToFinalPath(Utilities.java:1551)
at
org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1617)
... 28 more
Caused by: org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteException:
[S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
Code=InternalError, Message=We encountered an internal error. Please try
again.)]
at
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:3186)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeysS3(S3AFileSystem.java:3422)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:3481)
at
org.apache.hadoop.fs.s3a.S3AFileSystem$OperationCallbacksImpl.removeKeys(S3AFileSystem.java:2558)
at
org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$removeSourceObjects$3(RenameOperation.java:625)
at org.apache.hadoop.fs.s3a.Invoker.lambda$once$0(Invoker.java:165)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
... 43 more
{code}
> AWS SDK deleteObjects() and S3Store.deleteObjects() don't handle 500 failures
> of individual objects
> ---------------------------------------------------------------------------------------------------
>
> Key: HADOOP-19347
> URL: https://issues.apache.org/jira/browse/HADOOP-19347
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.4.1
> Reporter: Steve Loughran
> Priority: Minor
>
> S3Store.deleteObjects() encountered a 500 error and didn't recover.
> We normally assume that 500 errors are already retried by the SDK, so our own
> retry logic doesn't bother with them.
> The root cause is that the 500 errors can surface within the bulk delete response.
> * The delete POST returns 200, so the SDK is happy,
> * but one of the rows in the response reports the S3Error "InternalError":
> {{Code=InternalError, Message=We encountered an internal error. Please try
> again.}}, as sketched below.
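> For illustration, a minimal standalone sketch of where those per-key errors
> show up when calling the AWS SDK v2 directly (this is not the S3A code;
> {{BulkDeleteSketch}}, {{bucket}} and {{keys}} are placeholders):
> {code}
> import java.util.List;
> import java.util.stream.Collectors;
> import software.amazon.awssdk.services.s3.S3Client;
> import software.amazon.awssdk.services.s3.model.Delete;
> import software.amazon.awssdk.services.s3.model.DeleteObjectsRequest;
> import software.amazon.awssdk.services.s3.model.DeleteObjectsResponse;
> import software.amazon.awssdk.services.s3.model.ObjectIdentifier;
> import software.amazon.awssdk.services.s3.model.S3Error;
>
> public class BulkDeleteSketch {
>
>   static void bulkDelete(S3Client s3, String bucket, List<String> keys) {
>     // The POST itself returns HTTP 200, so the SDK's retry policy never fires;
>     // per-key failures are only visible in the response's error list.
>     DeleteObjectsResponse response = s3.deleteObjects(DeleteObjectsRequest.builder()
>         .bucket(bucket)
>         .delete(Delete.builder()
>             .objects(keys.stream()
>                 .map(k -> ObjectIdentifier.builder().key(k).build())
>                 .collect(Collectors.toList()))
>             .build())
>         .build());
>
>     // A Code=InternalError entry here is effectively a per-object 500 that the
>     // SDK has treated as part of a successful response.
>     for (S3Error error : response.errors()) {
>       System.err.println("delete failed for " + error.key()
>           + ": " + error.code() + " " + error.message());
>     }
>   }
> }
> {code}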
> Proposed:
> * The bulk delete invoker must map "InternalError" to AWSStatus500Exception and
> throw that.
> * Add a retry policy for bulk deletes which considers AWSStatus500Exception
> as retriable. We currently don't retry, on the assumption that the SDK will,
> which it does for the request as a whole but clearly not for per-object errors
> inside a multi-object delete response (see the sketch after this list).
> * Maybe also consider the possibility that a partial 503 response could be
> generated, that is, only part of the delete throttled.
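> A rough standalone sketch of that retry behaviour (a hypothetical helper, not
> the actual S3A change, which would go through the Invoker and retry-policy
> machinery; it reuses the imports above plus {{java.io.IOException}} and
> {{java.util.ArrayList}}):
> {code}
> // Re-issue the bulk delete for keys whose per-object error code looks
> // retriable (500 InternalError, 503 SlowDown), since the SDK will not retry
> // errors embedded in a 200 response.
> static void deleteWithRetries(S3Client s3, String bucket, List<String> keys,
>     int maxAttempts) throws IOException {
>   List<String> remaining = new ArrayList<>(keys);
>   for (int attempt = 1; !remaining.isEmpty(); attempt++) {
>     DeleteObjectsResponse response = s3.deleteObjects(DeleteObjectsRequest.builder()
>         .bucket(bucket)
>         .delete(Delete.builder()
>             .objects(remaining.stream()
>                 .map(k -> ObjectIdentifier.builder().key(k).build())
>                 .collect(Collectors.toList()))
>             .build())
>         .build());
>     List<String> retriable = new ArrayList<>();
>     for (S3Error e : response.errors()) {
>       if ("InternalError".equals(e.code()) || "SlowDown".equals(e.code())) {
>         retriable.add(e.key());            // per-object 500/503: retry these
>       } else {
>         throw new IOException("non-retriable delete failure: " + e);
>       }
>     }
>     remaining = retriable;
>     if (!remaining.isEmpty() && attempt >= maxAttempts) {
>       throw new IOException("bulk delete still failing after " + attempt
>           + " attempts for keys " + remaining);
>     }
>     // A real implementation would apply the retry policy's backoff here.
>   }
> }
> {code}
> For reference, the relevant part of the stack trace: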
> {code}
> Caused by: org.apache.hadoop.fs.s3a.impl.MultiObjectDeleteException:
> [S3Error(Key=table/warehouse/tablespace/external/hive/table/-tmp.-ext-10000/file/,
> Code=InternalError, Message=We encountered an internal error. Please try
> again.)]
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:3186)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeysS3(S3AFileSystem.java:3422)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:3481)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem$OperationCallbacksImpl.removeKeys(S3AFileSystem.java:2558)
> at
> org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$removeSourceObjects$3(RenameOperation.java:625)
> at org.apache.hadoop.fs.s3a.Invoker.lambda$once$0(Invoker.java:165)
> at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
>
> {code}