[ https://issues.apache.org/jira/browse/HADOOP-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183452#comment-17183452 ]
Steve Loughran commented on HADOOP-17190:
-----------------------------------------
I've looked more and can see the problem: clock skew between s3guard and S3
timestamps causes the container localizer to fail
{code}
2020-08-24 14:58:31,059 [IPC Server handler 3 on 65048] WARN localizer.ResourceLocalizationService (ResourceLocalizationService.java:processHeartbeat(1152)) - { s3a://stevel-london/terasort-directory/sortout/_partition.lst, 1598277507027, FILE, null } failed: Resource s3a://stevel-london/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1598277507027, was 1598277507000
java.io.IOException: Resource s3a://stevel-london/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1598277507027, was 1598277507000
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:248)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:241)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:229)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
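It's intermittent, as it only happens when there's a mismatch between the time
the upload completed and a timestamp was added to the s3guard table, and the
timestamp reported by the S3A FS. In the log above the two values differ by
27 ms (expected 1598277507027, was 1598277507000).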
This is the localizer being brittle to clock errors. Really, it needs a
tolerance range within which a timestamp difference isn't treated as a changed
file.
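To make the "range value" idea concrete, here is a minimal sketch of a tolerant timestamp check. It is not the actual FSDownload code or a proposed patch: the class name TimestampTolerance, the method withinTolerance and the idea of passing the tolerance in milliseconds are all assumptions made purely for illustration.

```java
/**
 * Hypothetical helper, NOT part of Hadoop or YARN: compares two
 * modification times and accepts them when they differ by no more
 * than a caller-supplied tolerance.
 */
final class TimestampTolerance {

    private TimestampTolerance() {
        // static helper only
    }

    /**
     * @param expectedMillis  timestamp recorded when the resource was registered (ms since epoch)
     * @param actualMillis    timestamp reported by the filesystem at download time (ms since epoch)
     * @param toleranceMillis largest difference still treated as "unchanged"; must be >= 0
     * @return true if the two timestamps are within the tolerance of each other
     */
    static boolean withinTolerance(long expectedMillis, long actualMillis, long toleranceMillis) {
        if (toleranceMillis < 0) {
            throw new IllegalArgumentException("tolerance must be >= 0: " + toleranceMillis);
        }
        // Epoch-millisecond values are far from Long.MIN_VALUE/MAX_VALUE,
        // so the subtraction cannot overflow for realistic timestamps.
        return Math.abs(expectedMillis - actualMillis) <= toleranceMillis;
    }
}
```

With the values from the log, withinTolerance(1598277507027L, 1598277507000L, 1000L) is true because the difference is 27 ms, while a tolerance of 0 reproduces today's strict behaviour and rejects the file. What tolerance would be safe in practice is an open question this sketch doesn't answer.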
> Intermittent ITestTerasortOnS3A.test_120_terasort failure
> ---------------------------------------------------------
>
> Key: HADOOP-17190
> URL: https://issues.apache.org/jira/browse/HADOOP-17190
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.3.0
> Reporter: Mukund Thakur
> Priority: Minor
>
> [*INFO*] Running org.apache.hadoop.fs.s3a.commit.terasort.*ITestTerasortOnS3A*
> [*ERROR*] *Tests* *run: 14*, *Failures: 2*, Errors: 0, *Skipped: 2*, Time elapsed: 110.43 s *<<< FAILURE!* - in org.apache.hadoop.fs.s3a.commit.terasort.*ITestTerasortOnS3A*
> [*ERROR*] test_120_terasort[directory](org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A) Time elapsed: 6.261 s <<< FAILURE!
> java.lang.AssertionError: terasort(s3a://mthakur-data/terasort-directory/sortin, s3a://mthakur-data/terasort-directory/sortout) failed expected:<0> but was:<1>
> at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.executeStage(ITestTerasortOnS3A.java:241)
> at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.test_120_terasort(ITestTerasortOnS3A.java:291)
>
> [*ERROR*] test_120_terasort[magic](org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A) Time elapsed: 5.962 s <<< FAILURE!
> java.lang.AssertionError: terasort(s3a://mthakur-data/terasort-magic/sortin, s3a://mthakur-data/terasort-magic/sortout) failed expected:<0> but was:<1>
> at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.executeStage(ITestTerasortOnS3A.java:241)
> at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.test_120_terasort(ITestTerasortOnS3A.java:291)
>