[ 
https://issues.apache.org/jira/browse/HADOOP-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183452#comment-17183452
 ] 

Steve Loughran commented on HADOOP-17190:
-----------------------------------------

I've looked more and can see the problem: clock skew between s3guard and S3 
timestamps causes the container localizer to fail:
{code}
2020-08-24 14:58:31,059 [IPC Server handler 3 on 65048] WARN  localizer.ResourceLocalizationService (ResourceLocalizationService.java:processHeartbeat(1152)) - { s3a://stevel-london/terasort-directory/sortout/_partition.lst, 1598277507027, FILE, null } failed: Resource s3a://stevel-london/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1598277507027, was 1598277507000
java.io.IOException: Resource s3a://stevel-london/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1598277507027, was 1598277507000
        at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:248)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:241)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:229)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

It's intermittent: it only happens when the timestamp added to the s3guard 
table when the upload completed differs from the one reported by the S3A FS.

This is the localizer being brittle to clock errors. Really, it needs a 
tolerance range within which it doesn't treat a file as changed.
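
To illustrate the kind of range check meant here, a minimal sketch follows. It is not the actual FSDownload code and not a proposed patch: the class name TimestampTolerance, the method withinTolerance and the toleranceMillis parameter are all hypothetical, and the log above only suggests that the current check is an exact match.

```java
/**
 * Hypothetical helper: treats two modification times as equal when they
 * differ by no more than a configured tolerance. Names are illustrative
 * only and do not exist in Hadoop.
 */
final class TimestampTolerance {

    private TimestampTolerance() {
        // static utility, no instances
    }

    /**
     * @param expectedMillis  timestamp recorded when the resource was registered
     * @param actualMillis    timestamp reported by the source filesystem
     * @param toleranceMillis largest difference to accept; 0 means exact match
     * @return true if the resource should be considered unchanged
     */
    static boolean withinTolerance(long expectedMillis,
                                   long actualMillis,
                                   long toleranceMillis) {
        if (toleranceMillis < 0) {
            throw new IllegalArgumentException(
                "tolerance must be >= 0: " + toleranceMillis);
        }
        // Epoch-millisecond values are far from Long overflow, so a plain
        // subtraction is safe here.
        return Math.abs(expectedMillis - actualMillis) <= toleranceMillis;
    }
}
```

With the values from the log, the two timestamps differ by 27 ms, so withinTolerance(1598277507027L, 1598277507000L, 0L) rejects the file, while any tolerance of 27 ms or more (for example 1000L) accepts it. What the tolerance should be, and whether it should be configurable, is left open here.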

> Intermittent ITestTerasortOnS3A.test_120_terasort failure
> ---------------------------------------------------------
>
>                 Key: HADOOP-17190
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17190
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.3.0
>            Reporter: Mukund Thakur
>            Priority: Minor
>
> [*INFO*] Running org.apache.hadoop.fs.s3a.commit.terasort.*ITestTerasortOnS3A*
> [*ERROR*] *Tests* *run: 14*, *Failures: 2*, Errors: 0, *Skipped: 2*, Time 
> elapsed: 110.43 s *<<< FAILURE!* - in 
> org.apache.hadoop.fs.s3a.commit.terasort.*ITestTerasortOnS3A*
> [*ERROR*] 
> test_120_terasort[directory](org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A)
>   Time elapsed: 6.261 s  <<< FAILURE!
> java.lang.AssertionError: 
> terasort(s3a://mthakur-data/terasort-directory/sortin, 
> s3a://mthakur-data/terasort-directory/sortout) failed expected:<0> but was:<1>
>  at 
> org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.executeStage(ITestTerasortOnS3A.java:241)
>  at 
> org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.test_120_terasort(ITestTerasortOnS3A.java:291)
>  
> [*ERROR*] 
> test_120_terasort[magic](org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A)
>   Time elapsed: 5.962 s  <<< FAILURE!
> java.lang.AssertionError: terasort(s3a://mthakur-data/terasort-magic/sortin, 
> s3a://mthakur-data/terasort-magic/sortout) failed expected:<0> but was:<1>
>  at 
> org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.executeStage(ITestTerasortOnS3A.java:241)
>  at 
> org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.test_120_terasort(ITestTerasortOnS3A.java:291)
>  


