[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
[ https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947525#comment-16947525 ]

Steve Loughran commented on HADOOP-16644:
-----------------------------------------

Yeah, I'd just seen that too: it comes back in the metadata. I just need to pass it in through the finishedWrite. My initial PR always does the HEAD on a non-dir PUT; we can enhance that.

There's a risk that, for overwrites, the HEAD returns the previous version. If we have the version ID all is good, but if not we can use the etag to verify we have the right value; we'd have to retry to get the new one. And as we know, those load balancers can cache for many seconds.

Regarding localisation and credentials, see HADOOP-16233: we have to mark the status entries as encrypted so the shared cache is not used (it checks for "world readable and !encrypted" for the shared cache). With that patch in, the localisation is done as the user, and uses their DT (delegation token). I believe that this will then use the jobconf; we would have to check.

> Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-16644
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16644
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, test
>    Affects Versions: 3.3.0
>         Environment: -Dparallel-tests -DtestsThreadCount=8 -Dfailsafe.runOrder=balanced -Ds3guard -Ddynamo -Dscale
> h2. Hypothesis:
> the timestamp of the source file is being picked up from S3Guard, but when
> the NM does a getFileStatus call, a HEAD check is made - and this (due to the
> overloaded test system) is out of sync with the listing. S3Guard is updated,
> the corrected date returned and the localisation fails.
>            Reporter: Steve Loughran
>            Priority: Major
>
> Terasort of directory committer failing in resource localisation - the
> partitions.lst file has a different TS from that expected.
> Happens under loaded integration tests (threads = 8; not standalone);
> non-auth s3guard
> {code}
> 2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN
> localizer.ResourceLocalizationService
> (ResourceLocalizationService.java:processHeartbeat(1150)) - {
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst,
> 1570531828143, FILE, null } failed: Resource
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst
> changed on src filesystem (expected 1570531828143, was 1570531828000
> java.io.IOException: Resource
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst
> changed on src filesystem (expected 1570531828143, was 1570531828000
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
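The etag check mentioned in the comment above (repeat the HEAD until it reports the object that was just written) could look roughly like the sketch below. This is an illustration only, not the S3A implementation: the class `EtagVerifiedHead`, the `HeadResult` holder, and the `head` supplier standing in for a real HEAD request are all hypothetical names, and the attempt limit and sleep interval are arbitrary assumptions.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of S3A. */
final class EtagVerifiedHead {

  /** Minimal stand-in for the fields of interest in a HEAD response. */
  static final class HeadResult {
    final String etag;
    final long lastModified;

    HeadResult(String etag, long lastModified) {
      this.etag = etag;
      this.lastModified = lastModified;
    }
  }

  /**
   * Repeat the HEAD until it reports the etag that the PUT returned,
   * or until the attempt limit is reached.
   *
   * @param head         assumed callback that issues one HEAD request
   * @param expectedEtag etag returned by the PUT
   * @return the matching result, or empty if only stale data was seen
   */
  static Optional<HeadResult> headUntilEtagMatches(
      Supplier<HeadResult> head, String expectedEtag,
      int maxAttempts, long sleepMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      HeadResult result = head.get();
      if (result != null && expectedEtag.equals(result.etag)) {
        return Optional.of(result);
      }
      if (attempt < maxAttempts) {
        try {
          // A cached or stale response was returned; wait before retrying.
          Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
          // Preserve the interrupt and give up.
          Thread.currentThread().interrupt();
          return Optional.empty();
        }
      }
    }
    return Optional.empty();
  }
}
```

When a version ID is available the retry is unnecessary, as the comment notes; the loop only covers the overwrite case where the HEAD may still describe the previous object.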
[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
[ https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947209#comment-16947209 ]

Siddharth Seth commented on HADOOP-16644:
-----------------------------------------

Looks like a PUTRequest gives back the modification time; a multipart upload does not. Given that a multipart upload is likely a long operation anyway, a HEAD request following a MultiPartComplete call likely doesn't add a large percentage to the operation time (and only if S3Guard is enabled). For a direct PUT, we have the data anyway. It will definitely make me happy to avoid writing to DDB during a getStatus operation.

Using S3 for resource localization has at least one issue which I'm aware of. I need to test this, and then file a YARN jira. Essentially, I suspect the localizer does not use the JobClient config, so any credentials there will not be available to YARN for localization (e.g. the client sets up access_key and secret_key in its config).
[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
[ https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946787#comment-16946787 ]

Steve Loughran commented on HADOOP-16644:
-----------------------------------------

We really need a way of getting that FS timestamp off the store. I am "reluctant" to do it in a HEAD straight after the create, but it is the only way to guarantee consistency.

Doing the HEAD/update during the PUT would also address HADOOP-16412 (etag and version) and keep [~sseth] happy.

+ [~gabor.bota], [~fabbri]

* We could always think about making that HEAD/PUT async, though that could lead to even more inconsistency pain.
[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
[ https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946781#comment-16946781 ]

Steve Loughran commented on HADOOP-16644:
-----------------------------------------

{code}
2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN  localizer.ResourceLocalizationService (ResourceLocalizationService.java:processHeartbeat(1150)) - { s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst, 1570531828143, FILE, null } failed: Resource s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1570531828143, was 1570531828000
java.io.IOException: Resource s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1570531828143, was 1570531828000
	at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:248)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:241)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:229)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
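The two timestamps in the log differ only in their millisecond part (1570531828143 against 1570531828000), which is consistent with one value carrying millisecond precision and the other having been reported at whole-second granularity. The sketch below shows why an exact-match comparison rejects such a pair. It is a simplified stand-in, not the FSDownload code: the class `TimestampCheck` and its methods are hypothetical names chosen for illustration.

```java
/** Hypothetical illustration of an exact-match timestamp check. */
final class TimestampCheck {

  /** True only if the two millisecond timestamps are identical. */
  static boolean unchanged(long expectedMillis, long actualMillis) {
    return expectedMillis == actualMillis;
  }

  /** Drop the sub-second part of a millisecond timestamp. */
  static long truncateToSeconds(long millis) {
    return (millis / 1000L) * 1000L;
  }

  public static void main(String[] args) {
    long expected = 1570531828143L; // value the localizer was told to expect
    long actual = 1570531828000L;   // value seen when the file was re-checked

    // The raw values differ, so an exact comparison reports a change.
    System.out.println("exact match: " + unchanged(expected, actual));

    // Truncated to whole seconds, both values describe the same instant.
    System.out.println("match at second granularity: "
        + unchanged(truncateToSeconds(expected), truncateToSeconds(actual)));
  }
}
```

Comparing at second granularity is shown only to explain the mismatch; the comments in this thread instead propose recording the store's own timestamp at write time so that both sides see the same value.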