[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences

2019-10-09 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947525#comment-16947525
 ] 

Steve Loughran commented on HADOOP-16644:
-

yeah, I'd just seen that too, it comes back in the metadata. I just need to 
pass it in through the finishedWrite.

My initial PR always does the HEAD on a non-dir PUT; we can enhance that. 
There's a risk for overwrites the HEAD returns the previous version. If we have 
the version ID all is good, but if not we can use the etag to verify  we have 
the right value -we'd have to retry to get the new one. And as we know, those 
load balancers can cache for many seconds.

regarding localisation and credentials, see HADOOP-16233 -we have to mark the 
status entries as encrypted so the shared cache is not used (it checks for 
"world readable and ! encrypted for the shared cache). With that patch in, the 
localisation is done as the user, and uses their DT.

I believe that this will then use the jobconf -we would have to check.



> Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
> 
>
> Key: HADOOP-16644
> URL: https://issues.apache.org/jira/browse/HADOOP-16644
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3, test
>Affects Versions: 3.3.0
> Environment: -Dparallel-tests -DtestsThreadCount=8 
> -Dfailsafe.runOrder=balanced -Ds3guard -Ddynamo -Dscale
> h2. Hypothesis:
> the timestamp of the source file is being picked up from S3Guard, but when 
> the NM does a getFileStatus call, a HEAD check is made -and this (due to the 
> overloaded test system) is out of sync with the listing. S3Guard is updated, 
> the corrected date returned and the localisation fails.
>Reporter: Steve Loughran
>Priority: Major
>
> Terasort of directory committer failing in resource localisaton -the 
> partitions.lst file has a different TS from that expected
> Happens under loaded integration tests (threads = 8; not standalone); 
> non-auth s3guard
> {code}
> 2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN  
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:processHeartbeat(1150)) - { 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst, 
> 1570531828143, FILE, null } failed: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> java.io.IOException: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences

2019-10-08 Thread Siddharth Seth (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947209#comment-16947209
 ] 

Siddharth Seth commented on HADOOP-16644:
-

Looks like a PUTRequest gives back the modification time, a multipart upload 
does not. Given a multipart upload is likely a long operation anyway - a HEAD 
request following a MultiPartComplete call likely doesn't add a large 
percentage to the operation time (only is S3Guard enabled). For a direct PUT - 
we have the data anyway. Will definitely make me happy to avoid writing to DDB 
during a getSTatus operation.

Using S3 for resource localization - that's got at least one issue which I'm 
aware of. Need to test this, and then file a YARN jira. Essentially - I suspect 
the localizer does not use the JobClient config - so any credentials there will 
not be available to YARN for localization (e.g. client sets up access_key and 
secret_key in config).

> Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
> 
>
> Key: HADOOP-16644
> URL: https://issues.apache.org/jira/browse/HADOOP-16644
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3, test
>Affects Versions: 3.3.0
> Environment: -Dparallel-tests -DtestsThreadCount=8 
> -Dfailsafe.runOrder=balanced -Ds3guard -Ddynamo -Dscale
> h2. Hypothesis:
> the timestamp of the source file is being picked up from S3Guard, but when 
> the NM does a getFileStatus call, a HEAD check is made -and this (due to the 
> overloaded test system) is out of sync with the listing. S3Guard is updated, 
> the corrected date returned and the localisation fails.
>Reporter: Steve Loughran
>Priority: Major
>
> Terasort of directory committer failing in resource localisaton -the 
> partitions.lst file has a different TS from that expected
> Happens under loaded integration tests (threads = 8; not standalone); 
> non-auth s3guard
> {code}
> 2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN  
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:processHeartbeat(1150)) - { 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst, 
> 1570531828143, FILE, null } failed: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> java.io.IOException: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences

2019-10-08 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946787#comment-16946787
 ] 

Steve Loughran commented on HADOOP-16644:
-

We really need a way of getting that FS timestamp off the store. I am 
"reluctant" to do it in a HEAD straight after the create, but it is the only 
way to guarantee consistency. Doing the head/update during the PUT would also 
address HADOOP-16412 (etag and version) and keep [~sseth] happy.

+![~gabor.bota], [~fabbri]

*we could always think about making that HEAD/PUT async, though that could lead 
to even more inconsistency pain.



> Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
> 
>
> Key: HADOOP-16644
> URL: https://issues.apache.org/jira/browse/HADOOP-16644
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3, test
>Affects Versions: 3.3.0
> Environment: -Dparallel-tests -DtestsThreadCount=8 
> -Dfailsafe.runOrder=balanced -Ds3guard -Ddynamo -Dscale
> h2. Hypothesis:
> the timestamp of the source file is being picked up from S3Guard, but when 
> the NM does a getFileStatus call, a HEAD check is made -and this (due to the 
> overloaded test system) is out of sync with the listing. S3Guard is updated, 
> the corrected date returned and the localisation fails.
>Reporter: Steve Loughran
>Priority: Major
>
> Terasort of directory committer failing in resource localisaton -the 
> partitions.lst file has a different TS from that expected
> Happens under loaded integration tests (threads = 8; not standalone); 
> non-auth s3guard
> {code}
> 2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN  
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:processHeartbeat(1150)) - { 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst, 
> 1570531828143, FILE, null } failed: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> java.io.IOException: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16644) Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences

2019-10-08 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946781#comment-16946781
 ] 

Steve Loughran commented on HADOOP-16644:
-

{code}
2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN  
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:processHeartbeat(1150)) - { 
s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst, 
1570531828143, FILE, null } failed: Resource 
s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst changed 
on src filesystem (expected 1570531828143, was 1570531828000
java.io.IOException: Resource 
s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst changed 
on src filesystem (expected 1570531828143, was 1570531828000
at 
org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:248)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:241)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:229)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

{code}

> Intermittent failure of ITestS3ATerasortOnS3A: timestamp differences
> 
>
> Key: HADOOP-16644
> URL: https://issues.apache.org/jira/browse/HADOOP-16644
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3, test
>Affects Versions: 3.3.0
> Environment: -Dparallel-tests -DtestsThreadCount=8 
> -Dfailsafe.runOrder=balanced -Ds3guard -Ddynamo -Dscale
> h2. Hypothesis:
> the timestamp of the source file is being picked up from S3Guard, but when 
> the NM does a getFileStatus call, a HEAD check is made -and this (due to the 
> overloaded test system) is out of sync with the listing. S3Guard is updated, 
> the corrected date returned and the localisation fails.
>Reporter: Steve Loughran
>Priority: Major
>
> Terasort of directory committer failing in resource localisaton -the 
> partitions.lst file has a different TS from that expected
> Happens under loaded integration tests (threads = 8; not standalone); 
> non-auth s3guard
> {code}
> 2019-10-08 11:50:29,774 [IPC Server handler 4 on 55983] WARN  
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:processHeartbeat(1150)) - { 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst, 
> 1570531828143, FILE, null } failed: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> java.io.IOException: Resource 
> s3a://hwdev-steve-ireland-new/terasort-directory/sortout/_partition.lst 
> changed on src filesystem (expected 1570531828143, was 1570531828000
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org