[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-09-20 Thread Siddharth Seth (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934667#comment-16934667
 ] 

Siddharth Seth commented on HADOOP-16207:
-

Attached a simple patch which fixes just the test failures. It doesn't do anything 
about parallelism (changing dir names to be different across tests, etc.). I can 
submit this in a separate jira if this one is being used for parallelizing the 
tests.

> Fix ITestDirectoryCommitMRJob.testMRJob
> ---
>
> Key: HADOOP-16207
> URL: https://issues.apache.org/jira/browse/HADOOP-16207
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3, test
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Critical
> Attachments: HADOOP-16207.fixtestsonly.txt
>
>
> Reported failure of {{ITestDirectoryCommitMRJob}} in validation runs of 
> HADOOP-16186; assertIsDirectory with s3guard enabled and a parallel test run: 
> Path "is recorded as deleted by S3Guard"
> {code}
> waitForConsistency();
> assertIsDirectory(outputPath) /* here */
> {code}
> The file is there but there's a tombstone. Possibilities
> * some race condition with another test
> * tombstones aren't timing out
> * committers aren't creating that base dir in a way which cleans up S3Guard's 
> tombstones. 
> Remember: we do have to delete that dest dir before the committer runs unless 
> overwrite==true, so at the start of the run there will be a tombstone. It 
> should be overwritten by a success.
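
The tombstone reasoning in the description can be illustrated with a deliberately simplified model. This is only a sketch: the class TombstoneSketch and its methods are hypothetical and say nothing about how S3Guard is actually implemented. It shows the expected sequence only: deleting the destination directory records a tombstone, and a later successful write should replace it, so a check made after the job should not see the path as deleted.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a metadata store with tombstones; NOT S3Guard's real API. */
final class TombstoneSketch {
  enum State { PRESENT, TOMBSTONE }

  private final Map<String, State> entries = new HashMap<>();

  /** Deleting a path leaves a tombstone entry behind. */
  void delete(String path) {
    entries.put(path, State.TOMBSTONE);
  }

  /** A successful write replaces whatever entry was there, tombstone included. */
  void put(String path) {
    entries.put(path, State.PRESENT);
  }

  /** True if the store still believes the path was deleted. */
  boolean isRecordedAsDeleted(String path) {
    return entries.get(path) == State.TOMBSTONE;
  }

  public static void main(String[] args) {
    TombstoneSketch store = new TombstoneSketch();
    store.delete("/out");                                  // cleanup before the job
    System.out.println(store.isRecordedAsDeleted("/out")); // true
    store.put("/out");                                     // job commit recreates it
    System.out.println(store.isRecordedAsDeleted("/out")); // false
  }
}
```

In this model the reported failure corresponds to the second check still returning true, which would point at either a missing put or a concurrent delete from another test.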



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-09-19 Thread Siddharth Seth (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934024#comment-16934024
 ] 

Siddharth Seth commented on HADOOP-16207:
-

Also, to run the tests in parallel, the jobs need to start using different 
directory names. Currently, all of them use testMRJob (the method name in the 
common class that all tests inherit from).
The issue with the local dir conflict is an MR configuration issue afaik (likely 
the MR tmp dir config property). YARN clusters should already be able to run in 
parallel (different ports, random dir names, etc.).
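
One way to give each job its own directory is sketched below. The helper TestDirNames and its method are hypothetical names, not part of the Hadoop test code; the sketch only assumes that the concrete test class is known when the path is built.

```java
import java.util.UUID;

/** Hypothetical helper: builds a directory name that differs per test class and run. */
final class TestDirNames {
  private TestDirNames() {}

  /**
   * Combines the concrete test class name, the method name and a short
   * random suffix, so subclasses sharing an inherited test method no
   * longer end up in the same directory.
   */
  static String uniqueDirName(Class<?> testClass, String methodName) {
    String suffix = UUID.randomUUID().toString().substring(0, 8);
    return testClass.getSimpleName() + "-" + methodName + "-" + suffix;
  }

  public static void main(String[] args) {
    // Prints something of the form TestDirNames-testMRJob-<8 hex chars>
    System.out.println(uniqueDirName(TestDirNames.class, "testMRJob"));
  }
}
```

The class name alone would separate the four committer tests; the random suffix additionally keeps repeated or overlapping runs of the same test apart.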




[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-09-19 Thread Siddharth Seth (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933718#comment-16933718
 ] 

Siddharth Seth commented on HADOOP-16207:
-

Seeing several MR job failures when running tests on HADOOP-16445.

{code}
[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
{code}
These tests always fail when run with -Ds3guard -Ddynamo -Dauth. (They fail when 
starting with a clean DDB table as well.)

The test setup seems broken to me.
* Cluster setup happens with createCluster(new JobConf()).
* After this, AbstractITCommitMRJob creates the MR job with 
Job.getInstance(getClusterBinding().getConf() ... -> This will end up using the 
previously created JobConf.
* JobConf will only read core-site.xml, so the command line parameters 
-Ds3guard, -Ddynamo and -Dauth don't make a difference.

Adding fs.s3a.metadatastore.authoritative=true, 
fs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
 in auth-keys.xml or core-site.xml fixed all the test failures for me. (With 
the additions, the JobConf used by the cluster has these configs, and the tests 
do what they're supposed to).

That isn't the correct fix though. Making sure the test configuration is used 
to create the JobConf for the cluster and jobs would allow the test properties 
to work.
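
The difference between the current setup and the suggested fix can be sketched with a toy model. Here java.util.Properties stands in for Hadoop's Configuration and JobConf, and the class and method names are hypothetical. In real Hadoop code the equivalent step would presumably be the JobConf constructor that takes a Configuration; that mapping is an assumption of this sketch, not something taken from the patch.

```java
import java.util.Properties;

/** Toy model: java.util.Properties stands in for Hadoop's Configuration/JobConf. */
final class ConfPropagationSketch {
  private ConfPropagationSketch() {}

  /** Models createCluster(new JobConf()): only site-file values are visible. */
  static Properties clusterConfFromSiteOnly(Properties siteFile) {
    Properties conf = new Properties();
    conf.putAll(siteFile);
    return conf;
  }

  /** Models the suggested fix: copy every setting of the test configuration. */
  static Properties clusterConfFromTestConf(Properties testConf) {
    Properties conf = new Properties();
    conf.putAll(testConf);
    return conf;
  }

  public static void main(String[] args) {
    String key = "fs.s3a.metadatastore.authoritative";
    Properties site = new Properties();   // what core-site.xml provides (empty here)
    Properties test = new Properties();
    test.putAll(site);
    // Setting assumed to be applied by the test run when -Dauth is passed.
    test.setProperty(key, "true");

    System.out.println(clusterConfFromSiteOnly(site).getProperty(key)); // null
    System.out.println(clusterConfFromTestConf(test).getProperty(key)); // true
  }
}
```

In the first case the cluster and its jobs never see the override, which matches the observation that the command line switches made no difference.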

That said, I did see 3 empty (and marked as deleted) files - part_, 
part_0001, _SUCCESS in the s3guard table. I suspect this is a result of the 
committer trying to access a file on the client, getting a cached FileSystem 
instance (same UGI), and the getFileStatus (maybe) creates these S3Guard DDB 
entries?




[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-07-18 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887945#comment-16887945
 ] 

Steve Loughran commented on HADOOP-16207:
-

The staging problem is fixed by MAPREDUCE-6521, and as it only seems to surface 
when the cluster FS is local (unconfirmed), it's not likely to be the cause 
of the previous failures (when HDFS was used as the cluster FS).

And, given it seems to be a race condition, it doesn't explain why we'd see 
failures during sequential test runs.




[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-07-17 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887471#comment-16887471
 ] 

Steve Loughran commented on HADOOP-16207:
-

Working on this. Finally got a log. And (currently) /tmp/hadoop-yarn/staging/ 
doesn't exist.

Assumption: all the miniYarnClusters are sharing the same /tmp staging dir, so 
that when one is shut down while another is running, the second one fails as all 
its staging files go away. In which case yes, it is a race condition. At least 
this time.

{code}
(TaskAttemptListenerImpl.java:fatalError(288)) - Task: attempt_1563401248365_0003_m_00_0 - exited : java.io.FileNotFoundException: File file:/tmp/hadoop-yarn/staging/stevel/.staging/job_1563401248365_0003/job.split does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:153)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:354)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:917)
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:362)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
{code}
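
If the shared staging directory assumption above holds, one possible mitigation is to give every mini cluster its own staging root. The sketch below only builds the override to apply; StagingDirSketch and stagingOverride are hypothetical names. It assumes the staging root is controlled by yarn.app.mapreduce.am.staging-dir, whose default of /tmp/hadoop-yarn/staging matches the path in the log, but whether overriding it is sufficient for these tests is untested here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: one private staging directory per mini cluster. */
final class StagingDirSketch {
  /** MapReduce staging root property; assumed to be the relevant setting. */
  static final String STAGING_DIR_KEY = "yarn.app.mapreduce.am.staging-dir";

  private StagingDirSketch() {}

  /**
   * Creates a fresh local temp directory and returns the configuration
   * override that points the staging root at it. The directory is not
   * deleted here; cleanup is left to the caller.
   */
  static Map<String, String> stagingOverride(String clusterName) {
    try {
      Path dir = Files.createTempDirectory(clusterName + "-staging-");
      Map<String, String> overrides = new HashMap<>();
      overrides.put(STAGING_DIR_KEY, dir.toUri().toString());
      return overrides;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Two clusters get two distinct staging roots.
    System.out.println(stagingOverride("clusterA"));
    System.out.println(stagingOverride("clusterB"));
  }
}
```

Each returned entry would be set on the configuration used to create the corresponding cluster, so shutting one cluster down could no longer remove another cluster's job.split.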




[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-05-16 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841394#comment-16841394
 ] 

Steve Loughran commented on HADOOP-16207:
-

Suspecting a race condition involving more than one test. If we isolate paths, this should go away.




[jira] [Commented] (HADOOP-16207) Fix ITestDirectoryCommitMRJob.testMRJob

2019-04-24 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825115#comment-16825115
 ] 

Steve Loughran commented on HADOOP-16207:
-

Fix HADOOP-16184 and, provided this is a non-auth test run, this will act as a 
regression test to make sure the fix works in real situations.
