[
https://issues.apache.org/jira/browse/HADOOP-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925351#comment-15925351
]
Sean Mackrory edited comment on HADOOP-14036 at 3/15/17 12:59 AM:
------------------------------------------------------------------
Not proposing this for inclusion just yet (although it may turn out to be
precisely the right fix); for now it's a proof-of-concept of the problem. I see
paths getting added to the containers of objects to move in two places: in the
loop I'm modifying, and again below the comment, "We moved all the children,
now move the top-level dir."
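To make the failure mode concrete, here's a stripped-down model of those two add sites (names like {{listing}} and {{toMove}} are my own, and the real code builds metadata entries from the S3 listing rather than strings):
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DuplicateKeySketch {
  public static void main(String[] args) {
    String srcDir = "/test/hadoop/dir";
    // The listing of the source prefix can include the directory marker
    // itself alongside the children:
    List<String> listing = Arrays.asList("/test/hadoop/dir", "/test/hadoop/dir/file1");

    List<String> toMove = new ArrayList<>();
    for (String child : listing) {
      toMove.add(child);   // add site #1: the listing loop
    }
    toMove.add(srcDir);    // add site #2: "now move the top-level dir"

    // srcDir is now in the collection twice -- exactly what DynamoDB's
    // BatchWriteItem rejects with "Provided list of item keys contains
    // duplicates".
    System.out.println(toMove);
  }
}
{code}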
I should dig a bit into the listObjects call, as I'm curious why we don't hit
this problem in many more tests and workloads that involve renames. I'm also
not entirely sure we actually have to move the top-level dir last (although my
current fix ensures it is added last). If the move isn't atomic, the invariant
that parent paths always exist will be violated for either the new path or the
old path at some point, and this particular operation just adds entries to the
collection that gets broken into batches. It seems cleaner IMO to do it last,
as we do now, but I want to think through it a bit more. Speak up if you have
any insight or opinions there...
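In the meantime, the guard I'm testing looks roughly like this (same illustrative names as above; the real comparison is between Hadoop {{Path}} objects derived from the listing, not strings):
{code}
import java.util.ArrayList;
import java.util.List;

public class RenameGuardSketch {
  static List<String> collectMoves(List<String> listing, String srcDir) {
    List<String> toMove = new ArrayList<>();
    for (String child : listing) {
      if (child.equals(srcDir)) {
        continue;  // would otherwise be added a second time below
      }
      toMove.add(child);
    }
    // "We moved all the children, now move the top-level dir."
    toMove.add(srcDir);
    return toMove;
  }
}
{code}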
After applying this fix (checking whether the directory we're adding matches
the parent directory we add separately at the end, and skipping it in the loop
if it does), I was able to run that test over and over without problems; after
reverting it, I reproduced the issue at least 50% of the time. On one run I got
the batch of failures listed below, and I'm positive no other workload was
using that bucket at the time, but I've since been able to run each of those
tests successfully and complete several more full runs without a problem:
{code}
Failed tests:
  ITestS3GuardToolDynamoDB>S3GuardToolTestBase.testPruneCommandConf:157->S3GuardToolTestBase.testPruneCommand:135->Assert.assertEquals:542->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 expected:<2> but was:<1>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testListLocatedStatusFiltering:499->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 length of listStatus(s3a://mackrory/test, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$AllPathsFilter@69b9805a) expected:<2> but was:<1>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testListStatusFiltering:466->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 length of listStatus(s3a://mackrory/test, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$MatchesNameFilter@4ce8f437) expected:<1> but was:<0>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testComplexDirActions:143->AbstractContractGetFileStatusTest.checkListStatusStatusComplexDir:162->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 listStatus(): file count in 1 directory and 0 files expected:<4> but was:<0>

Tests in error:
  ITestS3AEncryptionSSEKMSDefaultKey>AbstractTestS3AEncryption.testEncryptionOverRename:71 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testReadSmallFile:531 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testNegativeSeek:181 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testSeekFile:207 » FileNotFound
  ...
  ITestS3AContractSeek>AbstractContractSeekTest.testReadFullyPastEOF:467 » FileNotFound
  ITestS3AContractDistCp>AbstractContractDistCpTest.deepDirectoryStructureToRemote:90->AbstractContractDistCpTest.deepDirectoryStructure:139 » FileNotFound
  ITestS3AContractDistCp>AbstractContractDistCpTest.largeFilesToRemote:96->AbstractContractDistCpTest.largeFiles:174 » FileNotFound
{code}
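An alternative (or complementary) fix would be for the metadata store to deduplicate item keys itself before breaking the collection into batches, since DynamoDB is the layer that actually rejects duplicates. A minimal sketch, using strings where DynamoDBMetadataStore.move really deals in path metadata:
{code}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupeSketch {
  // Drop repeated keys before batching. Note this keeps the *first*
  // occurrence of each key, so a duplicated top-level dir would move in
  // its loop position rather than last.
  static List<String> dedupe(List<String> itemKeys) {
    return new ArrayList<>(new LinkedHashSet<>(itemKeys));
  }
}
{code}
One wrinkle with that approach: because it keeps the first occurrence, it wouldn't preserve the "top-level dir moves last" ordering when a duplicate sneaks in, which is another reason I lean toward fixing this on the rename side.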
> S3Guard: intermittent duplicate item keys failure
> -------------------------------------------------
>
> Key: HADOOP-14036
> URL: https://issues.apache.org/jira/browse/HADOOP-14036
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Aaron Fabbri
> Assignee: Mingliang Liu
> Attachments: HADOOP-14036-HADOOP-13345.000.patch
>
>
> I see this occasionally when running integration tests with -Ds3guard
> -Ddynamo:
> {noformat}
> testRenameToDirWithSamePrefixAllowed(org.apache.hadoop.fs.s3a.ITestS3AFileSystemContract)  Time elapsed: 2.756 sec  <<< ERROR!
> org.apache.hadoop.fs.s3a.AWSServiceIOException: move: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Provided list of item keys contains duplicates (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: QSBVQV69279UGOB4AJ4NO9Q86VVV4KQNSO5AEMVJF66Q9ASUAAJG): Provided list of item keys contains duplicates (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: QSBVQV69279UGOB4AJ4NO9Q86VVV4KQNSO5AEMVJF66Q9ASUAAJG)
>     at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
>     at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.move(DynamoDBMetadataStore.java:408)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:869)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:662)
>     at org.apache.hadoop.fs.FileSystemContractBaseTest.rename(FileSystemContractBaseTest.java:525)
>     at org.apache.hadoop.fs.FileSystemContractBaseTest.testRenameToDirWithSamePrefixAllowed(FileSystemContractBaseTest.java:669)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcces
> {noformat}