[
https://issues.apache.org/jira/browse/HADOOP-14512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051023#comment-16051023
]
Mingliang Liu commented on HADOOP-14512:
----------------------------------------
Steve, sorry for the late report. I ran all the unit and live tests against US
West, and they all passed. It's a good convention to post the test report before
committing.
{code}
hadoop-tools/hadoop-azure $ mvn test -q
-------------------------------------------------------
T E S T S
-------------------------------------------------------
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractAppend
Tests run: 5, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 27.74 sec - in
org.apache.hadoop.fs.azure.contract.TestAzureNativeContractAppend
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractCreate
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.296 sec -
in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractCreate
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDelete
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.426 sec - in
org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDelete
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 210.658 sec -
in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractGetFileStatus
Tests run: 18, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.542 sec -
in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractGetFileStatus
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractMkdir
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 87.217 sec - in
org.apache.hadoop.fs.azure.contract.TestAzureNativeContractMkdir
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractOpen
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.406 sec - in
org.apache.hadoop.fs.azure.contract.TestAzureNativeContractOpen
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractRename
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 29.704 sec - in
org.apache.hadoop.fs.azure.contract.TestAzureNativeContractRename
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractSeek
Tests run: 18, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 56.787 sec -
in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractSeek
Running org.apache.hadoop.fs.azure.metrics.TestAzureFileSystemInstrumentation
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 142.165 sec -
in org.apache.hadoop.fs.azure.metrics.TestAzureFileSystemInstrumentation
Running org.apache.hadoop.fs.azure.metrics.TestBandwidthGaugeUpdater
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.379 sec - in
org.apache.hadoop.fs.azure.metrics.TestBandwidthGaugeUpdater
Running
org.apache.hadoop.fs.azure.metrics.TestNativeAzureFileSystemMetricsSystem
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.734 sec - in
org.apache.hadoop.fs.azure.metrics.TestNativeAzureFileSystemMetricsSystem
Running org.apache.hadoop.fs.azure.metrics.TestRollingWindowAverage
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.217 sec - in
org.apache.hadoop.fs.azure.metrics.TestRollingWindowAverage
Running org.apache.hadoop.fs.azure.TestAzureConcurrentOutOfBandIo
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.986 sec - in
org.apache.hadoop.fs.azure.TestAzureConcurrentOutOfBandIo
Running org.apache.hadoop.fs.azure.TestAzureFileSystemErrorConditions
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.949 sec - in
org.apache.hadoop.fs.azure.TestAzureFileSystemErrorConditions
Running org.apache.hadoop.fs.azure.TestBlobDataValidation
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.96 sec - in
org.apache.hadoop.fs.azure.TestBlobDataValidation
Running org.apache.hadoop.fs.azure.TestBlobMetadata
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.786 sec - in
org.apache.hadoop.fs.azure.TestBlobMetadata
Running org.apache.hadoop.fs.azure.TestBlobTypeSpeedDifference
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.63 sec - in
org.apache.hadoop.fs.azure.TestBlobTypeSpeedDifference
Running org.apache.hadoop.fs.azure.TestContainerChecks
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.44 sec - in
org.apache.hadoop.fs.azure.TestContainerChecks
Running org.apache.hadoop.fs.azure.TestFileSystemOperationExceptionHandling
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 47.314 sec -
in org.apache.hadoop.fs.azure.TestFileSystemOperationExceptionHandling
Running org.apache.hadoop.fs.azure.TestFileSystemOperationExceptionMessage
Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 373.217 sec -
in org.apache.hadoop.fs.azure.TestFileSystemOperationExceptionMessage
Running
org.apache.hadoop.fs.azure.TestFileSystemOperationsExceptionHandlingMultiThreaded
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 36.973 sec -
in
org.apache.hadoop.fs.azure.TestFileSystemOperationsExceptionHandlingMultiThreaded
Running org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads
Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 399.735 sec -
in org.apache.hadoop.fs.azure.TestFileSystemOperationsWithThreads
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAppend
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 172.078 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAppend
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAtomicRenameDirList
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.393 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAtomicRenameDirList
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAuthorization
Tests run: 21, Failures: 0, Errors: 0, Skipped: 21, Time elapsed: 6.24 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAuthorization
Running
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAuthorizationWithOwner
Tests run: 24, Failures: 0, Errors: 0, Skipped: 24, Time elapsed: 7.036 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemAuthorizationWithOwner
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemBlockLocations
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.822 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemBlockLocations
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemClientLogging
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.837 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemClientLogging
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrency
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.999 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrency
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrencyLive
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.766 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrencyLive
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractEmulator
Tests run: 43, Failures: 0, Errors: 0, Skipped: 43, Time elapsed: 0.432 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractEmulator
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractLive
Tests run: 43, Failures: 0, Errors: 0, Skipped: 5, Time elapsed: 208.667 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractLive
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked
Tests run: 43, Failures: 0, Errors: 0, Skipped: 5, Time elapsed: 1.151 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractPageBlobLive
Tests run: 43, Failures: 0, Errors: 0, Skipped: 5, Time elapsed: 218.064 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractPageBlobLive
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemFileNameCheck
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.811 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemFileNameCheck
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemLive
Tests run: 51, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 431.203 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemLive
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked
Tests run: 46, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.041 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked
Tests run: 50, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.346 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked
Running org.apache.hadoop.fs.azure.TestNativeAzureFileSystemUploadLogic
Tests run: 3, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 0.058 sec - in
org.apache.hadoop.fs.azure.TestNativeAzureFileSystemUploadLogic
Running org.apache.hadoop.fs.azure.TestNativeAzureFSPageBlobLive
Tests run: 46, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 443.228 sec -
in org.apache.hadoop.fs.azure.TestNativeAzureFSPageBlobLive
Running org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.843 sec - in
org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations
Running org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperationsLive
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.627 sec - in
org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperationsLive
Running org.apache.hadoop.fs.azure.TestReadAndSeekPageBlobAfterWrite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 299.048 sec -
in org.apache.hadoop.fs.azure.TestReadAndSeekPageBlobAfterWrite
Running org.apache.hadoop.fs.azure.TestShellDecryptionKeyProvider
Tests run: 2, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 0.105 sec - in
org.apache.hadoop.fs.azure.TestShellDecryptionKeyProvider
Running org.apache.hadoop.fs.azure.TestWasbFsck
Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.728 sec - in
org.apache.hadoop.fs.azure.TestWasbFsck
Running org.apache.hadoop.fs.azure.TestWasbRemoteCallHelper
Tests run: 8, Failures: 0, Errors: 0, Skipped: 8, Time elapsed: 2.237 sec - in
org.apache.hadoop.fs.azure.TestWasbRemoteCallHelper
Running org.apache.hadoop.fs.azure.TestWasbUriAndConfiguration
Tests run: 18, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 10.17 sec - in
org.apache.hadoop.fs.azure.TestWasbUriAndConfiguration
Results :
Tests run: 704, Failures: 0, Errors: 0, Skipped: 119
{code}
> WASB atomic rename should not throw exception if the file is neither in src
> nor in dst when doing the rename
> ------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-14512
> URL: https://issues.apache.org/jira/browse/HADOOP-14512
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/azure
> Affects Versions: 2.8.0
> Reporter: Duo Xu
> Assignee: Duo Xu
> Fix For: 3.0.0-alpha4, 2.8.2
>
> Attachments: HADOOP-14512.001.patch, HADOOP-14512.002.patch
>
>
> During atomic rename operation, WASB creates a rename pending json file to
> document which files need to be renamed and the destination. Then WASB will
> read this file and rename all the files one by one.
> A recent customer incident in HBase exposed a potential bug in the atomic
> rename implementation.
> For example, below is a rename pending json file:
> {code}
> {
> FormatVersion: "1.0",
> OperationUTCTime: "2017-04-29 06:08:57.465",
> OldFolderName: "hbase\/data\/default\/abc",
> NewFolderName: "hbase\/.tmp\/data\/default\/abc",
> FileList: [
> ".tabledesc",
> ".tabledesc\/.tableinfo.0000000001",
> ".tmp",
> "08e698e0b7d4132c0456b16dcf3772af",
> "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo",
> "08e698e0b7d4132c0456b16dcf3772af\/0\/617294e0737e4d37920e1609cf539a83",
> "08e698e0b7d4132c0456b16dcf3772af\/recovered.edits\/185.seqid",
> "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo",
> "08e698e0b7d4132c0456b16dcf3772af\/0",
> "08e698e0b7d4132c0456b16dcf3772af\/0\/617294e0737e4d37920e1609cf539a83",
> "08e698e0b7d4132c0456b16dcf3772af\/recovered.edits",
> "08e698e0b7d4132c0456b16dcf3772af\/recovered.edits\/185.seqid"
> ]
> }
> {code}
> While the HBase regionserver process (which uses the WASB driver underneath)
> was renaming "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo", the regionserver
> process crashed or the VM was rebooted for system maintenance. When the
> regionserver process started running again, it found the rename pending json
> file and tried to redo the rename operation.
> However, when it read the first file, ".tabledesc", in the file list, it could
> find this file in neither the src folder nor the destination folder. The file
> was not in the src folder because it had already been renamed/moved to the
> destination folder. It was not in the destination folder because HBase, when
> it starts, cleans up all the files under /hbase/.tmp.
> The current implementation will throw exceptions saying
> {code}
> else {
> throw new IOException(
> "Attempting to complete rename of file " + srcKey + "/" + fileName
> + " during folder rename redo, and file was not found in source "
> + "or destination.");
> }
> {code}
> This causes HBase HMaster initialization to fail, and restarting HMaster does
> not help because the same exception is thrown again.
> My proposal is that if, during the redo, WASB finds a file that is in neither
> src nor dst, WASB should skip this file and process the next one rather than
> throw the error and leave the user to fix it manually. The reasons are:
> 1. Since the rename pending json file lists file A, if file A is not in src,
> it must have been renamed.
> 2. If file A is in neither src nor dst, the upper layer service must have
> removed it. Note that the folder is locked during the atomic rename, so the
> only way the file can be deleted is after a VM reboot or a service process
> crash. When the service process restarts, some operations may happen before
> the atomic rename redo, as in the HBase example above.
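
For readers following the proposal, here is a minimal sketch of the skip-on-redo
behaviour it describes. This is not the WASB code or the attached patch: the
class RenameRedoSketch, the Outcome enum, and the two in-memory sets standing in
for the src and dst folders are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models the redo loop over the pending file list.
// The src/dst sets are a hypothetical stand-in for the real blob store.
class RenameRedoSketch {

    /** What happened to one entry of the pending file list during redo. */
    enum Outcome { RENAMED, ALREADY_RENAMED, SKIPPED_MISSING }

    static Outcome redoOne(String fileName, Set<String> src, Set<String> dst) {
        if (src.contains(fileName)) {
            // Still in the source folder: finish the rename.
            src.remove(fileName);
            dst.add(fileName);
            return Outcome.RENAMED;
        }
        if (dst.contains(fileName)) {
            // Already moved before the crash: nothing to do.
            return Outcome.ALREADY_RENAMED;
        }
        // In neither folder: the proposal is to skip instead of throwing.
        return Outcome.SKIPPED_MISSING;
    }

    /** Redoes every entry and returns the names that were skipped. */
    static List<String> redo(List<String> fileList, Set<String> src, Set<String> dst) {
        List<String> skipped = new ArrayList<>();
        for (String fileName : fileList) {
            if (redoOne(fileName, src, dst) == Outcome.SKIPPED_MISSING) {
                skipped.add(fileName);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // ".tabledesc" was renamed before the crash and then removed from dst
        // by the upper layer; the other two entries are still in src.
        List<String> fileList = Arrays.asList(".tabledesc", "region/.regioninfo", "region/0");
        Set<String> src = new HashSet<>(Arrays.asList("region/.regioninfo", "region/0"));
        Set<String> dst = new HashSet<>();

        List<String> skipped = redo(fileList, src, dst);
        System.out.println("skipped: " + skipped);
        System.out.println("src empty after redo: " + src.isEmpty());
    }
}
```

In this sketch the redo completes for every entry that can still be renamed and
only reports the missing ones, so a restart is not blocked by a file that the
upper layer has already removed.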