Duo Xu created HADOOP-14512:
-------------------------------
Summary: WASB atomic rename should not throw exception if the file
is neither in src nor in dst when doing the rename
Key: HADOOP-14512
URL: https://issues.apache.org/jira/browse/HADOOP-14512
Project: Hadoop Common
Issue Type: Bug
Components: fs/azure
Reporter: Duo Xu
During an atomic rename operation, WASB creates a rename pending json file that
documents which files need to be renamed and their destination. WASB then
reads this file and renames the files one by one.
A recent customer incident in HBase revealed a potential bug in the
atomic rename implementation.
For example, below is a rename pending json file:
{code}
{
FormatVersion: "1.0",
OperationUTCTime: "2017-04-29 06:08:57.465",
OldFolderName: "hbase\/data\/default\/abc",
NewFolderName: "hbase\/.tmp\/data\/default\/abc",
FileList: [
".tabledesc",
".tabledesc\/.tableinfo.0000000001",
".tmp",
"08e698e0b7d4132c0456b16dcf3772af",
"08e698e0b7d4132c0456b16dcf3772af\/.regioninfo",
"08e698e0b7d4132c0456b16dcf3772af\/0\/617294e0737e4d37920e1609cf539a83",
"08e698e0b7d4132c0456b16dcf3772af\/recovered.edits\/185.seqid",
"08e698e0b7d4132c0456b16dcf3772af\/.regioninfo",
"08e698e0b7d4132c0456b16dcf3772af\/0",
"08e698e0b7d4132c0456b16dcf3772af\/0\/617294e0737e4d37920e1609cf539a83",
"08e698e0b7d4132c0456b16dcf3772af\/recovered.edits",
"08e698e0b7d4132c0456b16dcf3772af\/recovered.edits\/185.seqid"
]
}
{code}
While the HBase regionserver process (which uses the WASB driver underneath) was
renaming "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo", the regionserver
process crashed or the VM was rebooted for system maintenance. When the
regionserver process started running again, it found the rename pending json
file and tried to redo the rename operation.
However, when it read the first file ".tabledesc" in the file list, it could
find this file neither in the src folder nor in the destination folder. The
file was not in the src folder because it had already been renamed/moved to the
destination folder. It was not in the destination folder because HBase, when it
starts, cleans up all the files under /hbase/.tmp.
The current implementation throws an exception in this case:
{code}
else {
  throw new IOException(
      "Attempting to complete rename of file " + srcKey + "/" + fileName
      + " during folder rename redo, and file was not found in source "
      + "or destination.");
}
{code}
This causes HBase HMaster initialization to fail, and restarting HMaster does
not help because the same exception is thrown again.
My proposal is that if, during the redo, WASB finds a file that is in neither
src nor dst, WASB should just skip this file and process the next one rather
than throw the error and leave the user to fix it manually. The reasons are:
1. Since the rename pending json file lists file A, if file A is not in
src, it must have been renamed.
2. If file A is not in src and not in dst, the upper layer service must
have removed it. Note that the folder is locked during the atomic rename, so
the only situation in which the file can get deleted is when the VM reboots or
the service process crashes. When the service process restarts, some operations
might run before the atomic rename redo, as in the HBase example above.
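
The proposed redo behaviour can be sketched roughly as follows. This is only an
illustration over an in-memory set of blob keys, not the actual WASB code: the
class RenameRedoSketch, the method redo and its parameters are hypothetical
names, and the real driver works against Azure storage rather than a Set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model of a folder rename redo; not the WASB code. */
final class RenameRedoSketch {

  /**
   * Replays a pending folder rename over an in-memory set of blob keys.
   * Returns the files found in neither folder, which are skipped.
   */
  static List<String> redo(Set<String> store, String srcKey, String dstKey,
      List<String> fileList) {
    List<String> skipped = new ArrayList<>();
    for (String fileName : fileList) {
      String src = srcKey + "/" + fileName;
      String dst = dstKey + "/" + fileName;
      if (store.contains(src)) {
        // Not renamed before the crash: finish the move now.
        store.remove(src);
        store.add(dst);
      } else if (store.contains(dst)) {
        // Already renamed before the crash: nothing left to do.
        continue;
      } else {
        // Proposed behaviour: the file was renamed and later removed by the
        // upper layer, so record it and move on instead of throwing.
        skipped.add(fileName);
      }
    }
    return skipped;
  }
}
```

Returning the skipped names, rather than silently dropping them, would let a
caller log them for later diagnosis. Whether the real fix should log, return,
or ignore such files is left open here.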