jiangyu created HDFS-7385:
-----------------------------
Summary: ThreadLocal used in FSEditLog class lead FSImage
permission mess up
Key: HDFS-7385
URL: https://issues.apache.org/jira/browse/HDFS-7385
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.5.0, 2.4.0
Reporter: jiangyu
We migrated our NameNodes from low-spec to high-spec machines last week.
First, we copied the current directory, including the fsimage and edit log
files, from the original active NameNode to the new active NameNode and
started the new NameNode. We then changed the configuration of all DataNodes
and restarted them, after which they sent block reports to the new NameNodes
at once and then resumed sending heartbeats.
Everything seemed fine, but after we restarted the ResourceManager, most
users complained that their jobs could not run because of permission
problems.
We use ACLs in our clusters, and after the migration we found that most of
the directories and files that had never had ACLs set now carried ACL
entries. That is why users could not run their jobs. We had to change the
permissions of most files to a+r and of most directories to a+rx so that the
jobs could run.
After investigating this problem for some days, I found a bug in
FSEditLog.java. The logMkDir and logOpenFile functions do not set every field
of the op they take from the ThreadLocal variable cache in FSEditLog. Here is
the code of logMkDir:
  public void logMkDir(String path, INode newNode) {
    PermissionStatus permissions = newNode.getPermissionStatus();
    MkdirOp op = MkdirOp.getInstance(cache.get())
      .setInodeId(newNode.getId())
      .setPath(path)
      .setTimestamp(newNode.getModificationTime())
      .setPermissionStatus(permissions);
    AclFeature f = newNode.getAclFeature();
    if (f != null) {
      op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
    }
    logEdit(op);
  }
For example, if we mkdir with ACLs through one handler (which is a thread),
the AclEntries are set on the op taken from the cache. If a later mkdir
without any ACLs goes through the same handler, the AclEntries of the cached
op are still those of the previous call, and because the newNode has no
AclFeature, nothing overwrites them. The edit log therefore records the wrong
ACLs. After the standby NameNode (SNN) loads the edit logs from the
JournalNodes, applies them in memory, saves the namespace, and transfers the
wrong fsimage to the active NameNode (ANN), all the fsimages are wrong. The
only solution is to save the namespace from the ANN, which gives a correct
fsimage.
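
The stale-field behaviour described above can be reproduced outside Hadoop
with a small standalone sketch. Everything in it is hypothetical and only
for illustration: Op stands in for MkdirOp, the ACL entries are plain
strings, and logMkDirFixed shows one possible remedy (always overwriting the
field), which is not necessarily how this should be fixed in FSEditLog.

```java
import java.util.Arrays;
import java.util.List;

class OpCacheSketch {
    /** Hypothetical stand-in for MkdirOp: a mutable op that is reused between calls. */
    static final class Op {
        String path;
        List<String> aclEntries; // null means "no ACL recorded"

        Op setPath(String p) { this.path = p; return this; }
        Op setAclEntries(List<String> e) { this.aclEntries = e; return this; }
    }

    // One reusable Op per thread, mirroring the per-handler cache in the report.
    private static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    /** Same pattern as the quoted logMkDir: the ACL field is written only when ACLs exist. */
    static Op logMkDirBuggy(String path, List<String> acl) {
        Op op = CACHE.get().setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return op; // may still carry ACLs from the previous call on this thread
    }

    /** Assumed remedy: always overwrite the field, so a null clears the old value. */
    static Op logMkDirFixed(String path, List<String> acl) {
        return CACHE.get().setPath(path).setAclEntries(acl);
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");

        logMkDirBuggy("/a", acl);
        // Second mkdir has no ACLs, but the cached op still holds the old ones.
        System.out.println(logMkDirBuggy("/b", null).aclEntries); // prints [user:alice:rwx]

        logMkDirFixed("/c", acl);
        System.out.println(logMkDirFixed("/d", null).aclEntries); // prints null
    }
}
```

Both calls in each pair run on the same thread, so they share one cached
Op, just as two mkdir requests served by the same handler would.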