[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211495#comment-14211495 ]
Gera Shegalov commented on HDFS-7385:
-------------------------------------

It may be cleaner to introduce an abstract {{FSEditLogOp#reset()}} that all ops need to override, and to call it in {{FSEditLog#logEdit(FSEditLogOp)}}:
{code}
try {
  editLogStream.write(op);
} catch (IOException ex) {
  // All journals failed, it is handled in logSync.
} finally {
  op.reset();
}
{code}
This would catch similar problems with garbage sitting in TLS in the future.

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> ---------------------------------------------------------------------
>
>                 Key: HDFS-7385
>                 URL: https://issues.apache.org/jira/browse/HDFS-7385
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.5.0
>            Reporter: jiangyu
>            Assignee: jiangyu
>            Priority: Blocker
>             Fix For: 2.6.0
>
>         Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
> We migrated our NameNodes from low-configuration to high-configuration
> machines last week. First, we imported the current directory, including
> the fsimage and editlog files, from the original ActiveNameNode to the new
> ActiveNameNode and started the new NameNode. Then we changed the
> configuration of all datanodes and restarted them; they sent block reports
> to the new NameNodes at once and heartbeats after that.
> Everything seemed perfect, but after we restarted the ResourceManager,
> most of the users complained that their jobs couldn't be executed because
> of permission problems.
> We use Acls in our clusters, and after the migration we found that most of
> the directories and files which had no Acls set before now had Acls.
> That is the reason why users could not execute their jobs. So we had to
> change the permission of most files to a+r and of most directories to a+rx
> to make sure the jobs could be executed.
> After investigating this problem for some days, I found there is a bug in
> FSEditLog.java.
> The ThreadLocal variable cache in FSEditLog is not reset to the proper
> value in the logMkDir and logOpenFile functions. Here is the code of
> logMkDir:
> public void logMkDir(String path, INode newNode) {
>   PermissionStatus permissions = newNode.getPermissionStatus();
>   MkdirOp op = MkdirOp.getInstance(cache.get())
>     .setInodeId(newNode.getId())
>     .setPath(path)
>     .setTimestamp(newNode.getModificationTime())
>     .setPermissionStatus(permissions);
>   AclFeature f = newNode.getAclFeature();
>   if (f != null) {
>     op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>   }
>   logEdit(op);
> }
> For example, if we mkdir with Acls through one handler (a thread, in fact),
> we set the AclEntries on the op taken from the cache. After that, if we
> mkdir without any Acls through the same handler, the AclEntries of the
> cached op are still those of the previous call that set Acls, and because
> the newNode has no AclFeature, we never get a chance to change them. The
> editlog is then wrong: it records the wrong Acls. After the Standby loads
> the editlogs from the journalnodes, applies them to memory in the SNN, then
> saves the namespace and transfers the wrong fsimage to the ANN, all the
> fsimages are wrong. The only solution is to save the namespace from the
> ANN, which gives you the right fsimage.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
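
The stale-state problem described in the report, and the effect of the {{reset()}} call suggested in the comment, can be reproduced in isolation. The following is only a minimal sketch under stated assumptions: {{CachedMkdirOp}}, {{StaleThreadLocalDemo}}, and the string "journal record" are hypothetical stand-ins invented for illustration, not the real Hadoop classes or the actual patch.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a reusable edit log op; NOT the real Hadoop MkdirOp.
final class CachedMkdirOp {
    String path;
    List<String> aclEntries;

    CachedMkdirOp setPath(String path) { this.path = path; return this; }
    CachedMkdirOp setAclEntries(List<String> acl) { this.aclEntries = acl; return this; }

    // Clears every field so nothing survives into the next reuse.
    void reset() { path = null; aclEntries = null; }
}

final class StaleThreadLocalDemo {
    // One reusable op per thread, mirroring the per-handler cache in the report.
    private static final ThreadLocal<CachedMkdirOp> CACHE =
        ThreadLocal.withInitial(CachedMkdirOp::new);

    // Returns the record that would be written to the journal (assumed format).
    static String logMkdir(String path, List<String> acl, boolean resetAfterWrite) {
        CachedMkdirOp op = CACHE.get().setPath(path);
        if (acl != null) {
            // Same shape as the reported code: ACLs are set only when present,
            // so a previous call's ACLs are never cleared.
            op.setAclEntries(acl);
        }
        try {
            return op.path + " acl=" + op.aclEntries;
        } finally {
            if (resetAfterWrite) {
                op.reset(); // the fix sketched in the comment: always clear after writing
            }
        }
    }

    public static void main(String[] args) {
        List<String> acl = Collections.singletonList("user:alice:rwx");
        System.out.println(logMkdir("/a", acl, false));
        System.out.println(logMkdir("/b", null, false)); // "/b" inherits the ACL of "/a"
        System.out.println(logMkdir("/c", acl, true));
        System.out.println(logMkdir("/d", null, true));  // "/d" is logged without ACLs
    }
}
```

Resetting in a {{finally}} block means the cleanup does not depend on each {{logXxx}} caller remembering to overwrite every field, which is the class of mistake the report describes.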