Hi Harsh I check branch-1 and there is no such problem in it. This bug is only related to CDH3 version.Sorry for bothering you.
On Mon, Apr 8, 2013 at 5:24 PM, Harsh J <ha...@cloudera.com> wrote: > Anty, > > Worth checking if you can spot the bug in branch-1 as well: > > http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java > (and other refs under > > http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/hdfs/org/apache/hadoop/hdfs/server/namenode/ > ) > > On Mon, Apr 8, 2013 at 2:44 PM, Anty <anty....@gmail.com> wrote: > > @Harsh > > I'm using CDH3u4.However, the processing logical with regard to FSEditlog > > between CDH3U4 and hadoop 1.0.2 is almost the same. > > So i'm not sure it is proper to file a JIRA? > > > > On Mon, Apr 8, 2013 at 4:33 PM, Harsh J <ha...@cloudera.com> wrote: > > > >> Thanks for analyzing and reporting this Anty, > >> > >> What version of Apache Hadoop 1.x are you encountering this on? If > >> you've spotted the code issue on branch-1, please do log a HDFS JIRA > >> with some NN logs and your other details. > >> > >> On Sun, Apr 7, 2013 at 3:18 PM, Anty <anty....@gmail.com> wrote: > >> > Hi:ALL > >> > > >> > In our cluster, we configure the NameNode to write to both local and > NFS > >> > mounted directories. When the NFS mounted directory is inaccessible, > the > >> > NameNode should keep running without error, but our namenode crash > with > >> > following stack trace. > >> > > >> > 2013-04-02 23:35:21,535 WARN > >> org.apache.hadoop.hdfs.server.common.Storage: > >> >> Removing storage dir /nfs2-mount/onest3/dfs/name > >> >> 2013-04-02 23:35:21,536 FATAL > >> >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to find > >> edits > >> >> stream with IO error > >> >> java.lang.Exception: Unable to find edits stream with IO error > >> >> at > >> >> > >> > org.apache.hadoop.hdfs.server.namenode.FSEditLog.fatalExit(FSEditLog.java:430) > >> >> at > >> >> > >> > org.apache.hadoop.hdfs.server.namenode.FSEditLog.removeEditsStreamsAndStorageDirs(FSEditLog.java:519) > >> >> at > >> >> > >> > org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1139) > >> >> at > >> >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1641) > >> >> at > >> >> > >> > org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689) > >> >> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown > Source) > >> >> at > >> >> > >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >> >> at java.lang.reflect.Method.invoke(Method.java:597) > >> >> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) > >> >> at > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434) > >> >> at > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430) > >> >> at java.security.AccessController.doPrivileged(Native Method) > >> >> at javax.security.auth.Subject.doAs(Subject.java:396) > >> >> at > >> >> > >> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) > >> >> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428) > >> >> 2013-04-02 23:35:21,539 INFO > >> >> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: > >> >> > >> > > >> > According to the stack trace, When NameNode tries to sync edit log, it > >> does > >> > identify the mounted NFS directory is inaccessible, and attempt to > remove > >> > it from the FSEditLog#editStreams. However, it found the edit stream > >> > corresponding to the mounted NFS has already been removed. Under this > >> > circumstance, NameNode just kill itself, aborted! > >> > > >> > After looking through the source code of HDFS, I found there is > another > >> > code path of removing edit stream from FSEditLog#editStreams, which > can > >> > cause above race condition. In method FSEditLog#logEdit > >> > > >> > > >> >> if (getNumEditStreams() < 1) > >> >> { > >> >> throw new AssertionError("No edit streams to log to"); > >> >> } > >> >> long start = FSNamesystem.now(); > >> >> for (int idx = 0; idx < editStreams.size(); idx++) > >> >> { > >> >> EditLogOutputStream eStream = editStreams.get(idx); > >> >> try > >> >> { > >> >> eStream.write(op, writables); > >> >> } > >> >> catch (IOException ioe) > >> >> { > >> >> removeEditsAndStorageDir(idx); > >> >> idx--; > >> >> } > >> >> } > >> >> > >> >> > >> > The cause of this race condition lie in FSEditLog#logSync method, > there > >> are > >> > two steps in FSEditLog#logSync > >> > > >> > 1. Do sync operation, if one edit stream is accessible, put it into > >> > error stream list.(un-synchronized) > >> > 2. Delete error stream from FSEditLog#editStreams(synchronized) > >> > > >> > Step #1 isn’t synchronized, so there is a possibility that after > step#1 > >> and > >> > before step #2 the error stream has already been removed from other > >> thread > >> > by invoking FSEditLog#logEdit > >> > > >> > If this was exactly a bug, my proposal fix is in method > FSEditlog#logsync > >> > ignore or print some warning message instead of aborting the namenode > >> when > >> > error edit stream doesn't exist in FSEditLog#editStreams > >> > > >> > > >> > > >> > -- > >> > Best Regards > >> > Anty Rao > >> > >> > >> > >> -- > >> Harsh J > >> > > > > > > > > -- > > Best Regards > > Anty Rao > > > > -- > Harsh J > -- Best Regards Anty Rao