Hi Guy,

Several questions come to mind here:
- What was the exact WARN level message you saw?
- Did you have multiple dfs.name.dirs configured as recommended by most setup guides?
- Did you try entering safemode and then running saveNamespace to persist the image before shutting down the NN? This would have saved your data.
- What exact version of HDFS were you running?

This is certainly not expected behavior... all of the places where an edit log fails have a check against there being 0 edit logs remaining and should issue a FATAL level message followed by a System.exit(-1).

-Todd

On Thu, Dec 15, 2011 at 1:16 AM, Guy Doulberg <[email protected]> wrote:
> Hi guys,
>
> We recently had the following problem on our production cluster:
>
> The filesystem containing the editlog and fsimage had no free inodes.
> As a result the namenode wasn't able to obtain an inode for the fsimage and
> editlog after a checkpoint had been reached, while the previous files were
> freed.
> Unfortunately, we had no monitoring on the number of inodes, so it happened
> that the namenode ran in this state for a few hours.
>
> We noticed this failure on its DFS-status page.
>
> But the namenode didn't enter safe-mode, so all the writes that were made
> couldn't be persisted to the editlog.
>
>
> After discovering the problem we freed inodes, and the file-system seemed to
> be okay again. We tried to force the namenode to persist to the editlog with
> no success.
>
> Eventually, we restarted the namenode, which of course caused us to lose all
> the data that was written to HDFS during these few hours (fortunately we
> have a backup of the recent writes, so we restored the data from there).
>
> This situation raises some severe concerns:
> 1. How come the namenode identified a failure in persisting its editlog and
> didn't enter safe-mode? (The exception was given only a WARN severity and
> not a CRITICAL.)
> 2. How come after we freed inodes, we couldn't persist the namenode? Maybe
> there should be a command in the CLI that would enable us to force the
> namenode to persist its editlog.
>
> Do you know of a JIRA opened for these issues, or should I open one?
>
> Thanks
> Guy
>
>

--
Todd Lipcon
Software Engineer, Cloudera
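
On the missing inode monitoring mentioned in the quoted message: a minimal sketch of such a check in Python, assuming a POSIX system where os.statvfs reports inode counts. The names inode_usage and check_dirs and the 90% threshold are illustrative choices, not part of Hadoop, and the directories to check would be whatever dfs.name.dir points at on your namenode.

```python
import os
import sys


def inode_usage(path):
    """Return (total, free, used_fraction) inode figures for the
    filesystem that holds `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem
    free = st.f_favail   # inodes available to unprivileged users
    if total == 0:
        # Some filesystems do not report inode counts at all.
        return total, free, 0.0
    return total, free, (total - free) / total


def check_dirs(dirs, warn_at=0.90):
    """Return [(dir, used_fraction)] for every directory whose
    filesystem has inode usage at or above `warn_at`."""
    alerts = []
    for d in dirs:
        _, _, used = inode_usage(d)
        if used >= warn_at:
            alerts.append((d, used))
    return alerts


if __name__ == "__main__":
    # Pass the name directories as arguments; defaults to "/".
    for d, used in check_dirs(sys.argv[1:] or ["/"]):
        print(f"WARNING: {d} inode usage at {used:.0%}")
```

Run from cron or a monitoring agent, this would flag the filesystem before it runs out of inodes entirely; it only reports and does nothing to the namenode itself.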
