Hi guys,

We recently had the following problem on our production cluster:

The filesystem containing the editlog and fsimage had no free inodes.
As a result, after a checkpoint was reached, the namenode wasn't able to obtain an inode for the new fsimage and editlog, even though the previous files had been freed. Unfortunately, we had no monitoring on the number of free inodes, so the namenode ran in this state for a few hours.
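
For reference, the kind of inode check we were missing could look roughly like the sketch below. It is only an illustration, assuming a POSIX system where Python's os.statvfs is available. The directory /data/dfs/name and the 5% threshold are made-up placeholders, not our real configuration, and the function names are hypothetical.

```python
import os


def inode_usage(path):
    """Return (total_inodes, available_inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    # f_files: total inodes; f_favail: inodes available to unprivileged users.
    return st.f_files, st.f_favail


def is_low(total, available, min_free_fraction=0.05):
    """True when the share of available inodes is below min_free_fraction."""
    if total <= 0:
        # Some filesystems report 0 total inodes; nothing meaningful to check.
        return False
    return (available / total) < min_free_fraction


if __name__ == "__main__":
    # Hypothetical placeholder: the directory holding fsimage and editlog.
    name_dir = "/data/dfs/name"
    total, available = inode_usage(name_dir)
    if is_low(total, available):
        print(f"WARNING: {name_dir} has only {available} of {total} inodes available")
```

Run from cron or hooked into whatever alerting is already in place, something like this should flag the condition well before the namenode fails to write a new fsimage.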

We noticed the failure on the namenode's DFS status page.

But the namenode didn't enter safe mode, so none of the writes made during that time could be persisted to the editlog.


After discovering the problem we freed some inodes, and the filesystem seemed to be okay again. We then tried to force the namenode to persist to the editlog, with no success.

Eventually, we restarted the namenode, which of course caused us to lose all the data that was written to HDFS during those few hours. (Fortunately we have a backup of the recent writes, so we restored the data from there.)

This situation raises some severe concerns:
1. How come the namenode identified a failure in persisting its editlog and didn't enter safe mode? (The exception was logged only at WARN severity, not as CRITICAL.)
2. How come, after we freed inodes, we couldn't get the namenode to persist? Maybe there should be a CLI command that lets us force the namenode to persist its editlog.

Do you know of a JIRA already opened for these issues, or should I open one?

Thanks,
Guy

