Hi guys,
We recently had the following problem on our production cluster:
The filesystem containing the editlog and fsimage ran out of free inodes.
As a result, the namenode wasn't able to obtain inodes for the new
fsimage and editlog after a checkpoint was reached, while the
previous files had already been freed.
Unfortunately, we had no monitoring on the number of free inodes, so
the namenode ran in this state for a few hours.
We only noticed the failure on its DFS status page.
The namenode didn't enter safe-mode, so none of the writes made during
that time could be persisted to the editlog.
After discovering the problem we freed inodes, and the filesystem
seemed to be okay again. We then tried to force the namenode to persist
to the editlog, with no success.
Eventually, we restarted the namenode, which of course caused us to lose
all the data that was written to HDFS during those few hours
(fortunately we have a backup of the recent writes, so we restored the
data from there).
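
For what it's worth, the inode check we were missing is simple to script.
Below is a minimal sketch in Python, assuming a Unix host (os.statvfs is
not available on Windows). The function names, the 90% threshold and the
directory passed on the command line are placeholders of mine, not
anything Hadoop provides:

```python
import os
import sys


def inode_usage(path):
    """Return (total, free, used_fraction) inodes for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem
    free = st.f_ffree    # free inodes
    if total == 0:
        # Some filesystems do not report a fixed inode count;
        # treat them as "no usage information" instead of dividing by zero.
        return total, free, 0.0
    return total, free, (total - free) / total


def inodes_low(path, warn_fraction=0.9):
    """True when the used-inode fraction has reached warn_fraction (hypothetical threshold)."""
    _, _, used = inode_usage(path)
    return used >= warn_fraction


if __name__ == "__main__":
    # Pass the directory that holds the fsimage and editlog; defaults to "/".
    name_dir = sys.argv[1] if len(sys.argv) > 1 else "/"
    total, free, used = inode_usage(name_dir)
    print("%s: %d of %d inodes free (%.1f%% used)" % (name_dir, free, total, used * 100))
    # A non-zero exit code lets cron or a monitoring agent raise an alert.
    sys.exit(1 if inodes_low(name_dir) else 0)
```

Run from cron or a monitoring agent against the namenode's storage
directory, something like this would probably have alerted us before
the checkpoint failed.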
This situation raises some severe concerns:
1. How come the namenode identified a failure in persisting its editlog
and didn't enter safe-mode? (The exception was logged only at WARN
severity, not CRITICAL.)
2. How come, after we freed inodes, we still couldn't get the namenode
to persist? Maybe there should be a CLI command that enables us to force
the namenode to persist its editlog.
Do you know of a JIRA opened for these issues, or should I open one?
Thanks,
Guy