Hi Guy,

Several questions come to mind here:
- What was the exact WARN level message you saw?
- Did you have multiple dfs.name.dirs configured as recommended by
most setup guides?
- Did you try entering safemode and then running saveNamespace to
persist the image before shutting down the NN? This would have saved
your data.
- What exact version of HDFS were you running?
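
For reference, the safemode and saveNamespace sequence from the third
question can be sketched roughly as below. This is only a sketch: the
persist_namespace helper and its run parameter are made up for
illustration, and it assumes the hadoop launcher is on the PATH and that
your release ships dfsadmin -saveNamespace (not every version does, so
check the dfsadmin usage output first).

```python
import subprocess

# Assumed command lines; verify them against the dfsadmin usage output
# for your Hadoop version before relying on them.
SAFEMODE_ENTER = ["hadoop", "dfsadmin", "-safemode", "enter"]
SAVE_NAMESPACE = ["hadoop", "dfsadmin", "-saveNamespace"]


def persist_namespace(run=subprocess.check_call):
    """Hypothetical helper: enter safemode, then save the namespace.

    `run` is injectable so the sequence can be inspected without a cluster.
    """
    run(SAFEMODE_ENTER)  # stop accepting namespace modifications
    run(SAVE_NAMESPACE)  # ask the NN to write its namespace to the name dirs
```

Calling persist_namespace() with the default runner raises
CalledProcessError if either command exits non-zero, so a failed save
does not go unnoticed before the NN is shut down.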

This is certainly not expected behavior... every place where an edit
log can fail checks whether 0 edit logs remain and, if so, should issue
a FATAL level message followed by a System.exit(-1).
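
To make the expected behavior concrete, here is a rough paraphrase of
that check. It is not the actual HDFS code (which is Java); the function
name, its arguments, and the log wording are all invented for
illustration.

```python
import logging
import sys

log = logging.getLogger("editlog-sketch")


def handle_edit_log_failure(active_dirs, failed_dir, exit_fn=sys.exit):
    """Hypothetical sketch: drop a failed edit log dir, abort if none remain."""
    remaining = [d for d in active_dirs if d != failed_dir]
    if not remaining:
        # Stand-in for the FATAL message followed by System.exit(-1).
        log.critical("No edit logs remain after %s failed; exiting", failed_dir)
        exit_fn(-1)
        return remaining  # only reached when exit_fn is a test stub
    # Other edit logs are still usable, so a warning is enough here.
    log.warning("Edit log %s failed; %d remaining", failed_dir, len(remaining))
    return remaining
```

With a single configured directory, the first failure already leaves
zero edit logs, so under this logic the process should exit rather than
keep running with only a warning.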

-Todd

On Thu, Dec 15, 2011 at 1:16 AM, Guy Doulberg <[email protected]> wrote:
> Hi guys,
>
> We recently had the following problem on our production cluster:
>
> The filesystem containing the editlog and fsimage had no free inodes.
> As a result, the namenode wasn't able to obtain an inode for the fsimage and
> editlog after a checkpoint had been reached, while the previous files were
> freed.
> Unfortunately, we had no monitoring on the number of inodes, so the
> namenode ran in this state for a few hours.
>
> We noticed this failure on its DFS status page.
>
> But the namenode didn't enter safe-mode, so all the writes that were made
> couldn't be persisted to the editlog.
>
>
> After discovering the problem we freed inodes, and the file-system seemed to
> be okay again. We tried to force the namenode to persist to the editlog, with
> no success.
>
> Eventually, we restarted the namenode, which of course caused us to lose all
> the data that was written to HDFS during these few hours (fortunately we
> have a backup of the recent writes, so we restored the data from there).
>
> This situation raises some severe concerns:
> 1. How come the namenode identified a failure in persisting its editlog and
> didn't enter safe-mode? (The exception was given only a WARN severity and
> not a CRITICAL.)
> 2. How come after we freed inodes, we couldn't persist the namenode? Maybe
> there should be a command in the CLI that would enable us to force the
> namenode to persist its editlog.
>
> Do you know of a JIRA opened for these issues, or should I open one?
>
> Thanks, Guy
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
