[ 
https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866033#action_12866033
 ] 

Bill Graham commented on CHUKWA-487:
------------------------------------

Actually, looking closer, I can't say for sure that I had data loss. It could be 
just that the bounce of the NN made the file unavailable. In my case it appears 
that the file may simply not have been closed or rotated because the NN had 
gone down.

The only way that you could have data loss AFAIK would be if the current data 
dir, including the edit log, got corrupted since the last SNN checkpoint. I think 
this is rare enough to not worry about. My concern was more about how to make 
sure the collector isn't left in a bad state if part of an un-closed file was 
lost.

The crash and reboot scenario is better than what we have now, but a 
self-recovering solution would be ideal. That way, if the NN crashed 
unexpectedly (perhaps during off-business hours), all the collectors wouldn't 
need to be restarted. Again though, this is probably a rare occurrence.
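
To make the two acceptable behaviors concrete, here is a minimal sketch in 
Java. It is not Chukwa code: the class SelfRecoveringWriter, the HdfsAction 
interface, and the attempt/delay parameters are all hypothetical names made up 
for illustration. The idea is to retry the HDFS operation with capped backoff 
while the NN is unavailable, and to exit the process completely if the retries 
run out, rather than leaving the collector half alive.

```java
import java.io.IOException;

public class SelfRecoveringWriter {

    /** Hypothetical stand-in for one write or rotate operation against HDFS. */
    @FunctionalInterface
    public interface HdfsAction {
        void run() throws IOException;
    }

    /**
     * Runs the action, retrying with capped exponential backoff on IOException.
     * Returns true on success, false if maxAttempts is exhausted or the
     * thread is interrupted while waiting.
     */
    public static boolean writeWithRetry(HdfsAction action, int maxAttempts,
                                         long initialDelayMs) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (IOException e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the last attempt
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
                delay = Math.min(delay * 2, 60_000L); // cap the backoff at one minute
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated outage: the first two calls fail, the third succeeds.
        boolean ok = writeWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new IOException("NameNode unavailable");
            }
        }, 5, 10L);
        if (!ok) {
            // Shut down completely instead of lingering without handling data.
            System.exit(1);
        }
        System.out.println("recovered after " + calls[0] + " attempts");
    }
}
```

Whether the collector should retry indefinitely like the agents or give up 
after a bounded number of attempts is a policy choice; the sketch only shows 
that either outcome can be made explicit.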

> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>
> When the name node returns errors to the collector, at some point the 
> collector dies halfway. This behavior should be changed to either resemble 
> the agents and keep trying, or to shut down completely. Instead, what I'm 
> seeing is that the collector logs that it's shutting down, and the 
> var/pidDir/Collector.pid file gets removed, but the collector continues to 
> run, albeit not handling new data. Instead, this log entry is repeated ad 
> infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
