[ 
https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865946#action_12865946
 ] 

Ari Rabkin commented on CHUKWA-487:
-----------------------------------

Agents will retry if a collector goes down, without losing data, so it should 
be totally safe to let it crash and reboot.  And I think restarting from an 
entirely-clean state is preferable to fixing it.  This also sidesteps questions 
of file handles breaking.
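
To make the crash-and-restart idea concrete, here is a minimal sketch of a fail-fast guard around the collector's write path. Everything in it is an assumption for illustration: `FailFastWriter`, `Sink`, the failure threshold, and the `halt` callback are hypothetical names, not existing Chukwa classes or settings. It only shows the shape of "give up cleanly after repeated NN errors and let the agents retry".

```java
import java.io.IOException;

/** Hypothetical fail-fast guard; not part of Chukwa's actual API. */
public class FailFastWriter {

    /** Stand-in for whatever actually performs the HDFS write. */
    public interface Sink {
        void write(byte[] chunk) throws IOException;
    }

    private final Sink sink;
    private final int maxConsecutiveFailures; // assumed tunable threshold
    private final Runnable halt;              // what "crash" means for this process
    private int consecutiveFailures = 0;

    public FailFastWriter(Sink sink, int maxConsecutiveFailures, Runnable halt) {
        this.sink = sink;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.halt = halt;
    }

    /**
     * Returns true if the chunk was written. Returns false on failure, in
     * which case the caller should not acknowledge the data, so the agent
     * resends it later.
     */
    public boolean write(byte[] chunk) {
        try {
            sink.write(chunk);
            consecutiveFailures = 0; // any success resets the counter
            return true;
        } catch (IOException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= maxConsecutiveFailures) {
                // Stop the whole process rather than limping along half-dead.
                halt.run();
            }
            return false;
        }
    }
}
```

In a real deployment the `halt` callback could simply call `System.exit(1)`, leaving the restart to whatever supervises the collector. Taking it as a `Runnable` is only there so the behavior can be exercised without killing the JVM.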

How common are SNN restores with data loss?   I had assumed that after a file 
was closed, file data was as durable as NN metadata on disk.

> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>
> When the name node returns errors to the collector, at some point the 
> collector dies halfway. This behavior should be changed either to resemble 
> the agents and keep trying, or to shut down completely. Instead, what I'm 
> seeing is that the collector logs that it's shutting down, and the 
> var/pidDir/Collector.pid file gets removed, but the collector continues to 
> run, albeit not handling new data. Instead, this log entry is repeated ad 
> infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
