[
https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865943#action_12865943
]
Bill Graham commented on CHUKWA-487:
------------------------------------
I think the latter approach (trying to handle it in Chukwa) would be
preferable. The question then is what to do with incoming data when HDFS is
down. Could we go into a mode where we stop accepting new data from the agents
while we retry HDFS every N seconds, for up to M seconds, before exiting? We'd
also need to think through what to do if the current file being written to
disappears due to an SNN restore with data loss.
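
To make the proposal concrete, here is a rough sketch of the retry loop I have
in mind. This is not actual Chukwa code; CollectorFacade, hdfsAvailable(),
pauseIntake() and resumeIntake() are hypothetical stand-ins for whatever
plumbing we would hook into. The idea: pause intake, probe HDFS every N
seconds, and give up after M seconds so the collector shuts down completely
instead of lingering half-alive.

    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the proposed behavior (illustrative names, not
    // Chukwa APIs): stop accepting new chunks from agents, retry HDFS every
    // N seconds for up to M seconds, then tell the caller to shut down.
    public class HdfsRetryPolicy {

        private static final long RETRY_INTERVAL_SECONDS = 10; // "N"
        private static final long MAX_WAIT_SECONDS = 300;      // "M"

        // Returns true if HDFS came back within the window, false if the
        // collector should perform a complete shutdown.
        public boolean waitForHdfs(CollectorFacade collector) throws InterruptedException {
            collector.pauseIntake(); // stop receiving new data from the agents
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(MAX_WAIT_SECONDS);
            while (System.nanoTime() < deadline) {
                if (collector.hdfsAvailable()) {
                    collector.resumeIntake();
                    return true;
                }
                TimeUnit.SECONDS.sleep(RETRY_INTERVAL_SECONDS);
            }
            return false; // better a clean exit than a half-dead collector
        }

        // Minimal stand-in for the real collector/writer plumbing.
        public interface CollectorFacade {
            boolean hdfsAvailable();
            void pauseIntake();
            void resumeIntake();
        }
    }

Whether we exit or keep retrying forever (as the agents do), the key point is
that the process should never end up in the current state, where the PID file
is gone but the JVM keeps running without handling data.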
> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>
> Key: CHUKWA-487
> URL: https://issues.apache.org/jira/browse/CHUKWA-487
> Project: Hadoop Chukwa
> Issue Type: Bug
> Components: data collection
> Affects Versions: 0.4.0
> Reporter: Bill Graham
> Priority: Blocker
>
> When the NameNode returns errors to the collector, at some point the
> collector dies partway through shutdown. This behavior should be changed to
> either resemble the agents and keep retrying, or to shut down completely.
> Instead, what I'm seeing is that the collector logs that it's shutting down
> and the var/pidDir/Collector.pid file gets removed, but the collector
> continues to run, albeit without handling new data, and this log entry is
> repeated ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root -
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root -
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root -
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.