[ 
https://issues.apache.org/jira/browse/CHUKWA-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920283#action_12920283
 ] 

Bill Graham commented on CHUKWA-533:
------------------------------------

Example log excerpts from when a NameNode (NN) is unexpectedly rebooted:

- From an active collector taking traffic:
{noformat}
2010-10-12 04:05:13,721 INFO Timer-1 root - 
stats:ServletCollector,numberHTTPConnection:2,numberchunks:105
2010-10-12 04:05:15,508 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=24724 dataRate=823
2010-10-12 04:05:45,515 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:05:46,894 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:05:59,899 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:06:03,903 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
2010-10-12 04:06:07,502 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
2010-10-12 04:06:11,506 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
2010-10-12 04:06:13,733 INFO Timer-1 root - 
stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-10-12 04:06:15,509 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
2010-10-12 04:06:15,521 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:06:19,512 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
2010-10-12 04:06:23,517 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 6 time(s).
2010-10-12 04:06:27,521 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 7 time(s).
2010-10-12 04:06:31,525 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 8 time(s).
2010-10-12 04:06:35,529 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 9 time(s).
2010-10-12 04:06:38,534 WARN LeaseChecker DFSClient - Problem renewing lease 
for DFSClient_-1129462781
2010-10-12 04:06:43,545 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:06:45,527 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:06:47,550 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
2010-10-12 04:06:51,553 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
2010-10-12 04:06:55,556 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
2010-10-12 04:06:59,215 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
2010-10-12 04:07:03,219 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
2010-10-12 04:07:07,222 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 6 time(s).
2010-10-12 04:07:11,225 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 7 time(s).
2010-10-12 04:07:13,746 INFO Timer-1 root - 
stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-10-12 04:07:15,230 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 8 time(s).
2010-10-12 04:07:15,534 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:07:19,235 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 9 time(s).
2010-10-12 04:07:22,237 WARN LeaseChecker DFSClient - Problem renewing lease 
for DFSClient_-1129462781
2010-10-12 04:07:27,242 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:07:31,246 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
2010-10-12 04:07:35,251 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
2010-10-12 04:07:39,254 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
2010-10-12 04:07:43,258 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
2010-10-12 04:07:45,541 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:07:47,261 INFO LeaseChecker Client - Retrying connect to server: 
hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
{noformat}

- From an idle collector that got traffic as soon as the active collector died:
{noformat}
2010-10-12 04:10:33,690 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:11:02,165 INFO Timer-1 root - 
stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-10-12 04:11:03,688 WARN Timer-196 SeqFileWriter - Got an exception in 
rotate
2010-10-12 04:11:03,688 WARN LeaseChecker DFSClient - Problem renewing lease 
for DFSClient_23442132
2010-10-12 04:11:03,693 FATAL Timer-196 SeqFileWriter - IO Exception in rotate. 
Exiting!
2010-10-12 04:11:03,696 INFO Timer-3 SeqFileWriter - 
stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:11:03,697 WARN Shutdown SeqFileWriter - cannot rename dataSink 
file:/chukwa/logs/201012035922632_c18rbhadoopwkrr10n1cnetcom_4435f4d212b9ca438d77e7e.chukwa
{noformat}
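
The second excerpt shows a single IOException in rotate taking the whole collector down. As a rough illustration of the "retry, then fail back to the agent" behavior this issue asks for, here is a minimal sketch. It is not Chukwa code: RetryingWriter, HdfsAction, Result, and the backoff parameters are all hypothetical names made up for this example, and the real SeqFileWriter would need its own integration.

```java
import java.io.IOException;

/** Hypothetical helper: retries a failing file-system action instead of exiting. */
public class RetryingWriter {

    /** Outcome handed back to the caller; the process itself keeps running. */
    public enum Result { OK, FAILED_BACK_TO_AGENT }

    /** Any operation that may throw an IOException, e.g. rotating a sink file. */
    public interface HdfsAction {
        void run() throws IOException;
    }

    private final int maxAttempts;
    private final long initialBackoffMs;

    public RetryingWriter(int maxAttempts, long initialBackoffMs) {
        this.maxAttempts = maxAttempts;
        this.initialBackoffMs = initialBackoffMs;
    }

    /**
     * Runs the action up to maxAttempts times, doubling the wait between
     * attempts. Never calls System.exit; on exhaustion it reports failure
     * so the caller can signal the agent to resend.
     */
    public Result attempt(HdfsAction action) {
        long backoffMs = initialBackoffMs;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                action.run();
                return Result.OK;
            } catch (IOException e) {
                System.err.println("attempt " + i + " of " + maxAttempts
                        + " failed: " + e.getMessage());
                if (i < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt and stop retrying.
                        Thread.currentThread().interrupt();
                        return Result.FAILED_BACK_TO_AGENT;
                    }
                    backoffMs *= 2;
                }
            }
        }
        return Result.FAILED_BACK_TO_AGENT;
    }
}
```

With something along these lines, a collector could map FAILED_BACK_TO_AGENT to an error response on the incoming HTTP post, so the agent holds on to its chunks and tries again later (or against another collector) while the collector process stays up.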

> Improve fault-tolerance of collectors.
> --------------------------------------
>
>                 Key: CHUKWA-533
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-533
>             Project: Chukwa
>          Issue Type: Improvement
>          Components: data collection
>            Reporter: Bill Graham
>
> There are currently a number of ways that a collector can die, typically due 
> to errors on a DN or a NN that's being restarted. A collector should have 
> some combination of retry logic followed by failing back to the agent, but 
> the collector process should not die.

