[ 
https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867657#action_12867657
 ] 

Jerome Boulon commented on CHUKWA-487:
--------------------------------------

Hi,
I looked quickly at the code, and that's something I changed in Honu because 
of a possible "virtual dead-lock".
The main thread (writer) tries to acquire the lock to write to the sequence 
file while the shutdownHook also tries to get it.
The issue is that the writer holds the lock for about 2 minutes (NN retries), 
and chances are that, statistically, the next one to get the lock will also be 
a queue add instead of the close.
One quick workaround for now would be to wait no longer than xxx sec on the 
close, and also to configure your NN client to fail fast instead of retrying 
again and again when the NN is not available.
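
To illustrate the "wait no longer than xxx sec on the close" idea, here is a 
minimal sketch using a timed lock attempt. The class and method names are 
hypothetical (this is not the actual Chukwa SeqFileWriter or Honu code), and 
it assumes the writer guards its state with a java.util.concurrent 
ReentrantLock rather than a synchronized block.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; not the actual Chukwa SeqFileWriter or Honu code.
final class BoundedCloseSketch {
    // Assumption: the same lock also guards the add/write path.
    private final ReentrantLock lock = new ReentrantLock();
    private volatile boolean closed = false;

    // Tries to close, waiting at most maxWaitSeconds for the lock.
    // Returns true if the close ran, false if the wait timed out
    // or the calling thread was interrupted.
    boolean close(long maxWaitSeconds) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(maxWaitSeconds, TimeUnit.SECONDS);
            if (!acquired) {
                return false; // give up instead of blocking the shutdown hook
            }
            closed = true; // the real close of the sequence file would go here
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```

With something like this, a shutdown hook that gets false back can log the 
failure and let the JVM exit instead of waiting behind the NN retries.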

Longer term, the fix could be to back off in the SeqFileWriter.add method. 
In Honu, I've put timeouts around all locks & addQueue to make sure that 
nobody gets blocked on a lock. I also have a TRY_LATER error that is returned 
if the add takes more than 2 seconds, and a READY/!READY state to 
accept/reject incoming requests.
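
A rough sketch of that pattern, for illustration only: an add that gives up 
after a timeout and reports TRY_LATER, plus a ready flag that rejects requests 
up front. The class name, the NOT_READY status and the use of a bounded 
BlockingQueue are assumptions made for the example, not the actual Honu 
implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and structure are not taken from Honu.
final class GuardedWriterSketch {
    enum Status { OK, TRY_LATER, NOT_READY }

    private final BlockingQueue<String> queue;
    private volatile boolean ready = true;

    GuardedWriterSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Queues a chunk, waiting at most timeoutMillis for space.
    Status add(String chunk, long timeoutMillis) {
        if (!ready) {
            return Status.NOT_READY; // reject immediately, never block
        }
        try {
            boolean queued = queue.offer(chunk, timeoutMillis, TimeUnit.MILLISECONDS);
            return queued ? Status.OK : Status.TRY_LATER;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return Status.TRY_LATER;
        }
    }

    // Flipped to false when the writer cannot reach its sink (e.g. NN down).
    void setReady(boolean ready) {
        this.ready = ready;
    }
}
```

A caller would pass 2000 as the timeout to match the 2 seconds mentioned 
above, and retry or back off when it sees TRY_LATER.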

> Collector left in a bad state after temprorary NN outage
> --------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>         Attachments: CHUKWA-487.threaddump.txt
>
>
> When the name node returns errors to the collector, at some point the 
> collector dies halfway. This behavior should be changed to either resemble 
> the agents and keep trying, or to shut down completely. Instead, what I'm 
> seeing is that the collector logs that it's shutting down, and the 
> var/pidDir/Collector.pid file gets removed, but the collector continues to 
> run, albeit not handling new data. Instead, this log entry is repeated ad 
> infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - 
> stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.