[ 
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739727#action_12739727
 ] 

Ari Rabkin commented on CHUKWA-369:
-----------------------------------

@Jerome: I don't see how bailing out to avoid disk-full solves the problem of 
collectors crashing.  The failure scenario I'm worried about is that 
LocalWriter writes the data, and then the collector dies in a non-recoverable 
way.  The data on disk is now useless, and the Right Thing is for the agent to 
retransmit to a different collector.

Certainly, HDFS improvements would help reduce this problem. But I think I can 
implement my proposal here in a week or so -- and that gets us reliability even 
with previous versions of the filesystem.

@Eric: The point about overloading the namenode is a fair one. Let me propose 
the following modification:

- Instead of querying HDFS directly, agents should send a GET request to a 
collector.  The collector needs to do only a single list every few minutes and 
cache the results to satisfy all the agents.  This radically cuts down on 
traffic to the namenode, and it also isolates the Chukwa DFS from the agents.
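
A rough sketch of the caching idea in the bullet above, written in Python for 
brevity rather than as actual Chukwa code. The names CachedListing and list_fn, 
and the 300-second refresh interval, are assumptions made for illustration. 
list_fn stands in for whatever call the collector would use to list the sink 
directory, and a collector's GET handler would simply return get().

```python
import threading
import time


class CachedListing:
    """Serve one periodically refreshed listing to any number of agents."""

    def __init__(self, list_fn, ttl_seconds=300.0, clock=time.monotonic):
        # list_fn is a hypothetical callable that performs the real
        # (expensive) directory listing against the filesystem.
        self._list_fn = list_fn
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so tests can fake time
        self._lock = threading.Lock()    # handler threads may call get() concurrently
        self._cached = None
        self._fetched_at = None

    def get(self):
        """Return the cached listing, refreshing it if it is older than the TTL."""
        with self._lock:
            now = self._clock()
            stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
            if stale:
                # Only this path touches the namenode: at most once per TTL,
                # no matter how many agents are asking.
                self._cached = list(self._list_fn())
                self._fetched_at = now
            # Hand back a copy so callers cannot mutate the shared cache.
            return list(self._cached)
```

With this shape, the namenode load depends on the number of collectors and the 
refresh interval, not on the number of agents. The cost is that an agent may 
see a listing that is up to one interval out of date.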



> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, 
> quite, since we don't handle collector crashes.  Here's a proposed 
> reliability mechanism.
