[jira] Commented: (CHUKWA-369) proposed reliability mechanism

Eric Yang (JIRA) Wed, 05 Aug 2009 18:21:39 -0700

    [ 
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739857#action_12739857
 ]


Eric Yang commented on CHUKWA-369:
----------------------------------

The HTTP return code should be the only contract between agent and collector.  
If the collector returns HTTP 200, the data is the collector's responsibility 
from that point on.  An asynchronous status check from agent to collector will 
only complicate things, because a busy collector may be unable to answer the 
status check request.  That creates a domino effect: the agent resends chunks, 
and the follow-up status check may fail again, more than once, on the same 
busy collector.
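
To make the "return code is the only contract" idea concrete, here is a 
minimal agent-side sketch.  It is an illustration only, not Chukwa code: 
send_pending, the chunk list, the integer checkpoint and the post callable are 
all hypothetical names.  The assumption is that post() sends one chunk and 
returns the HTTP status code, and that anything other than 200 (including no 
response at all) means the agent still owns the chunk.

```python
from typing import Callable, List


def send_pending(chunks: List[bytes], checkpoint: int,
                 post: Callable[[bytes], int]) -> int:
    """Send chunks starting at `checkpoint`; return the new checkpoint.

    The checkpoint is the index of the first chunk the collector has not
    acknowledged with HTTP 200. No separate status check is made.
    """
    for i in range(checkpoint, len(chunks)):
        try:
            status = post(chunks[i])  # hypothetical transport call
        except OSError:
            # No response at all: the agent keeps ownership of chunk i.
            return i
        if status != 200:
            # Any non-200 answer: resend from this chunk next time.
            return i
    # Every chunk was acknowledged; the collector now owns all of them.
    return len(chunks)
```

On the next pass the agent calls send_pending again with the returned 
checkpoint, so only unacknowledged chunks are resent.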

In summary, there are three ways to solve the problem.

1. A synchronous pipeline.  The agent writes one minute's worth of data and 
waits for the collector to close the file after the one-minute mark before the 
collector returns an HTTP code.  If the collector does not close the file 
properly, no HTTP code is returned, and the agent resends that minute of data 
(or everything since the last checkpoint).  Throughput depends on HDFS I/O 
performance; previous experience with 0.18 and 0.20 yielded around 20MB/s.

2. An asynchronous pipeline.  It is difficult to track the progress of each 
agent against the collectors, and keeping agent status inside the collector 
costs a lot of memory.  A status check request may go unanswered, causing 
frequent retransmission.

3. Use localWriter to write data to the collector node first, then upload it 
to HDFS asynchronously.  The downside is that the collector disk is stressed: 
wear and tear on the disk could result in bad data being injected into HDFS 
without a CRC check, and a collector disk crash means data loss.
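
Option 1 can be sketched from the collector side as follows.  This is a rough 
illustration under stated assumptions, not the Chukwa collector: handle_post, 
sink_dir and name are hypothetical, a local file stands in for the HDFS sink 
file, and a 500 stands in for "no HTTP code is returned".  The point is only 
the ordering: success is reported after the file has been closed, never before.

```python
import os


def handle_post(data: bytes, sink_dir: str, name: str) -> int:
    """Write one batch, close the file, and only then report success."""
    path = os.path.join(sink_dir, name)
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before closing
        # Leaving the `with` block closes the file; a failed close raises.
    except OSError:
        # The write or close failed: do not acknowledge, so the agent resends.
        return 500
    # The file is closed; the collector now owns the data.
    return 200
```

The agent blocks on this call for the whole write, which is why throughput of 
the sink bounds the pipeline in this design.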

There is really no perfect solution here, but option 1 is the least error 
prone.  As Hadoop's performance improves, Chukwa benefits too.



> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, 
> quite, since we don't handle collector crashes.  Here's a proposed 
> reliability mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
