[
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739857#action_12739857
]
Eric Yang commented on CHUKWA-369:
----------------------------------
The HTTP return code should be the only contract between agent and collector.
If the collector returns HTTP 200, the data is the collector's responsibility
from that point on. An asynchronous status check from agent to collector will
only complicate things, because the collector could be busy and unable to
answer the status check request. That creates a domino effect in which the
agent resends chunks, because subsequent status checks may fail more than once
on the busy collector.
To summarize, there are three ways to solve the problem:
1. A synchronous pipeline. The agent writes one minute's worth of data, waits
for the collector to close the file after the one-minute mark, and only then
receives the HTTP return code. If the collector does not close the file
properly, no HTTP code is returned, and the agent resends that minute's worth
of data (or everything since the last checkpoint). Throughput depends on HDFS
I/O performance; previous experience with 0.18 and 0.20 yielded around 20MB/s.
2. An asynchronous pipeline. It is difficult to track the progress of each
agent across the collectors, and keeping agent status inside the collector
costs a lot of memory. Status check requests may go unanswered and cause
frequent retransmission.
3. Use localWriter to write data to the collector node's disk first, then
upload it to HDFS asynchronously. The downside is that the collector disk is
stressed, and wear and tear on that disk could result in bad data being
injected into HDFS without a CRC check. Collector disk crash = data loss.
There is really no perfect solution here, but option 1 is the least error
prone. As long as Hadoop improves its performance, Chukwa benefits too.
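
To make the option 1 contract concrete, here is a rough sketch of the agent
side. It is not Chukwa code: the names Agent, send_batch, flush, and the post
callable are hypothetical, and the collector is assumed to return 200 only
after the file has been closed. Anything other than a 200 (an error status, a
timeout, no answer at all) leaves the checkpoint where it was, so the same data
is resent on the next attempt.

```python
def send_batch(batch, post, max_attempts=3):
    """Post one batch; return True only if the collector answers HTTP 200.

    `post` is a hypothetical callable standing in for the HTTP request.
    It returns a status code, or raises OSError if no response arrives.
    """
    for _ in range(max_attempts):
        try:
            if post(batch) == 200:
                return True
        except OSError:
            # No HTTP code came back; treat it like a failed attempt.
            pass
    return False


class Agent:
    """Toy agent that tracks a checkpoint into an in-memory log."""

    def __init__(self, post):
        self.post = post
        self.checkpoint = 0  # index of the first unacknowledged record

    def flush(self, log):
        """Send everything since the last checkpoint; return the checkpoint."""
        pending = log[self.checkpoint:]
        if pending and send_batch(pending, self.post):
            # The collector owns the data now, so the checkpoint can advance.
            self.checkpoint = len(log)
        return self.checkpoint
```

The only state the agent keeps is the checkpoint, and the collector keeps no
per-agent state at all, which is the point of using the return code as the
sole contract. The cost is that the agent blocks until the collector has
closed the file.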
> proposed reliability mechanism
> ------------------------------
>
> Key: CHUKWA-369
> URL: https://issues.apache.org/jira/browse/CHUKWA-369
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: data collection
> Affects Versions: 0.3.0
> Reporter: Ari Rabkin
> Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't,
> quite, since we don't handle collector crashes. Here's a proposed
> reliability mechanism.