[
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739727#action_12739727
]
Ari Rabkin commented on CHUKWA-369:
-----------------------------------
@Jerome: I don't see how bailing out to avoid disk-full solves the problem of
collectors crashing. The failure scenario I'm worried about is that
LocalWriter writes the data, and then the collector dies in a non-recoverable
way. The data on disk is now useless, and the Right Thing is for the agent to
retransmit to a different collector.
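To make that retransmit behavior concrete, here is a minimal sketch of
agent-side failover, assuming the agent is configured with a list of
collector URLs; the class and method names are hypothetical illustrations,
not existing Chukwa code.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    // Hypothetical agent-side failover sketch (not actual Chukwa classes).
    public class FailoverSender {
        private final List<String> collectorUrls; // known collectors, e.g. from conf
        private int current = 0;

        public FailoverSender(List<String> collectorUrls) {
            this.collectorUrls = collectorUrls;
        }

        /** Try each collector in turn; only report success once one of them
         *  has accepted the data, so a dead collector just means we
         *  retransmit the same chunk to a different collector. */
        public void send(byte[] chunkData) throws IOException {
            for (int attempts = 0; attempts < collectorUrls.size(); attempts++) {
                String target = collectorUrls.get(current);
                try {
                    postTo(target, chunkData);
                    return; // delivered; the agent may now advance its checkpoint
                } catch (IOException e) {
                    // collector unreachable or crashed: rotate to the next one
                    current = (current + 1) % collectorUrls.size();
                }
            }
            throw new IOException("no collector accepted the data; retry later");
        }

        private void postTo(String collectorUrl, byte[] data) throws IOException {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(collectorUrl).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.getOutputStream().write(data);
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                throw new IOException("collector returned " + conn.getResponseCode());
            }
        }
    }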
Certainly, HDFS improvements would help reduce this problem. But I think I can
implement my proposal here in a week or so -- and that gets us reliability even
with previous versions of the filesystem.
@Eric: The point about overloading the name node is a fair one. Let me propose
the following modification:
- Instead of querying HDFS directly, agents should do a GET request to a
collector. The collector only has to do a single listing every few minutes,
and cache the results, to satisfy all the agents. This radically cuts down
on traffic to the namenode, and it also isolates the Chukwa DFS from the
agents. A sketch of such a collector-side cache is below.
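For concreteness, a minimal sketch of what that collector-side cache could
look like, using the Hadoop FileSystem API; the class name, the refresh
interval, and the single sink directory are assumptions for illustration,
not existing Chukwa code.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical collector-side cache of the HDFS sink listing.
    public class SinkListingCache {
        private static final long REFRESH_INTERVAL_MS = 5 * 60 * 1000; // "every few minutes"

        private final FileSystem fs;
        private final Path sinkDir;
        private List<String> cachedListing = new ArrayList<String>();
        private long lastRefresh = 0;

        public SinkListingCache(Configuration conf, Path sinkDir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.sinkDir = sinkDir;
        }

        /** Served to agents over a GET; only hits the namenode when the cache
         *  is stale, so many agents polling still produces just one listing
         *  per refresh interval. */
        public synchronized List<String> getCommittedFiles() throws IOException {
            long now = System.currentTimeMillis();
            if (now - lastRefresh > REFRESH_INTERVAL_MS) {
                List<String> fresh = new ArrayList<String>();
                for (FileStatus stat : fs.listStatus(sinkDir)) {
                    fresh.add(stat.getPath().getName());
                }
                cachedListing = fresh;
                lastRefresh = now;
            }
            return cachedListing;
        }
    }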
> proposed reliability mechanism
> ------------------------------
>
> Key: CHUKWA-369
> URL: https://issues.apache.org/jira/browse/CHUKWA-369
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: data collection
> Affects Versions: 0.3.0
> Reporter: Ari Rabkin
> Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't,
> quite, since we don't handle collector crashes. Here's a proposed
> reliability mechanism.