[
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739099#action_12739099
]
Ari Rabkin commented on CHUKWA-369:
-----------------------------------
Right now, collectors just blindly send back OK after every chunk, even if the
data isn't stable on disk. The OK is sent *after* the data is handed to a
Writer, and therefore after Writer.add() returns. But Writer.add() is void,
and so we get no verification that the write committed.
I'd like to have Writer.add() return one of two things:
either an OK, or else a "witness string", which gets passed back to the client.
"OK" means that the data is now the collector's responsibility, and the agent
should advance its checkpointed state.
The witness string is a filename in HDFS and file length. Periodically, the
agent checks the length of the file; if it exceeds the specified length, then
the data has been committed to the file, and the agent can again advance its
checkpoint. If the data hasn't committed within the specified period, then the
agent stops all running adaptors, and resumes from the last checkpoint.
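
To make the two possible return values concrete, here is a rough sketch of
what they might look like. The names CommitStatus, witness() and
isCommitted() are hypothetical and not part of the current Writer interface.
The observed file length is passed in as a parameter instead of being read
from HDFS, so the example stays self-contained.

```java
import java.util.Objects;

/** Hypothetical result of Writer.add(): either a plain OK or a witness. */
final class CommitStatus {
    // A null path stands for "OK": the data is already the collector's responsibility.
    private final String path;
    // File length the agent must observe before treating the data as committed.
    private final long length;

    private CommitStatus(String path, long length) {
        this.path = path;
        this.length = length;
    }

    static CommitStatus ok() {
        return new CommitStatus(null, 0L);
    }

    static CommitStatus witness(String path, long length) {
        return new CommitStatus(Objects.requireNonNull(path), length);
    }

    boolean isOk() { return path == null; }
    String path() { return path; }
    long length() { return length; }

    /**
     * Agent-side check. observedLength is the current length of the file
     * named by path(), however the agent obtained it. The comparison is
     * strict because the description says "exceeds"; whether equality
     * should also count is an open assumption.
     */
    boolean isCommitted(long observedLength) {
        return isOk() || observedLength > length;
    }
}
```

An agent holding a witness would call isCommitted() on each periodic check
and advance its checkpoint the first time it returns true.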
This is much easier to implement if we can assume a few things:
1) A single collector will commit data from a single Agent in order.
2) A single agent won't fail over to a new collector unless the previous
collector failed; therefore, even if writes are split across collectors, we're
still guaranteed commit-in-order.
3) Collector failures are rare, and therefore agents don't need to update their
checkpoints all that often, and can safely rewind several minutes in the event
of failure.
All these assumptions are currently true; I just want to document them and
explain clearly that they can't be violated without breaking the reliability
mechanism.
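
One way to see why commit-in-order matters: with it, the agent only ever
needs to look at the oldest outstanding witness. The sketch below is
illustrative only. PendingCommits, sent() and poll() are hypothetical names,
and the file-length lookup is injected as a function so that no HDFS client
is needed to run it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.ToLongFunction;

/** Hypothetical agent-side bookkeeping for chunks that await commit. */
final class PendingCommits {
    private static final class Entry {
        final String path;    // file named by the witness
        final long length;    // length the file must exceed
        final long offset;    // checkpoint offset to adopt once committed
        final long sentAtMs;  // when the chunk was sent
        Entry(String path, long length, long offset, long sentAtMs) {
            this.path = path;
            this.length = length;
            this.offset = offset;
            this.sentAtMs = sentAtMs;
        }
    }

    private final Deque<Entry> pending = new ArrayDeque<>();
    private final long timeoutMs;
    private long checkpoint;

    PendingCommits(long initialCheckpoint, long timeoutMs) {
        this.checkpoint = initialCheckpoint;
        this.timeoutMs = timeoutMs;
    }

    /** Record a witness returned for a chunk ending at the given offset. */
    void sent(String path, long length, long offset, long nowMs) {
        pending.addLast(new Entry(path, length, offset, nowMs));
    }

    /**
     * Advance the checkpoint past every committed chunk at the head of the
     * queue. Returns false if the oldest uncommitted chunk has timed out,
     * in which case the caller should rewind to checkpoint().
     */
    boolean poll(ToLongFunction<String> fileLength, long nowMs) {
        while (!pending.isEmpty()) {
            Entry head = pending.peekFirst();
            if (fileLength.applyAsLong(head.path) > head.length) {
                checkpoint = head.offset;   // in-order commit: safe to advance
                pending.removeFirst();
            } else if (nowMs - head.sentAtMs > timeoutMs) {
                pending.clear();            // everything after checkpoint is resent
                return false;
            } else {
                break;                      // head not committed yet, keep waiting
            }
        }
        return true;
    }

    long checkpoint() { return checkpoint; }
}
```

If commits could arrive out of order, a committed chunk behind an
uncommitted one could not safely move the checkpoint, and the agent would
have to track gaps instead of a single offset.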
> proposed reliability mechanism
> ------------------------------
>
> Key: CHUKWA-369
> URL: https://issues.apache.org/jira/browse/CHUKWA-369
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: data collection
> Affects Versions: 0.3.0
> Reporter: Ari Rabkin
> Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't,
> quite, since we don't handle collector crashes. Here's a proposed
> reliability mechanism.