[
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744862#action_12744862
]
Ari Rabkin commented on CHUKWA-369:
-----------------------------------
OK. I've now modified AsyncAckSender so that it can take a separate list of
collectors to be used for checking file lengths.
But I just realized there are two deeper problems with my approach.
1) Suppose that an ack doesn't arrive. What then? The code to rewind adaptors
to the last checkpoint and resume hasn't been written yet, but I think it's
pretty straightforward.
2) It's possible that an agent writes chunks 1, 2, and 3 to collector A, then
fails over to collector B and writes chunks 4 and 5. Suppose we then get acks
for 1, 2, 4, and 5. The right thing to do is to apply the acks for 1 and 2,
hold the acks for 4 and 5, and, if the timeout fires, restart from chunk 3.
But right now we just assume that an ack for chunk n+1 implies that chunks 0
through n have all committed, which isn't really right. (See the bookkeeping
sketch below.)
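To make that concrete, here's a rough sketch of the per-adaptor bookkeeping
both problems seem to call for. None of this is in the attached patch; the
class name, the chunk sequence numbering, and the timeout constant are all
illustrative assumptions:

    import java.util.TreeSet;

    // Illustrative sketch only, not code from the patch: hold acks until they
    // form a contiguous prefix (so an ack for chunk n+1 no longer implies that
    // chunks up to n committed), and when the ack timeout fires, rewind the
    // adaptor to the first gap and resend from there.
    class ChunkAckTracker {
      static final long ACK_TIMEOUT_MS = 60 * 1000;   // assumed timeout

      private long highestContiguous = 0;             // chunks <= this are committed
      private final TreeSet<Long> heldAcks = new TreeSet<Long>(); // acks past a gap
      private long lastAckTime = System.currentTimeMillis();

      // record an ack; returns the highest chunk known to be safely committed
      synchronized long onAck(long chunkSeq) {
        lastAckTime = System.currentTimeMillis();
        heldAcks.add(chunkSeq);
        // advance the committed frontier only as far as the held acks allow
        while (heldAcks.remove(highestContiguous + 1)) {
          highestContiguous++;
        }
        return highestContiguous;
      }

      // true once no ack has arrived for too long and a rewind is needed
      synchronized boolean timedOut() {
        return System.currentTimeMillis() - lastAckTime > ACK_TIMEOUT_MS;
      }

      // the chunk to rewind the adaptor to and resend from on timeout
      synchronized long restartPoint() {
        return highestContiguous + 1;
      }
    }

With acks 1, 2, 4, 5 this commits through chunk 2, holds 4 and 5, and reports
3 as the restart point, which is exactly the behavior described in (2).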
There are two plausible fixes. The first is to automatically reset each
running adaptor whenever we switch collectors (roughly sketched below). This
makes (2) very easy to solve, at the expense of making dynamic load balancing
harder. The second is to use timeouts and really confront (2) head-on.
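For reference, the first fix is basically a barrier in the failover path,
something like the following. Again, the Adaptor interface here is a stand-in
for whatever the agent actually exposes, not the real API:

    import java.util.List;

    // Sketch of fix (1): treat a collector switch as a barrier. Before anything
    // is sent to the new collector, every running adaptor is rewound to the last
    // position the old collector committed, so acks never straddle two collectors.
    class ResetOnFailover {
      interface Adaptor {
        long committedOffset();       // last position acked by the old collector
        void rewindTo(long offset);   // re-read and resend from this position
      }

      void onCollectorSwitch(List<Adaptor> runningAdaptors) {
        for (Adaptor a : runningAdaptors) {
          // drop everything unacked; it will be resent to the new collector
          a.rewindTo(a.committedOffset());
        }
        // ...then point the sender at the new collector and resume normally
      }
    }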
> proposed reliability mechanism
> ------------------------------
>
> Key: CHUKWA-369
> URL: https://issues.apache.org/jira/browse/CHUKWA-369
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: data collection
> Affects Versions: 0.3.0
> Reporter: Ari Rabkin
> Assignee: Ari Rabkin
> Fix For: 0.3.0
>
> Attachments: delayedAcks.patch
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't,
> quite, since we don't handle collector crashes. Here's a proposed
> reliability mechanism.