[
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744862#action_12744862
]
Ari Rabkin commented on CHUKWA-369:
-----------------------------------
OK. I've now modified AsyncAckSender so that it can take a separate list of
collectors to be used for checking file lengths.
But I just realized there are two deeper problems with my approach.
1) Suppose that an ack doesn't arrive. What then? The code to rewind adaptors
to the last checkpoint and resume hasn't been written yet, but I think it's
pretty straightforward.
2) It's possible that an agent writes chunks 1, 2, and 3 to collector A, then
fails over to collector B and writes chunks 4 and 5. Suppose we then get acks
for 1, 2, 4, and 5. The right thing to do is to apply the acks for 1 and 2,
hold the acks for 4 and 5, and, if the timeout fires, restart from chunk 3.
But right now we just assume that an ack for chunk n+1 implies that chunks 0
through n have all committed, which isn't really right. (See the bookkeeping
sketch below.)
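To make that concrete, here's a rough sketch of the per-adaptor bookkeeping
both problems seem to call for. None of this is in the attached patch; the
class name, the chunk sequence numbering, and the timeout constant are all
illustrative assumptions:

    import java.util.TreeSet;

    // Illustrative sketch only, not code from the patch: hold acks until they
    // form a contiguous prefix (so an ack for chunk n+1 no longer implies that
    // chunks up to n committed), and when the ack timeout fires, rewind the
    // adaptor to the first gap and resend from there.
    class ChunkAckTracker {
      static final long ACK_TIMEOUT_MS = 60 * 1000;   // assumed timeout

      private long highestContiguous = 0;             // chunks <= this are committed
      private final TreeSet<Long> heldAcks = new TreeSet<Long>(); // acks past a gap
      private long lastAckTime = System.currentTimeMillis();

      // record an ack; returns the highest chunk known to be safely committed
      synchronized long onAck(long chunkSeq) {
        lastAckTime = System.currentTimeMillis();
        heldAcks.add(chunkSeq);
        // advance the committed frontier only as far as the held acks allow
        while (heldAcks.remove(highestContiguous + 1)) {
          highestContiguous++;
        }
        return highestContiguous;
      }

      // true once no ack has arrived for too long and a rewind is needed
      synchronized boolean timedOut() {
        return System.currentTimeMillis() - lastAckTime > ACK_TIMEOUT_MS;
      }

      // the chunk to rewind the adaptor to and resend from on timeout
      synchronized long restartPoint() {
        return highestContiguous + 1;
      }
    }

With acks 1, 2, 4, 5 this commits through chunk 2, holds 4 and 5, and reports
3 as the restart point, which is exactly the behavior described in (2).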
There are two plausible fixes. The first is to automatically reset each
running adaptor whenever we switch collectors (roughly sketched below). This
makes (2) very easy to solve, at the expense of making dynamic load balancing
harder. The second is to use timeouts and really confront (2) head-on.
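For reference, the first fix is basically a barrier in the failover path,
something like the following. Again, the Adaptor interface here is a stand-in
for whatever the agent actually exposes, not the real API:

    import java.util.List;

    // Sketch of fix (1): treat a collector switch as a barrier. Before anything
    // is sent to the new collector, every running adaptor is rewound to the last
    // position the old collector committed, so acks never straddle two collectors.
    class ResetOnFailover {
      interface Adaptor {
        long committedOffset();       // last position acked by the old collector
        void rewindTo(long offset);   // re-read and resend from this position
      }

      void onCollectorSwitch(List<Adaptor> runningAdaptors) {
        for (Adaptor a : runningAdaptors) {
          // drop everything unacked; it will be resent to the new collector
          a.rewindTo(a.committedOffset());
        }
        // ...then point the sender at the new collector and resume normally
      }
    }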
> proposed reliability mechanism
> ------------------------------
>
> Key: CHUKWA-369
> URL: https://issues.apache.org/jira/browse/CHUKWA-369
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: data collection
> Affects Versions: 0.3.0
> Reporter: Ari Rabkin
> Assignee: Ari Rabkin
> Fix For: 0.3.0
>
> Attachments: delayedAcks.patch
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't,
> quite, since we don't handle collector crashes. Here's a proposed
> reliability mechanism.