[ 
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739099#action_12739099
 ] 

Ari Rabkin commented on CHUKWA-369:
-----------------------------------

Right now, collectors just blindly send back OK after every chunk, even if the 
data isn't stable on disk.  The OK is sent *after* the data is handed to a 
Writer, and therefore after Writer.add() returns.  But Writer.add() is void, 
and so we get no verification that the write committed.

I'd like to have Writer.add() return one of two things:
either an OK, or else a "Witness string", which gets passed back to the client.  
"OK" means that the data is now the collector's responsibility, and the agent 
should advance its checkpointed state.  
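
As a rough illustration of the two return values described above, here is a 
minimal Java sketch. The names CommitStatus, pending(), isOk() and 
ReliableWriter are hypothetical and do not exist in Chukwa today; the only 
thing taken from the proposal is that add() returns either an OK or a witness 
made of an HDFS file name and a file length.

```java
import java.util.List;

// Hypothetical result type for Writer.add(); not part of the current Chukwa API.
final class CommitStatus {
    // Shared instance meaning "the data is now the collector's responsibility".
    static final CommitStatus OK = new CommitStatus(null, -1L);

    final String file;   // HDFS file name; null when the status is OK
    final long length;   // file length the agent should compare against

    private CommitStatus(String file, long length) {
        this.file = file;
        this.length = length;
    }

    // Builds a witness: the data is not yet known to be stable on disk.
    static CommitStatus pending(String file, long length) {
        if (file == null) {
            throw new IllegalArgumentException("witness needs a file name");
        }
        return new CommitStatus(file, length);
    }

    boolean isOk() {
        return file == null;
    }
}

// Hypothetical writer interface in which add() is no longer void.
interface ReliableWriter {
    CommitStatus add(List<String> chunks);
}
```

With a type like this, the collector could forward the witness to the agent 
instead of an unconditional OK whenever isOk() is false.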

The witness string is a filename in HDFS and file length.  Periodically, the 
agent checks the length of the file; if it exceeds the specified length, then 
the data has been committed to the file, and the agent can again advance its 
checkpoint.  If the data hasn't committed within the specified period, then the 
agent stops all running adaptors, and resumes from the last checkpoint.  
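
The agent-side check could look roughly like the sketch below. Everything 
named here (PendingCommits, track(), poll(), the fileLength lookup, the 
millisecond timeout) is an assumption made for illustration, not existing 
Chukwa code. It also assumes witnesses commit in the order they were issued, 
so only the oldest outstanding one needs checking. The strict ">" mirrors 
"exceeds" in the description; ">=" may be the right test depending on how the 
witness length is defined.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.ToLongFunction;

// Hypothetical agent-side tracker for outstanding witness strings.
final class PendingCommits {
    private static final class Witness {
        final String file;    // HDFS file named in the witness
        final long length;    // file length named in the witness
        final long offset;    // checkpoint to adopt once committed
        final long deadline;  // time after which the agent gives up

        Witness(String file, long length, long offset, long deadline) {
            this.file = file;
            this.length = length;
            this.offset = offset;
            this.deadline = deadline;
        }
    }

    private final Deque<Witness> pending = new ArrayDeque<>();
    private final long timeoutMillis;
    private long checkpoint;

    PendingCommits(long initialCheckpoint, long timeoutMillis) {
        this.checkpoint = initialCheckpoint;
        this.timeoutMillis = timeoutMillis;
    }

    // Records a witness returned by the collector.
    void track(String file, long length, long offset, long nowMillis) {
        pending.addLast(new Witness(file, length, offset, nowMillis + timeoutMillis));
    }

    // Periodic check. fileLength stands in for an HDFS length lookup.
    // Returns false when the caller should stop its adaptors and resume
    // from checkpoint().
    boolean poll(ToLongFunction<String> fileLength, long nowMillis) {
        while (!pending.isEmpty()) {
            Witness w = pending.peekFirst();
            if (fileLength.applyAsLong(w.file) > w.length) {
                checkpoint = w.offset;   // committed: advance the checkpoint
                pending.removeFirst();
            } else if (nowMillis >= w.deadline) {
                pending.clear();         // timed out: rewind to last checkpoint
                return false;
            } else {
                break;                   // oldest witness still pending
            }
        }
        return true;
    }

    long checkpoint() {
        return checkpoint;
    }
}
```

Passing the length lookup in as a function keeps the sketch independent of 
any particular HDFS client call and makes the timeout path easy to exercise.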

This is much easier to implement if we can assume a few things:
1) A single collector will commit data from a single Agent in order.
2) A single agent won't fail over to a new collector unless the previous 
collector failed; therefore, even if writes are split across collectors, we're 
still guaranteed commit-in-order.
3) Collector failures are rare, and therefore agents don't need to update their 
checkpoints all that often, and can safely rewind several minutes in the event 
of failure.

All these assumptions are currently true; I just want to document them and 
explain clearly that they can't be violated without breaking the reliability 
mechanism.

> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, 
> quite, since we don't handle collector crashes.  Here's a proposed 
> reliability mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
