[ 
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739751#action_12739751
 ] 

Ari Rabkin commented on CHUKWA-369:
-----------------------------------

@Jerome:

The proposal is as follows:
1) In response to a PUT, the collector returns the filename and position in the 
sink file where the data will be written, if it gets written. Since files have 
exactly one writer, we're guaranteed that no other writer can write to that 
offset. And if the write succeeds, it'll be the write corresponding to that PUT.
2) Some minutes later, the agent asks a collector, any collector, how long the 
indicated sink file (or corresponding .done file) is.  If the file is longer than 
the position returned in step 1, the write succeeded.
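
To make steps 1 and 2 concrete, here is a rough Python sketch. It is not 
Chukwa code: SinkStore, Collector, put, file_length and write_succeeded are 
hypothetical names, the shared file system is faked with a dict, and mapping 
a sink file to its .done file by swapping the suffix is an assumption.

```python
class SinkStore:
    """Stand-in for the shared file system that holds sink files."""

    def __init__(self):
        self.files = {}  # file name -> bytearray of contents

    def length(self, name):
        # The .chukwa file may have been renamed to .done since the PUT.
        # Swapping the suffix is an assumed naming convention.
        done = name
        if name.endswith(".chukwa"):
            done = name[: -len(".chukwa")] + ".done"
        for candidate in (name, done):
            if candidate in self.files:
                return len(self.files[candidate])
        return 0


class Collector:
    """Toy collector; it is the only writer of its own sink file."""

    def __init__(self, store, sink_name):
        self.store = store
        self.sink_name = sink_name
        store.files.setdefault(sink_name, bytearray())

    def put(self, data, crash_before_write=False):
        """Step 1: return (file name, offset) where data lands if written."""
        offset = len(self.store.files[self.sink_name])
        if not crash_before_write:  # flag simulates a collector crash
            self.store.files[self.sink_name].extend(data)
        return self.sink_name, offset

    def file_length(self, name):
        # Step 2: any collector can answer, since sink files are shared.
        return self.store.length(name)


def write_succeeded(reported_length, offset):
    # The file has grown past the offset promised in step 1.
    return reported_length > offset
```

An agent would keep the (file name, offset) pair from put(), wait, then ask 
any collector for file_length() and feed the answer to write_succeeded(). 
If the check fails, the agent knows it has to resend.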

There's one small wrinkle.  
2a) If a .done was created, and then removed by demux or archiving, collectors 
should continue to show it as having been written.  There are a couple of ways 
to do this. For instance, collectors could also look in the archive input and 
output dirs to see if the .done file is there.  They could also remember the 
.dones they saw previously, on the assumption that if a file ever existed, it's 
somewhere in the processing pipeline and the data is safe.
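
A minimal sketch of 2a, again in Python and again hypothetical: 
DoneFileTracker is not an existing Chukwa class, and the list of directories 
to search (sink, archive input, archive output) is supplied by the caller 
rather than taken from any real configuration.

```python
from pathlib import Path


class DoneFileTracker:
    """Remembers the length of every .done file it has ever seen."""

    def __init__(self, search_dirs):
        # e.g. the sink dir plus the archive input and output dirs
        self.search_dirs = [Path(d) for d in search_dirs]
        self.seen = {}  # file name -> largest length observed so far

    def length(self, done_name):
        for directory in self.search_dirs:
            path = directory / done_name
            if path.is_file():
                size = path.stat().st_size
                self.seen[done_name] = max(size, self.seen.get(done_name, 0))
                return self.seen[done_name]
        # Not on disk any more: fall back to what was seen earlier, on the
        # assumption that the data is somewhere in the processing pipeline.
        return self.seen.get(done_name, 0)
```

The remembered lengths live only in memory here, so a collector that 
restarts would forget them. Checking the archive dirs covers part of that gap.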

Furthermore, if we go this route, we really ought to do something about 
"marooned" .chukwa files.  Right now, if a collector crashes or is stopped, it 
leaves a .chukwa file in the sink. These files never get processed and 
never get deleted.  Some other collector ought to rename such a file and make 
it available for processing.  This is probably a good thing in general, but not 
actually required for the reliability mechanism I'm proposing.
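
One way the cleanup could look, sketched in Python. The function name 
recover_marooned and the idle-time threshold are assumptions of mine: the 
idea is to treat a .chukwa file as marooned once nobody has written to it 
for a while, which is only a heuristic for "its collector is gone".

```python
import time
from pathlib import Path


def recover_marooned(sink_dir, max_idle_seconds, now=None):
    """Rename stale .chukwa files to .done; return the new paths."""
    now = time.time() if now is None else now
    renamed = []
    for path in sorted(Path(sink_dir).glob("*.chukwa")):
        idle = now - path.stat().st_mtime
        if idle > max_idle_seconds:
            # Assumed convention: a .done suffix marks a file as ready
            # for processing.
            target = path.with_suffix(".done")
            path.rename(target)
            renamed.append(target)
    return renamed
```

The threshold has to be comfortably longer than the longest gap between 
writes to a live sink file. Otherwise this would rename a file out from 
under a collector that is still using it.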

> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, 
> quite, since we don't handle collector crashes.  Here's a proposed 
> reliability mechanism.

