[
https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739751#action_12739751
]
Ari Rabkin commented on CHUKWA-369:
-----------------------------------
@Jerome:
The proposal is as follows:
1) In response to a PUT, the collector returns the filename and position in the
sink file where the data will be written, if it gets written. Since each file
has exactly one writer, no other writer can write to that offset. So if the
write succeeds, the data at that offset is the data from that PUT.
2) Some minutes later, the agent asks a collector, any collector, how long the
indicated sink file (or corresponding .done file) is. If it's longer than the
indicated position, the write succeeded.
There's one small wrinkle.
2a) If a .done was created and then removed by demux or archiving, collectors
should continue to show it as having been written. There are a couple of ways
to do this. For instance, collectors could also look in the archive input and
output dirs to see if the .done file is there. They could also remember the
.dones they saw previously, on the assumption that if a file ever existed, it's
somewhere in the processing pipeline and the data is safe.
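To make steps 1, 2 and 2a concrete, here is a rough sketch in Python. It is
not Chukwa code. The Collector class, the dict standing in for the shared
filesystem, and the "x.chukwa" to "x.done" naming are all assumptions made for
illustration. The check follows the wording above (file length greater than
the returned position); a stricter check might compare against the position
plus the size of the data.

```python
def done_name(sink_file):
    # Assumed naming: "x.chukwa" becomes "x.done" once its writer closes it.
    return sink_file[: -len(".chukwa")] + ".done"


class Collector:
    """Toy collector. `fs` is a shared dict of filename -> length in bytes."""

    def __init__(self, fs, sink_file):
        self.fs = fs
        self.sink_file = sink_file  # this collector is the file's only writer
        self.next_offset = 0
        self.seen_done = {}         # .done files observed earlier (item 2a)

    def put(self, data):
        # Step 1: report where the data will land, if it gets written.
        offset = self.next_offset
        self.next_offset += len(data)
        return self.sink_file, offset

    def flush(self):
        # Simulate buffered data actually reaching the filesystem.
        self.fs[self.sink_file] = self.next_offset

    def close(self):
        # Simulate the rename from .chukwa to .done.
        self.flush()
        self.fs[done_name(self.sink_file)] = self.fs.pop(self.sink_file)

    def committed_length(self, sink_file):
        # Step 2: any collector can answer, not just the original writer.
        done = done_name(sink_file)
        if done in self.fs:
            self.seen_done[done] = self.fs[done]  # remember it for later
        if sink_file in self.fs:
            return self.fs[sink_file]
        # Item 2a: fall back to memory if demux or archiving removed the .done.
        return self.seen_done.get(done, 0)


def write_succeeded(collector, sink_file, offset):
    # Stricter variant (assumption): compare against offset + len(data).
    return collector.committed_length(sink_file) > offset


fs = {}
a = Collector(fs, "a.chukwa")
b = Collector(fs, "b.chukwa")
name, offset = a.put(b"hello")
a.close()
print(write_succeeded(b, name, offset))  # b answers on behalf of a
```

Note that a collector that never saw the .done has nothing to remember, which
is why looking in the archive input and output dirs is still useful.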
Furthermore, if we go this route, we really ought to do something about
"marooned" .chukwa files. Right now, if a collector crashes or is stopped, it
leaves a .chukwa file in the sink, and that file never gets processed and
never gets deleted. Some other collector ought to rename it and make it
available for processing. This is probably a good thing in general, but not
actually required for the reliability mechanism I'm proposing.
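As a rough illustration of the marooned-file cleanup, here is a sketch in
Python against a local directory standing in for the sink. The function name,
the idle timeout used to guess that the writer is gone, and the rename to a
.done suffix are all assumptions. How a collector would really decide that
another collector is dead is left open here.

```python
import time
from pathlib import Path


def recover_marooned(sink_dir, own_file=None, max_idle_seconds=600, now=None):
    """Rename stale .chukwa files in sink_dir to .done; return the new names."""
    now = time.time() if now is None else now
    recovered = []
    for path in sorted(Path(sink_dir).glob("*.chukwa")):
        if own_file is not None and path.name == own_file:
            continue  # never touch the file this collector is writing
        if now - path.stat().st_mtime < max_idle_seconds:
            continue  # modified recently, so its writer may still be alive
        target = path.with_suffix(".done")
        path.rename(target)  # make the file available for processing
        recovered.append(target.name)
    return recovered
```

A collector could call this periodically with own_file set to the name of its
current sink file.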
> proposed reliability mechanism
> ------------------------------
>
> Key: CHUKWA-369
> URL: https://issues.apache.org/jira/browse/CHUKWA-369
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: data collection
> Affects Versions: 0.3.0
> Reporter: Ari Rabkin
> Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't,
> quite, since we don't handle collector crashes. Here's a proposed
> reliability mechanism.