My understanding is that Snappy is a block compression scheme.
When using the HDFS sink with Snappy, does one batch of events
correspond to one compressed chunk in the Snappy file?

This is interesting in the face of HDFS failures: if the sink is in the
middle of writing a batch when the HDFS connection fails, then we have a
partially written Snappy file.

If each Flume sink batch corresponds to one Snappy chunk, then only the last
chunk in the Snappy file will be unreadable, and that's OK, since that last
batch will be redelivered to another file. However, if multiple batches end up
in a single Snappy chunk, then the last few batches will be unrecoverable from
the Snappy file, leading to data loss.
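To make the recovery argument concrete, here is a minimal sketch of why a length-prefixed chunked format loses only the final chunk on truncation. This is an illustration only: it uses zlib as a stand-in compressor and a simple 4-byte-length framing of my own invention, not Hadoop's actual Snappy block format, and the function names are hypothetical.

```python
import struct
import zlib

# Illustrative framing: each chunk is a 4-byte big-endian length of the
# compressed payload, followed by that payload. zlib stands in for Snappy.
def write_chunks(batches):
    out = bytearray()
    for batch in batches:
        comp = zlib.compress(batch)
        out += struct.pack(">I", len(comp)) + comp
    return bytes(out)

def read_chunks(data):
    """Decode complete chunks; stop at a truncated or corrupt tail."""
    recovered, pos = [], 0
    while pos + 4 <= len(data):
        (clen,) = struct.unpack_from(">I", data, pos)
        chunk = data[pos + 4 : pos + 4 + clen]
        if len(chunk) < clen:
            break  # truncated final chunk: unreadable, earlier chunks survive
        try:
            recovered.append(zlib.decompress(chunk))
        except zlib.error:
            break  # corrupt chunk: stop here, keep what we have
        pos += 4 + clen
    return recovered

batches = [b"event-batch-1", b"event-batch-2", b"event-batch-3"]
data = write_chunks(batches)
truncated = data[:-5]  # simulate an HDFS write cut off mid-chunk
print(read_chunks(truncated))  # only the last batch is lost
```

If each Flume batch maps to one such chunk, the truncated tail is exactly the batch that gets redelivered; if several batches shared the final chunk, they would all fall inside the unreadable tail.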

-roshan
