I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
and that I write to using append(key, value). Because the write volume is low,
it's not uncommon for my appends to take over a day to finally be flushed to
HDFS (i.e. the new file will sit at 0 bytes for over a day).
Because I am running map/reduce tasks on this data multiple times a day, I
want to "flush" the sequence file so the mapred jobs can pick it up when
they run.
What's the right way to do this? I'm assuming it's a fairly common use
case. Also -- are writes to the sequence files atomic? (e.g. if I am
actively appending to a sequence file, is it always safe to read from that
same file in a mapred job?)
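For reference, my setup looks roughly like the sketch below (the path and the
LongWritable/Text key/value classes are just placeholders, not necessarily what
I actually use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LowVolumeWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path and key/value types for illustration only.
        Path out = new Path("/data/events/current.seq");
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, LongWritable.class, Text.class);

        // Records trickle in slowly, so the file can sit at 0 bytes in HDFS
        // until enough data is buffered or the writer is closed.
        writer.append(new LongWritable(System.currentTimeMillis()), new Text("event"));

        writer.close();
    }
}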

To be clear, I want the flushing to be time-based (controlled explicitly by
the app), not size-based. Will this create waste in HDFS somehow?
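To make the question concrete, something like the following is what I had in
mind: a scheduler that flushes on a fixed interval. This assumes the Writer
exposes hflush()/hsync() (my understanding is that older releases had syncFs()
for a similar purpose, but please correct me if that's wrong):

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.SequenceFile;

public class PeriodicFlusher {
    // Flush the writer on a fixed schedule so readers can see recent appends,
    // regardless of how little data has accumulated.
    public static ScheduledExecutorService start(final SequenceFile.Writer writer) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    // My assumption: hflush() pushes buffered bytes out to HDFS so a
                    // reader opening the file can see them; hsync() would additionally
                    // force them to disk. Older Hadoop versions used syncFs() instead.
                    writer.hflush();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }, 10, 10, TimeUnit.MINUTES);
        return scheduler;
    }
}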

Thanks,
Brian
