Let me rephrase this problem... as stated below, when I start writing to a
SequenceFile from an HDFS client, nothing is visible in HDFS until I've
written 64M of data. This presents three problems: fsck reports the file
system as corrupt until the first block is finally written out, the presence
of the file (without any data) blows up the mapred jobs that try to make use
of it under my input path, and I need to flush roughly every 15 minutes so I
can run mapred over the latest data.

I don't see any programmatic way to force the file to flush in 0.17.2.
Additionally, "dfs.checkpoint.period" does not seem to be obeyed. Does it
not do what I think it does? And what controls the 64M limit, anyway? Is it
"dfs.checkpoint.size" or "dfs.block.size"? Is the buffering happening on the
client, on the data nodes, or in the namenode?
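
To make sure I'm at least looking at the right knobs, here's a trivial
sketch that just prints what my client Configuration resolves these keys to.
The defaults and the comments are my own reading of hadoop-default.xml, so
they may well be wrong:

    // Print what the client-side Configuration resolves these keys to.
    // Defaults below are my reading of hadoop-default.xml.
    import org.apache.hadoop.conf.Configuration;

    public class ShowDfsSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Per-file block size the client requests (default 64M). My guess
            // is this is the boundary at which data first becomes visible.
            System.out.println("dfs.block.size = "
                + conf.getLong("dfs.block.size", 67108864L));
            // Secondary namenode checkpoint interval in seconds (default
            // 3600). As far as I can tell this concerns the namespace edit
            // log, not the flushing of file data.
            System.out.println("dfs.checkpoint.period = "
                + conf.getLong("dfs.checkpoint.period", 3600L));
            // Edit log size that forces a checkpoint regardless of the period.
            System.out.println("dfs.checkpoint.size = "
                + conf.getLong("dfs.checkpoint.size", 67108864L));
        }
    }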

It seems really bad that a SequenceFile, upon creation, is in an unusable
state from the perspective of a mapred job, and also leaves fsck reporting
the file system as corrupt. Surely I must be doing something wrong... but
what? How can I ensure that a SequenceFile is immediately usable (but empty)
on creation, and how can I make it flush on a regular time interval?
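
For concreteness, the shape of what I want is roughly the following. Note
that syncFs() is an assumption on my part -- I don't see anything like it in
0.17.2, so this presumes a later release (or an API I've missed); the path,
key/value types, and data are just illustrative:

    // Rough sketch of a time-based flush. syncFs() is assumed -- it does
    // not appear to exist in 0.17.2, so this presumes a later release that
    // exposes a real flush on SequenceFile.Writer.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PeriodicFlushWriter {
        private static final long FLUSH_INTERVAL_MS = 15 * 60 * 1000L;

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/data/events.seq"), Text.class, Text.class);

            long lastFlush = System.currentTimeMillis();
            while (true) {
                // Stand-in for the real low-volume appends.
                writer.append(new Text("key"), new Text("value"));
                if (System.currentTimeMillis() - lastFlush >= FLUSH_INTERVAL_MS) {
                    writer.syncFs(); // hypothetical in 0.17.2; see note above
                    lastFlush = System.currentTimeMillis();
                }
                Thread.sleep(1000); // simulate low write volume
            }
        }
    }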

Thanks,
Brian


On Thu, Jan 29, 2009 at 4:17 PM, Brian Long <br...@dotspots.com> wrote:

> I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
> and write to using append(key, value). Because the write volume is low,
> it's not uncommon for it to take over a day for my appends to finally be
> flushed to HDFS (i.e. the new file will sit at 0 bytes for over a day).
> Because I am running map/reduce tasks on this data multiple times a day, I
> want to "flush" the sequence file so the mapred jobs can pick it up when
> they run.
> What's the right way to do this? I'm assuming it's a fairly common use
> case. Also -- are writes to the sequence file atomic? (i.e. if I am
> actively appending to a sequence file, is it always safe to read from that
> same file in a mapred job?)
>
> To be clear, I want the flushing to be time-based (controlled explicitly by
> the app), not size-based. Will this create waste in HDFS somehow?
>
> Thanks,
> Brian
>
>
