Toby DiPasquale wrote:
> I have a question about the MapReduce and NDFS implementations. When
> writing records into an NDFS file, how does one make sure that records
> terminate cleanly on block boundaries such that a Map job's input does not
> span multiple physical blocks?

We do not currently guarantee that. A task's input may span multiple blocks. We try to split things into block-sized chunks, but the last few records of a split (up to the first sync mark past the split point) may lie in the next block. So a small amount of I/O may happen over the network, but the vast majority of a task's input is still read locally.
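Roughly, the reader behaves like this sketch (plain Java against a local file, not the actual Nutch/Hadoop code; newline-delimited records stand in for sync marks, and readSplit is just an illustrative name). The rule is that a split owns every record that begins inside its byte range, even if the last one runs into the next block:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitReader {
  // Read the records of the byte-range split [start, start+length).
  public static List<String> readSplit(String path, long start, long length)
      throws IOException {
    List<String> records = new ArrayList<>();
    try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
      if (start != 0) {
        // The partial record at 'start' belongs to the previous split:
        // back up one byte and discard everything up to the next boundary.
        in.seek(start - 1);
        in.readLine();
      }
      long end = start + length;
      // A split owns every record that *begins* before 'end'; the last such
      // record may extend past 'end' into the next block, so the reader
      // keeps going until it reaches the next boundary (a little remote I/O).
      while (in.getFilePointer() < end) {
        String record = in.readLine();
        if (record == null) break;   // end of file
        records.add(record);
      }
    }
    return records;
  }
}

Because the split before you skips its trailing partial record and you skip yours, every record is read by exactly one map task even though records don't line up with block boundaries.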

> It also appears as if NDFS does not have an explicit "record append"
> operation. Is this the case?

Yes.  DFS currently is write-once.
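So adding records to an existing file means writing a new file containing the old data plus the new records and swapping it into place. A rough illustration of that workaround (plain local-file Java, not the DFS API; appendByRewrite is just an illustrative name):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class WriteOnceAppend {
  // With write-once semantics there is no append call, so "appending"
  // means rewriting the whole file with the new records at the end.
  public static void appendByRewrite(Path file, List<String> newRecords)
      throws IOException {
    Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
    List<String> all = new ArrayList<>();
    if (Files.exists(file)) {
      all.addAll(Files.readAllLines(file));  // re-read the existing data
    }
    all.addAll(newRecords);                  // add the new records
    Files.write(tmp, all);                   // write everything in one pass
    Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
  }
}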

Please note that the MapReduce and DFS code has moved from Nutch to the Hadoop project. Such questions are more appropriately asked there.

Doug
