A good way to update a very large MapFile-based dataset is to:
1. Add new entries to SequenceFiles in a dataset.add directory.
2. Run a MapReduce job specifying input directories of both dataset and
dataset.add. If you need to update existing entries, specify a reduce
function that merges existing entries with new entries. Specify
MapFileOutputFormat. Specify dataset.new as the output directory (a
sketch of such a job follows this list).
3. Rename dataset.new to dataset.
4. Use MapFileOutputFormat.getReaders() and
MapFileOutputFormat.getEntry() to randomly access entries in the dataset
with a single read (the indexes are read into memory); see the lookup
sketch below. Or, for batch operations, use MapReduce directly on the
dataset (as an input directory) to generate derivative datasets.
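
A minimal sketch of such an update job, using the org.apache.hadoop.mapred
API and assuming Text keys and values with last-write-wins merge semantics
(the class names are illustrative; dataset, dataset.add and dataset.new are
the directories from the steps above):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class DatasetUpdate {

  // Merge all values that share a key. Here the last value seen simply
  // wins; real merge logic would combine the existing and new entries.
  public static class MergeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      Text merged = new Text();
      while (values.hasNext()) {
        merged.set(values.next());
      }
      output.collect(key, merged);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(DatasetUpdate.class);
    job.setJobName("dataset-update");

    // Read the existing MapFiles and the newly added SequenceFiles
    // together. SequenceFileInputFormat also accepts MapFile directories
    // (it reads their data files), so no MapFileInputFormat is needed.
    FileInputFormat.addInputPath(job, new Path("dataset"));
    FileInputFormat.addInputPath(job, new Path("dataset.add"));
    job.setInputFormat(SequenceFileInputFormat.class);

    // The default identity mapper passes entries through; the reducer
    // merges entries that share a key.
    job.setReducerClass(MergeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Write sorted MapFiles to dataset.new; step 3 then renames it.
    job.setOutputFormat(MapFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("dataset.new"));

    JobClient.runJob(job);
  }
}

Note that the job rewrites the whole dataset: a MapFile's data file must
stay sorted by key, so entries cannot simply be appended in place, which
is why the merge goes through a fresh dataset.new directory.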
This is the way that, e.g., Nutch updates its crawl DB.
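
The single-read lookup from step 4, sketched under the same assumptions
(Text keys and values, the default HashPartitioner, an illustrative key):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class DatasetLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One reader per part file; each reader loads its index into memory,
    // so a lookup costs a single seek and read in the data file.
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, new Path("dataset"), conf);

    // The same partitioner the job used routes the key to the right
    // part file.
    HashPartitioner<Text, Text> partitioner =
        new HashPartitioner<Text, Text>();

    Text key = new Text("some-key");
    Text value = new Text();
    Writable found =
        MapFileOutputFormat.getEntry(readers, partitioner, key, value);
    System.out.println(found == null ? "not found" : key + "\t" + value);

    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}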
Doug
张茂森 wrote:
Hi all:
Now I want to do some operations like ‘update’ or ‘insert’, which can be
described like this:
1. I have a base dataset
2. Every day I will get more data from other places, and then I want to
update or insert this new data into my base dataset.
3. After I’ve read the API docs, I think MapFile is a good way to solve this
problem. As far as I know, I only need to append my new data at the end of the
base dataset and update the MapFile’s index file. Do I understand right?
4. If I am right, I want to know how to do these operations using MapFile.
Firstly, I could only find MapFileOutputFormat and couldn’t find a
MapFileInputFormat, so how do I read the MapFile?
Secondly, how do I update the index and append the data? Do you have any
experience or samples?
Any suggestion would be appreciated.
Thank you!
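
To the “Firstly” question: inside a MapReduce job, SequenceFileInputFormat
reads MapFile directories too, so no MapFileInputFormat is needed; outside
a job, a single MapFile can be opened directly with MapFile.Reader. A
minimal sketch, assuming Text keys and values (the part-file name and key
are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A MapFile is a directory holding a sorted data file plus an index
    // file; the reader loads the index and seeks for random access.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "dataset/part-00000", conf);
    Text key = new Text("some-key");
    Text value = new Text();
    if (reader.get(key, value) != null) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}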