A good way to update a very large MapFile-based dataset is to:
1. Add new entries to SequenceFiles in a dataset.add directory.
2. Run a MapReduce job specifying input directories of both dataset and
dataset.add. If you need to update existing entries, specify a reduce
function that merges existing entries with new entries. Specify
MapFileOutputFormat. Specify dataset.new as the output directory (a
sketch of such a job follows this list).
3. Rename dataset.new to dataset.
4. Use MapFileOutputFormat.getReaders() and
MapFileOutputFormat.getEntry() to randomly access entries in the dataset
with a single read (the indexes are read into memory); see the lookup
sketch below. Or, for batch operations, use MapReduce directly on the
dataset (as an input directory) to generate derivative datasets.
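
A minimal sketch of such an update job, using the org.apache.hadoop.mapred
API and assuming Text keys and values with last-write-wins merge semantics
(the class names are illustrative; dataset, dataset.add and dataset.new are
the directories from the steps above):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class DatasetUpdate {

  // Merge all values that share a key. Here the last value seen simply
  // wins; real merge logic would combine the existing and new entries.
  public static class MergeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      Text merged = new Text();
      while (values.hasNext()) {
        merged.set(values.next());
      }
      output.collect(key, merged);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(DatasetUpdate.class);
    job.setJobName("dataset-update");

    // Read the existing MapFiles and the newly added SequenceFiles
    // together. SequenceFileInputFormat also accepts MapFile directories
    // (it reads their data files), so no MapFileInputFormat is needed.
    FileInputFormat.addInputPath(job, new Path("dataset"));
    FileInputFormat.addInputPath(job, new Path("dataset.add"));
    job.setInputFormat(SequenceFileInputFormat.class);

    // The default identity mapper passes entries through; the reducer
    // merges entries that share a key.
    job.setReducerClass(MergeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Write sorted MapFiles to dataset.new; step 3 then renames it.
    job.setOutputFormat(MapFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("dataset.new"));

    JobClient.runJob(job);
  }
}

Note that the job rewrites the whole dataset: a MapFile's data file must
stay sorted by key, so entries cannot simply be appended in place, which
is why the merge goes through a fresh dataset.new directory.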
This is the way that, e.g., Nutch updates its crawl DB.
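
The single-read lookup from step 4, sketched under the same assumptions
(Text keys and values, the default HashPartitioner, an illustrative key):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class DatasetLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One reader per part file; each reader loads its index into memory,
    // so a lookup costs a single seek and read in the data file.
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, new Path("dataset"), conf);

    // The same partitioner the job used routes the key to the right
    // part file.
    HashPartitioner<Text, Text> partitioner =
        new HashPartitioner<Text, Text>();

    Text key = new Text("some-key");
    Text value = new Text();
    Writable found =
        MapFileOutputFormat.getEntry(readers, partitioner, key, value);
    System.out.println(found == null ? "not found" : key + "\t" + value);

    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}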
Doug
张茂森 wrote:
Hi all:
Now I want to do some operations like ‘update’ or ‘insert’, which can be
described like this:
1. I have a base dataset
2. Every day I will get more data from other places, and then I want to
update or insert this new data into my base dataset.
3. After I’ve read the API docs, I think MapFile is a good way to solve this
problem. As far as I know, I only need to append my new data at the end of the
base dataset and update the MapFile’s index file. Do I understand right?
4. If I am right, I want to know how to do these operations using MapFile.
Firstly, I could only find MapFileOutputFormat and couldn’t find a
MapFileInputFormat, so how do I read the MapFile?
Secondly, how do I update the index and append the data? Do you have any
experience or samples?
Any suggestion would be appreciated.
Thank you!
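
To the “Firstly” question: inside a MapReduce job, SequenceFileInputFormat
reads MapFile directories too, so no MapFileInputFormat is needed; outside
a job, a single MapFile can be opened directly with MapFile.Reader. A
minimal sketch, assuming Text keys and values (the part-file name and key
are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A MapFile is a directory holding a sorted data file plus an index
    // file; the reader loads the index and seeks for random access.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "dataset/part-00000", conf);
    Text key = new Text("some-key");
    Text value = new Text();
    if (reader.get(key, value) != null) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}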