Re: using a set of MapFiles - getting the right partition

2008-03-20 Thread Doug Cutting

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V)

MapFileOutputFormat#getEntry() does this.

Use MapFileOutputFormat#getReaders() to create the readers parameter.

Doug

Chris Dyer wrote:

Hi all--
I would like to have a reducer generate a MapFile so that in later
processes I can look up the values associated with a few keys without
processing an entire sequence file.  However, if I have N reducers, I
will generate N different map files, so to pick the right map file I
will need to use the same partitioner as was used when partitioning
the keys to reducers (the reducer I have running emits one value for
each key it receives and no others).  Should this be done manually, ie
something like readers[partioner.getPartition(...)] or is there
another recommended method?

Eventually, I'm going to migrate to using HBase to store the key/value
pairs (since I'd to take advantage of HBase's ability to cache common
pairs in memory for faster retrieval), but I'm interested in seeing
what the performance is like just using MapFiles.

Thanks,
Chris




using a set of MapFiles - getting the right partition

2008-03-20 Thread Chris Dyer
Hi all--
I would like to have a reducer generate a MapFile so that in later
processes I can look up the values associated with a few keys without
processing an entire sequence file.  However, if I have N reducers, I
will generate N different map files, so to pick the right map file I
will need to use the same partitioner as was used when partitioning
the keys to reducers (the reducer I have running emits one value for
each key it receives and no others).  Should this be done manually, ie
something like readers[partioner.getPartition(...)] or is there
another recommended method?

Eventually, I'm going to migrate to using HBase to store the key/value
pairs (since I'd to take advantage of HBase's ability to cache common
pairs in memory for faster retrieval), but I'm interested in seeing
what the performance is like just using MapFiles.

Thanks,
Chris