I don't see why a HAR archive needs to be involved. You can use a MapFile to create a scannable index over a SequenceFile and do lookups that way.
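To illustrate the MapFile idea without pulling in the Hadoop API: a MapFile pairs a sorted data file with a small index of every Nth key and its byte offset, and a lookup binary-searches the index, seeks, then scans forward a short distance. The sketch below is a pure-Java analogy of that mechanism (a TreeMap stands in for the index file and an in-memory list for the data file); the class and method names are mine, not Hadoop's.

```java
import java.util.*;

public class SortedIndexSketch {
    // Conceptual analogy of a Hadoop MapFile (plain Java, not the real API):
    // a sorted data file plus an index of every Nth key -> position.
    final NavigableMap<String, Integer> index = new TreeMap<>();
    final List<String[]> data = new ArrayList<>();  // sorted (key, value) pairs

    SortedIndexSketch(List<String[]> sortedPairs, int interval) {
        data.addAll(sortedPairs);
        for (int i = 0; i < data.size(); i += interval) {
            index.put(data.get(i)[0], i);  // sample every Nth key into the index
        }
    }

    String get(String key) {
        // Binary-search the index for the nearest preceding sampled key ...
        Map.Entry<String, Integer> e = index.floorEntry(key);
        if (e == null) return null;
        // ... then "seek" there and scan forward until we hit or pass the key.
        for (int i = e.getValue(); i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) return data.get(i)[1];
            if (cmp > 0) break;  // went past: key is absent
        }
        return null;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[]{"a", "1"}, new String[]{"c", "2"},
            new String[]{"m", "3"}, new String[]{"z", "4"});
        SortedIndexSketch idx = new SortedIndexSketch(pairs, 2);
        System.out.println(idx.get("m"));  // prints "3"
        System.out.println(idx.get("q"));  // prints "null" (not present)
    }
}
```

The real MapFile does the forward scan against on-disk records rather than a list, which is why the index can stay small enough to hold in memory.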
But if A is small enough to fit in RAM, then there is a much simpler way: write it out to a file and disseminate it to all mappers via the DistributedCache. They then read the entire A set into a HashSet or other data structure during configure(), before they scan through their slices of B, and emit only the B values whose keys hit in A. This is called a "map-side join." If you don't care about sorted ordering of your results, you can then disable the reducers entirely. Hive already supports this behavior, but you have to explicitly enable map-side joins for each query, because only you know ahead of time that one data set is "small enough."

If your A set doesn't fit in RAM, you'll need to get more creative. One possibility is to do the same thing as above, but instead of reading all of A into memory, use a hash function to squash the keys from A into some bounded amount of RAM. For example, allocate yourself a 256 MB bit vector; for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1, it may match an A key; if it's 0, it definitely does not match an A key. Send each potential match to the reducer, and send all the A keys to the reducer as well, where the precise joining will occur. (Note: this is effectively a Bloom filter.) This will send much less data to each reducer, so you should see better throughput.

- Aaron

On Wed, Feb 11, 2009 at 4:07 PM, Amit Chandel <amitchan...@gmail.com> wrote:
> Are the keys in collection B unique?
>
> If so, I would like to try this approach:
> For each <key, value> of collection B, make a file out of it, with the file
> name given by the MD5 hash of the "key" and the "value" being its content,
> and then store all these files into a HAR archive.
> The HAR archive will create an index for you over the keys.
> Now you can iterate over collection A, get the MD5 hash of each key, and
> look up the corresponding file in the archive (to get the value).
>
> On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ <tbr...@blue.lu> wrote:
>
> > Hi,
> >
> > Let's say the smaller subset has name A. It is a relatively small
> > collection, < 100 000 entries (could also be only 100), with nearly no
> > payload as value.
> > Collection B is a big collection with > 10 000 000 entries (each key of
> > A also exists in collection B), where the value for each key is
> > relatively big (> 100 KB).
> >
> > For all the keys in A, I need to get the corresponding value from B and
> > collect it in the output.
> >
> > - I can do this by reading in both files and, in the reduce step, doing
> > my computations and collecting only those keys which are in both A and
> > B. The map phase, however, will take very long, as all the key/value
> > pairs of collection B need to be sorted (and each key's value is
> > > 100 KB) at the end of the map phase, which is overkill if A is very
> > small.
> >
> > What I would need is an option to somehow compute the intersection
> > first (a mapper over the keys only, then a reduce function based only
> > on the keys and not the corresponding values, which collects the keys I
> > want to take), and then run the map over the full input, filtering the
> > output collector or the input based on the results from that reduce
> > phase.
> >
> > Or is there another faster way? Collection A could be so big that it
> > doesn't fit into memory. I could split collection A up into multiple
> > smaller collections, but that would make it more complicated, so I want
> > to avoid that route. (This is similar to the approach I described
> > above, just manual.)
> >
> > Thanks,
> > Thibaut
> > --
> > View this message in context:
> > http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
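The core of the map-side join Aaron describes above can be sketched in plain Java, outside the MapReduce API (the class and method names here are illustrative, not Hadoop's): each mapper loads the small A key set once, then streams its slice of B and keeps only matching records.

```java
import java.util.*;

public class MapSideJoinSketch {
    // Simulates what each mapper does: load the small A key set once
    // (in Hadoop, from a DistributedCache file during configure()), then
    // stream the mapper's slice of B, keeping only keys that appear in A.
    static List<Map.Entry<String, String>> join(Set<String> aKeys,
                                                List<Map.Entry<String, String>> bSlice) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> rec : bSlice) {
            if (aKeys.contains(rec.getKey())) {
                out.add(rec);  // would be output.collect(key, value) in a real mapper
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("k1", "k3"));
        List<Map.Entry<String, String>> b = Arrays.asList(
            new AbstractMap.SimpleEntry<>("k1", "big-value-1"),
            new AbstractMap.SimpleEntry<>("k2", "big-value-2"),
            new AbstractMap.SimpleEntry<>("k3", "big-value-3"));
        System.out.println(join(a, b).size());  // prints 2: only k1 and k3 survive
    }
}
```

Since each mapper filters independently, no sort or shuffle of B's large values is needed, which is exactly what makes the reducer-free variant possible.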
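Aaron's bit-vector fallback for when A doesn't fit in RAM amounts to a one-hash Bloom filter. A minimal sketch (sizes and hash choice are illustrative; a real 256 MB vector would be a BitSet of roughly 2^31 bits, and production code would use multiple hash functions):

```java
import java.util.*;

public class KeyFilterSketch {
    // Crude one-hash Bloom filter over the A key set. NUM_BITS is kept small
    // for the demo; Aaron's 256 MB suggestion corresponds to ~2^31 bits.
    static final int NUM_BITS = 1 << 20;
    final BitSet bits = new BitSet(NUM_BITS);

    int slot(String key) {
        return Math.floorMod(key.hashCode(), NUM_BITS);  // hash(key) % len(bitvector)
    }

    void addAKey(String aKey) {
        bits.set(slot(aKey));  // mark the slot for every key in A
    }

    boolean mayMatch(String bKey) {
        // true  -> bKey MAY be in A (send it to the reducer to check precisely)
        // false -> bKey is DEFINITELY not in A (drop it in the mapper)
        return bits.get(slot(bKey));
    }

    public static void main(String[] args) {
        KeyFilterSketch filter = new KeyFilterSketch();
        filter.addAKey("user-42");
        filter.addAKey("user-99");

        // Every true A key passes: no false negatives by construction.
        System.out.println(filter.mayMatch("user-42"));  // prints "true"
        // Most non-A keys are rejected outright; the occasional hash collision
        // (false positive) is resolved exactly at the reducer, as Aaron notes.
        System.out.println(filter.mayMatch("user-1234567"));
    }
}
```

The reducer then performs the exact join between the surviving B records and the full A key set, so collisions cost only some extra shuffled data, never wrong output.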