I don't see why a HAR archive needs to be involved. You can use a MapFile to
create a scannable index over a SequenceFile and do lookups that way.

But if A is small enough to fit in RAM, then there is a much simpler way:
Write it out to a file and disseminate it to all mappers via the
DistributedCache. Each mapper then reads the entire A set into a HashSet or
other data structure during configure(), before it scans through its slice
of B, and emits only the B values whose keys hit in A. This is called a
"map-side join." If you don't care about sorted ordering of your results,
you can then disable the reducers entirely.
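The logic is simple enough to sketch outside Hadoop entirely. Here it is in
Python pseudocode (a real job would do the load in the Mapper's configure()
and the filtering in map(); the names below are made up for illustration):

```python
def load_small_set(a_records):
    # Done once per mapper, at configure() time: pull the whole small
    # set A into memory. Only the keys matter here, A has almost no payload.
    return set(a_records)

def map_side_join(a_set, b_records):
    # The map() body: stream over this mapper's slice of B and emit only
    # the pairs whose key appears in A. No sort, no shuffle, no reducer.
    return [(k, v) for (k, v) in b_records if k in a_set]

a = load_small_set(["k1", "k3"])
b = [("k1", "big-value-1"), ("k2", "big-value-2"), ("k3", "big-value-3")]
print(map_side_join(a, b))  # [('k1', 'big-value-1'), ('k3', 'big-value-3')]
```

The point is that the expensive part, sorting the huge B values through the
shuffle, never happens; B is only ever read sequentially.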

Hive already supports this behavior, but you have to explicitly tell it to
use a map-side join for each query, because only you know ahead of time
that one data set is "small enough."
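In Hive that's done with the MAPJOIN hint, naming the table to be loaded
into memory (table and column names here are made up):

```sql
-- Tell Hive to hold small_a in memory and join it map-side against big_b.
SELECT /*+ MAPJOIN(small_a) */ small_a.k, big_b.v
FROM small_a JOIN big_b ON (small_a.k = big_b.k);
```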

If your A set doesn't fit in RAM, you'll need to get more creative. One
possibility is to do the same thing as above, but instead of reading all of
A into memory, use a hash function to squash the keys from A into some
bounded amount of RAM. For example, allocate yourself a 256 MB bitvector;
for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for
each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1,
it may match an A key; if it's 0, it definitely does not match an A key.
Send each potential match to the reducer, and send all the A keys to the
reducer as well, where the precise joining will occur. (Note: this is
effectively a Bloom filter.)
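A minimal sketch of that bitvector, again in Python (the size and hash
choice here are illustrative; a real job would size the vector to its RAM
budget and must use a hash that is stable across machines):

```python
import hashlib

NBITS = 2 ** 20  # illustrative; "256 MB" above would be 2**31 bits

def _h(key):
    # A stable hash, so every mapper computes the same bit for a key.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def build_filter(a_keys):
    # Set one bit per A key: bitvector[hash(key) % NBITS] = 1.
    bits = bytearray(NBITS // 8)
    for k in a_keys:
        i = _h(k) % NBITS
        bits[i // 8] |= 1 << (i % 8)
    return bits

def maybe_in_a(bits, key):
    # True  -> key MAY be in A (hash collisions give false positives).
    # False -> key is DEFINITELY not in A.
    i = _h(key) % NBITS
    return bool(bits[i // 8] & (1 << (i % 8)))

bits = build_filter(["k1", "k3"])
assert maybe_in_a(bits, "k1") and maybe_in_a(bits, "k3")
```

Every A key is guaranteed to test positive; a non-A key tests positive only
on a hash collision, which is why the reducer still does the precise join.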

This sends much less data to each reducer, and you should see better
throughput.

- Aaron


On Wed, Feb 11, 2009 at 4:07 PM, Amit Chandel <amitchan...@gmail.com> wrote:

> Are the keys in collection B unique?
>
> If so, I would like to try this approach:
> For each <key, value> of collection B, make a file out of it with file name
> given by MD5 hash of the "key", and "value" being its content, and then
> store all these files into a HAR archive.
> The HAR archive will create an index for you over the keys.
> Now you can iterate over the collection A, get the MD5 hash of the key, and
> look up in the archive for the file (to get the value).
>
> On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ <tbr...@blue.lu> wrote:
>
> >
> > Hi,
> >
> > Let's say the smaller subset has name A. It is a relatively small
> > collection
> > < 100 000 entries (could also be only 100), with nearly no payload as
> > value.
> > Collection B is a big collection with >10 000 000 entries (Each key of A
> > also exists in the collection B), where the value for each key is
> > relatively
> > big (> 100 KB)
> >
> > For all the keys in A, I need to get the corresponding value from B and
> > collect it in the output.
> >
> >
> > - I can do this by reading in both files, and on the reduce step, do my
> > computations and collect only those which are both in A and B. The map
> > phase, however, will take very long, as all the key/value pairs of
> > collection B need to be sorted (and each key's value is >100 KB) at the
> > end of the map phase, which is overkill if A is very small.
> >
> > What I would need is an option to somehow make the intersection first
> > (a mapper only on keys, then a reduce function based only on keys and
> > not the corresponding values, which collects the keys I want to take),
> > and then running the map input and filtering the output collector or
> > the input based on the results from the reduce phase.
> >
> > Or is there another faster way? Collection A could be so big that it
> > doesn't
> > fit into the memory. I could split collection A up into multiple smaller
> > collections, but that would make it more complicated, so I want to evade
> > that route. (This is similar to the approach I described above, just a
> > manual approach)
> >
> > Thanks,
> > Thibaut
> > --
> > View this message in context:
> >
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>