Hi,

The Bloom filter solution works great, but I still have to copy the data
around sometimes.

I'm still wondering whether I can replace the data associated with each key
with a reference or something similarly small (the values of >100 KB each are
quite big), which I could then use to fetch the actual data in the reduce
step.
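
Roughly what I have in mind is the sketch below (made-up code, not my actual
job: I'm assuming the big values already sit in HDFS files and can be
addressed by a "path:offset:length" string, a format I just invented for
illustration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReferenceJoinSketch {

  // Mapper: instead of shuffling the >100 KB value, emit only a small
  // reference string such as "/data/blobs/part-00007:123456:204800".
  public static class RefMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text reference, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, reference);
    }
  }

  // Reducer: resolve the reference and read the real data only for the
  // small subset of keys that actually make it this far.
  public static class RefReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> refs, Context ctx)
        throws IOException, InterruptedException {
      Configuration conf = ctx.getConfiguration();
      for (Text ref : refs) {
        String[] parts = ref.toString().split(":");
        Path path = new Path(parts[0]);
        long offset = Long.parseLong(parts[1]);
        int length = Integer.parseInt(parts[2]);

        FileSystem fs = path.getFileSystem(conf);
        byte[] buf = new byte[length];
        FSDataInputStream in = fs.open(path);
        try {
          in.readFully(offset, buf);  // fetch the big value only now
        } finally {
          in.close();
        }
        ctx.write(key, new Text(buf));
      }
    }
  }
}

That way only the key plus a short reference would go through the sort, and
the full values would only be read for the small subset that reaches the
reducers.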

In the past I was using HBase to store the associated data (but
unfortunately HBase proved to be very unreliable in my case). I will
probably also start compressing the data in the value store, which should
speed up the sort, since the data there is probably stored uncompressed
(a rough config sketch is below).
Is there something else I could do to speed this process up?
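
For the compression part, one concrete knob I'll probably try first is
Hadoop's map output compression, since that compresses exactly the data that
gets sorted and shuffled. Just a sketch using the old JobConf API, with
GzipCodec picked arbitrarily:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressedShuffleConf {
  public static JobConf configure(JobConf conf) {
    // compress the intermediate map output before it is sorted and shuffled
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
    // optionally compress the final job output as well
    conf.setBoolean("mapred.output.compress", true);
    return conf;
  }
}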

Thanks,
Thibaut