Hi,

The Bloom filter solution works great, but I still have to copy the data around sometimes.
I'm still wondering whether I can replace the data associated with each key with a reference or something similarly small (the values are very large, often more than 100 KB each), and then fetch the actual data in the reduce step. In the past I used HBase to store the associated data, but unfortunately HBase proved to be very unreliable in my case.

I will probably also start compressing the data in the value store, which should speed up sorting (the data there is most likely stored uncompressed right now).

Is there anything else I could do to speed this process up? I have put rough sketches of both ideas below my signature.

Thanks,
Thibaut
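P.S. To make the reference idea more concrete, here is roughly what I have in mind. This is only a sketch against the newer org.apache.hadoop.mapreduce API; it assumes the large records sit in plain text files read via TextInputFormat (so the map input key is the record's byte offset in its file), and the RefMapper/RefReducer names and the tab-separated reference layout are made up for the example.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map side: instead of the >100 KB payload, emit only "file<TAB>offset<TAB>length",
// so the shuffle and sort only move a few dozen bytes per record.
class RefMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String recordKey = line.toString().split("\t", 2)[0];
    // (the Bloom filter membership check on recordKey would go here, as before)
    Path file = ((FileSplit) ctx.getInputSplit()).getPath();
    String ref = file + "\t" + offset.get() + "\t" + line.getLength();
    ctx.write(new Text(recordKey), new Text(ref));
  }
}

// Reduce side: only for the small subset of keys that survives, go back to the
// original file and read the real payload behind the reference.
class RefReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> refs, Context ctx)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(ctx.getConfiguration());
    for (Text ref : refs) {
      String[] parts = ref.toString().split("\t");
      byte[] payload = new byte[Integer.parseInt(parts[2])];
      FSDataInputStream in = fs.open(new Path(parts[0]));
      try {
        in.readFully(Long.parseLong(parts[1]), payload);  // positioned read of just this record
      } finally {
        in.close();
      }
      ctx.write(key, new Text(payload));
    }
  }
}

The shuffle would then only carry the small references, and only the few keys that survive the Bloom filter pay for a random read back into HDFS in the reducer.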
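P.P.S. On the compression side, one thing I could also do in Hadoop itself is to enable compression of the intermediate map output, so the sort and shuffle have less data to write and merge. A minimal sketch with the old-style JobConf API (GzipCodec is just an example; an LZO codec would be faster where it is installed):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
  public static JobConf configure() {
    // Compress the intermediate map output so the sort/shuffle writes,
    // transfers and merges less data.
    JobConf conf = new JobConf(JobSetup.class);
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
    // Equivalent properties: mapred.compress.map.output and
    // mapred.map.output.compression.codec.
    return conf;
  }
}

This only trades CPU for I/O, so it should help most if the values compress well.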