I think that this would be nearly equivalent to the Lucene solution that I mentioned ... good for real-time single-document queries.
I would be very surprised if this were able to out-do the MR version for the all-pairs problem.

On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <[email protected]> wrote:

> you could probably eliminate phase 2 if the output of phase 1 was stored in
> Perfect Hashing table (say using Hypertable). this works by storing a
> fingerprint for each shingle/count pair (a few bits) and organising the hash
> table such that you never get collisions (hence the Perfect Hashing).

--
Ted Dunning, CTO
DeepDyve
