Good point, Toke. Forgot about that. Of course doubling the number of hash algos used to 4 increases the space massively.
On 21 Oct 2010, at 22:51, Toke Eskildsen <t...@statsbiblioteket.dk> wrote: > Mark Harwood [markharw...@yahoo.co.uk]: >> Given a large range of IDs (eg your 300 million) you could constrain >> the number of unique terms using a double-hashing technique e.g. >> Pick a number "n" for the max number of unique terms you'll tolerate >> e.g. 1 million and store 2 terms for every primary key using a >> different hashing function e.g. > >> int hashedKey1=hashFunction1(myKey)%maxNumUniqueTerms. >> int hashedKey2=hashFunction2(myKey)%maxNumUniqueTerms. > >> Then queries to retrieve/delete a record use a search for hashedKey1 >> AND hashedKey2. The probability of having the same collision on two >> different hashing functions is minimal and should return the original record >> only. > > I am sorry, but this won't work. It is a variation of the birthday paradox: > http://en.wikipedia.org/wiki/Birthday_problem > > Assuming the two hash-functions are ideal so that there will be 1M different > values from each after the modulo, the probability for any given pair of > different UIDs having the same hashes is 1/(1M * 1M). That's very low. > Another way to look at it would be to say that there are 1M * 1M possible > values for the aggregated hash function. > > Using the recipe from > http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem > we have > n = 300M > d = 1M * 1M > and the formula 1-((d-1)/d)^(n*(n-1)/2) which gets expanded to > http://www.google.com/search?q=1-(((1000000000000-1)/1000000000000)^(300000000*(300000000-1)/2) > > We see that the probability of a collision is ... 1. Or rather, so close to 1 > that Google's calculator will not show any decimals. Turning the number of > UIDs down to just 3M, we still get the probability 0.988889881 for a > collision. It does not really help to increase the number unique hashes as > there are simply too many possible pairs of UIDs. > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org