Mark Harwood [markharw...@yahoo.co.uk]:
> Given a large range of IDs (eg your 300 million) you could constrain
> the number of unique terms using a double-hashing technique e.g.
> Pick a number "n" for the max number of unique terms you'll tolerate
> e.g. 1 million and store 2 terms for every primary key using a
> different hashing function e.g.
> int hashedKey1=hashFunction1(myKey)%maxNumUniqueTerms.
> int hashedKey2=hashFunction2(myKey)%maxNumUniqueTerms.
> Then queries to retrieve/delete a record use a search for hashedKey1
> AND hashedKey2. The probability of having the same collision on two
> different hashing functions is minimal and should return the original record
> only.

I am sorry, but this won't work. It is a variation of the birthday paradox:
http://en.wikipedia.org/wiki/Birthday_problem

Assuming the two hash functions are ideal, so that each yields 1M different
values after the modulo, the probability of any given pair of different UIDs
having the same pair of hashes is 1/(1M * 1M). That is very low. Another way
to look at it is that there are 1M * 1M possible values for the aggregated
hash function.

Using the recipe from
http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
we have

  n = 300M
  d = 1M * 1M

and the formula

  1-((d-1)/d)^(n*(n-1)/2)

which expands to
http://www.google.com/search?q=1-(((1000000000000-1)/1000000000000)^(300000000*(300000000-1)/2)

We see that the probability of a collision is ... 1. Or rather, so close to 1
that Google's calculator will not show any decimals. Even turning the number
of UIDs down to just 3M, we still get a collision probability of 0.988889881.

It does not really help to increase the number of unique hashes, as there are
simply too many possible pairs of UIDs.
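For anyone who wants to check the numbers without Google's calculator, here
is a quick back-of-the-envelope sketch in Java (throwaway code of my own, not
part of any Lucene API) that evaluates the same formula in log-space so the
intermediate power does not underflow:

  public class BirthdayCollision {
      // Probability of at least one collision when hashing n UIDs into d
      // possible values, i.e. 1-((d-1)/d)^(n*(n-1)/2), evaluated in log-space.
      static double collisionProbability(double n, double d) {
          double pairs = n * (n - 1) / 2;                       // distinct UID pairs
          double logNoCollision = pairs * Math.log1p(-1.0 / d); // log of ((d-1)/d)^pairs
          return -Math.expm1(logNoCollision);                   // 1 - ((d-1)/d)^pairs
      }

      public static void main(String[] args) {
          double d = 1e6 * 1e6;                                 // 1M * 1M aggregated hash values
          System.out.println(collisionProbability(300e6, d));   // prints 1.0
          System.out.println(collisionProbability(3e6, d));     // prints ~0.98889
      }
  }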