From: Mark Harwood [markharw...@yahoo.co.uk]
> Good point, Toke. Forgot about that. Of course doubling the number
> of hash algos used to 4 increases the space massively.

Maybe your hashing idea could work even with collisions?

Using your original two-hash suggestion, we're almost certain to get
collisions. However, we can still uniquely identify the right document, as
the UID is also stored (search for the hashes, iterate over the results and
get the UID for each). When an update is requested for an existing document,
the indexer extracts the UIDs from all the documents that match the hashes.
Then it performs a delete on the hash terms and re-indexes the updated
document along with all the documents that were "false" collisions. As the
number of unique hash values as well as the number of hash functions can be
adjusted, this could be a nicely tweakable performance-vs-space trade-off.
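
To make the flow concrete, here is a rough sketch in Python. It is not written
against the real Lucene API: ToyIndex is a hypothetical in-memory stand-in, and
the names hash_terms, search_by_hashes, delete_by_hashes, update and fetch_body
are all made up for illustration. It assumes that deletes can only be done by
hash term, that each document stores its UID, and that the content of a
document can be fetched again by UID.

```python
import hashlib


class ToyIndex:
    """Hypothetical in-memory stand-in for an index (not the Lucene API).

    Documents can only be found and deleted by their hash terms,
    but each document also stores its full UID.
    """

    def __init__(self, num_buckets=16, num_hashes=2):
        self.num_buckets = num_buckets  # unique values per hash function
        self.num_hashes = num_hashes    # number of hash functions
        self.docs = []                  # entries: {"uid", "hashes", "body"}

    def hash_terms(self, uid):
        # Derive num_hashes small hash values by salting one digest.
        terms = []
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{uid}".encode("utf-8")).digest()
            terms.append(int.from_bytes(digest[:4], "big") % self.num_buckets)
        return tuple(terms)

    def add(self, uid, body):
        self.docs.append(
            {"uid": uid, "hashes": self.hash_terms(uid), "body": body}
        )

    def search_by_hashes(self, hashes):
        return [d for d in self.docs if d["hashes"] == hashes]

    def delete_by_hashes(self, hashes):
        # Removes the target AND every document that collides with it.
        before = len(self.docs)
        self.docs = [d for d in self.docs if d["hashes"] != hashes]
        return before - len(self.docs)


def update(index, uid, new_body, fetch_body):
    """Replace the document `uid`, repairing any false collisions.

    fetch_body(uid) is assumed to return the content of a document
    from outside the index. Returns the UIDs that were re-indexed.
    """
    hashes = index.hash_terms(uid)
    # 1. Search for the hashes and read the stored UID of each hit.
    hits = index.search_by_hashes(hashes)
    false_collisions = [d["uid"] for d in hits if d["uid"] != uid]
    # 2. Delete on the hash terms; this also drops the false collisions.
    index.delete_by_hashes(hashes)
    # 3. Re-index the collateral damage, then add the new version.
    for other in false_collisions:
        index.add(other, fetch_body(other))
    index.add(uid, new_body)
    return false_collisions
```

Raising num_buckets or num_hashes makes false collisions rarer, so fewer
documents need re-indexing per update, at the cost of more distinct terms in
the index. Setting num_buckets=1 forces every document to collide, which is
handy for trying out the repair step.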

This will only work if it is possible to re-create the documents from stored 
terms or by requesting the data from outside of Lucene by UID. Is this possible 
with your setup, eks dev?
