RE: Refactoring Lucene to Variable-Width DocIds

Uwe Schindler Tue, 09 Jul 2013 23:37:37 -0700

Hi Ed,


you can have your own doc IDs (and in fact you must). Solr has them, 
ElasticSearch has them. They are implemented as fields in the index, just as 
in a relational database you define one column as the primary key.

 

Lucene’s docids are completely internal and not even stable in the long term. 
They can change at any time, so you cannot use them outside of Lucene. As 
said before: see them as something like a row number, an array index, or the 
internal row number of a relational database! This is done for performance and 
to support the nature of the underlying algorithms (which work “iterator-based” 
and store only the differences between doc IDs; see “packed ints” in the Lucene docs).
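
To make the "store only differences" point concrete, here is a minimal sketch 
of gap (delta) encoding for a sorted list of doc IDs. It is plain Java, not 
Lucene code; the class name DeltaPostings is made up for illustration and the 
layout is not Lucene's actual on-disk format.

```java
// Toy delta codec for a sorted postings list; NOT Lucene's real format.
final class DeltaPostings {
    // Replace each doc ID with its gap from the previous one
    // (the first gap is measured from 0).
    static int[] encode(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    // Rebuild the doc IDs by summing the gaps in order, the way a
    // forward-only iterator would consume them.
    static int[] decode(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int current = 0;
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i];
            docIds[i] = current;
        }
        return docIds;
    }
}
```

For the doc IDs {3, 7, 8, 1000, 1002} the gaps are {3, 4, 1, 992, 2}. Small 
gaps need few bits, which is why dense, sequential integer IDs compress well 
and why decoding is naturally sequential.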

 

Lucene has everything needed to support your own custom ID scheme: you can define one 
field as the “primary key” and use it. This primary key is indexed and can also be 
placed in a docvalues field for random access. Code using Lucene should refer to 
documents only by this custom primary key. The internal integers are to be used only 
*inside* the Lucene API and are not stable at all.
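
As a rough illustration of why callers should hold on to the primary key 
rather than the internal integer, here is a toy sketch in plain Java. It does 
not use the Lucene API; ToyKeyedIndex and its methods are hypothetical, and 
the immediate compaction on delete is a simplification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index: the list position plays the role of the internal doc ID.
final class ToyKeyedIndex {
    private final List<String> keys = new ArrayList<>();         // position = internal ID
    private final Map<String, Integer> byKey = new HashMap<>();  // primary key -> current internal ID

    void add(String primaryKey) {
        byKey.put(primaryKey, keys.size());
        keys.add(primaryKey);
    }

    // Delete and compact immediately, renumbering every later document.
    void delete(String primaryKey) {
        Integer id = byKey.remove(primaryKey);
        if (id == null) {
            return;
        }
        keys.remove((int) id); // remove by index, not by object
        for (int i = id; i < keys.size(); i++) {
            byKey.put(keys.get(i), i);
        }
    }

    // Resolve the stable key to whatever the internal ID is right now; -1 if absent.
    int internalId(String primaryKey) {
        return byKey.getOrDefault(primaryKey, -1);
    }
}
```

After adding "doc-a", "doc-b" and "doc-c" and then deleting "doc-b", the 
internal ID of "doc-c" moves from 2 to 1, while a lookup by key still finds it.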

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: [email protected]

 

From: Ed Kohlwey [mailto:[email protected]] 
Sent: Wednesday, July 10, 2013 4:15 AM
To: [email protected]
Subject: Re: Refactoring Lucene to Variable-Width DocIds

 

Yes, I had come to the same conclusion. It seems like there's a lot of interest, 
via the codec API etc., in extending the storage mechanisms for Lucene. I 
think allowing variable-width doc IDs could go a long way in supporting this 
objective and allowing experimentation with interesting new indexing techniques.

 

What would the negatives of variable-width IDs be? The expectations for thread 
safety are pretty well defined, so it is likely that BytesRef (or whatever 
else) object allocation could be kept to a minimum. Integer-based codecs would 
have to translate to/from integers, but this shouldn't be a huge overhead.

 

This could be really beneficial to projects like SolrCloud, allowing them to 
use the partitioning mechanism of the underlying database (Cassandra, HBase, 
etc.) rather than maintaining their own partitioning mechanism.

 

OK, I better understand what you are trying to achieve. Lucene doc IDs
are just a convenient way to communicate between internal APIs
such as the inverted index and stored fields, but they are stored
nowhere. Conceptually, they are like an index into an array: they
uniquely identify an element but aren't stored anywhere themselves;
you can think of the internal APIs as parallel arrays. And they are
transient: when segments are merged, doc IDs change.
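
The parallel-array picture can be sketched in a few lines of plain Java.
This is only an illustration under simplifying assumptions: ToySegment and
its fields are invented names, and real segments and merges are far more
involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy segment: parallel lists indexed by the same doc ID.
final class ToySegment {
    final List<String> titles = new ArrayList<>();   // stands in for stored fields
    final List<Integer> lengths = new ArrayList<>(); // stands in for another per-document structure
    final List<Boolean> live = new ArrayList<>();    // false = deleted

    // The doc ID is never stored; it is just the position in each list.
    int add(String title) {
        titles.add(title);
        lengths.add(title.length());
        live.add(true);
        return titles.size() - 1;
    }

    void delete(int docId) {
        live.set(docId, false);
    }

    // Merge two segments: copy live documents in order, which assigns fresh doc IDs.
    static ToySegment merge(ToySegment first, ToySegment second) {
        ToySegment merged = new ToySegment();
        for (ToySegment segment : new ToySegment[] {first, second}) {
            for (int docId = 0; docId < segment.titles.size(); docId++) {
                if (segment.live.get(docId)) {
                    merged.add(segment.titles.get(docId));
                }
            }
        }
        return merged;
    }
}
```

If "alpha" (deleted) and "beta" live in one segment and "gamma" in another,
"beta" has doc ID 1 and "gamma" has doc ID 0 before the merge; afterwards
"beta" is 0 and "gamma" is 1.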

I don't think it is possible to write a reasonably fast Lucene view on
top of the existing index of your database: you would have to keep
translating the database doc IDs to Lucene doc IDs, and this would
likely kill performance.
