Re: Refactoring Lucene to Variable-Width DocIds

Ed Kohlwey Tue, 09 Jul 2013 19:15:47 -0700

Yes, I had come to the same conclusion. There seems to be a lot of
interest, via the codec API and similar efforts, in extending Lucene's
storage mechanisms. I think allowing variable-width doc IDs could go a
long way toward supporting this objective and allowing experimentation
with interesting new indexing techniques.


What would the negatives of variable-width IDs be? The expectations for
thread safety are pretty well defined, so it is likely that BytesRef
(or whatever else) object allocation could be kept to a minimum.
Integer-based codecs would have to translate to/from integers, but this
shouldn't be a huge overhead.
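
A minimal sketch of the to/from-integer translation mentioned above,
assuming a variable-width ID is just a big-endian byte string. The class
DocIdBytes and its methods are hypothetical and not part of Lucene; a
plain byte[] stands in for BytesRef so the example is self-contained.

```java
// Hypothetical helper, NOT a Lucene API: illustrates the int <-> bytes
// translation an integer-based codec might perform if doc IDs were
// variable-width byte strings.
final class DocIdBytes {

    // Encode a non-negative int doc ID as the shortest big-endian byte string.
    static byte[] encode(int docId) {
        if (docId < 0) {
            throw new IllegalArgumentException("doc ID must be non-negative");
        }
        int width = 1;
        while (width < 4 && (docId >>> (8 * width)) != 0) {
            width++;
        }
        byte[] out = new byte[width];
        for (int i = 0; i < width; i++) {
            out[width - 1 - i] = (byte) (docId >>> (8 * i));
        }
        return out;
    }

    // Decode a 1..4 byte big-endian ID back into an int doc ID.
    static int decode(byte[] id) {
        if (id.length == 0 || id.length > 4) {
            throw new IllegalArgumentException("ID does not fit in an int");
        }
        int value = 0;
        for (byte b : id) {
            value = (value << 8) | (b & 0xFF);
        }
        if (value < 0) {
            throw new IllegalArgumentException("ID does not fit in a non-negative int");
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] id = encode(256);
        System.out.println(id.length + " bytes, decodes to " + decode(id));
    }
}
```

The per-call cost here is a few shifts and one small array; reusing a
buffer per thread would avoid even that allocation.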

This could be really beneficial to projects like SolrCloud, allowing them
to use the partitioning mechanism of the underlying database (Cassandra,
HBase, etc.) rather than maintaining their own partitioning mechanism.

> OK, I better understand what you are trying to achieve. Lucene doc IDs
> are just a convenient way to communicate between internal APIs
> such as the inverted index and stored fields, but they are stored
> nowhere. Conceptually, they are like an index in an array: they
> uniquely identify an element in an array but aren't stored anywhere.
> You can think of the internal APIs as parallel arrays. And they are
> transient: when segments are merged, doc IDs change.
>
> I don't think it is possible to write a reasonably fast Lucene view on
> top of the existing index of your database, you would have to keep on
> translating the database doc IDs to Lucene doc IDs and this would
> likely kill performance.
>
>
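
To make the parallel-array picture in the quoted text concrete, here is a
toy model. It is not Lucene code: ToySegment, its fields, and merge() are
all hypothetical names, and real segment merging is far more involved.
The point is only that a doc ID is a position, so dropping deleted
documents during a merge renumbers everything after them.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Lucene code: a "segment" is a set of parallel lists
// indexed by a doc ID that is never written anywhere.
final class ToySegment {
    final List<String> storedFields = new ArrayList<>(); // storedFields.get(docId)
    final List<Integer> termCounts = new ArrayList<>();  // termCounts.get(docId)
    final List<Boolean> live = new ArrayList<>();        // live.get(docId)

    // Append a document; its doc ID is simply its position in the lists.
    int add(String stored, int termCount) {
        storedFields.add(stored);
        termCounts.add(termCount);
        live.add(true);
        return storedFields.size() - 1;
    }

    // Mark a document as deleted without shifting the other doc IDs.
    void delete(int docId) {
        live.set(docId, false);
    }

    // Merge two segments, skipping deleted docs; survivors get new doc IDs.
    static ToySegment merge(ToySegment a, ToySegment b) {
        ToySegment merged = new ToySegment();
        for (ToySegment s : new ToySegment[] {a, b}) {
            for (int docId = 0; docId < s.storedFields.size(); docId++) {
                if (s.live.get(docId)) {
                    merged.add(s.storedFields.get(docId), s.termCounts.get(docId));
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ToySegment a = new ToySegment();
        a.add("doc-x", 3);
        int y = a.add("doc-y", 5);
        a.delete(0);
        ToySegment b = new ToySegment();
        b.add("doc-z", 7);
        ToySegment m = merge(a, b);
        System.out.println("doc-y: " + y + " -> " + m.storedFields.indexOf("doc-y"));
    }
}
```

Any external system that keyed its own records on the old doc IDs would
have to re-map them after every merge, which is the translation cost the
quoted reply warns about.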
