Yes, I had come to the same conclusion. There seems to be a lot of interest in extending Lucene's storage mechanisms, via the codec API and similar extension points. I think allowing variable-width doc IDs could go a long way toward supporting this objective and would allow experimentation with interesting new indexing techniques.
What would the negatives of variable-width IDs be? The expectations for thread safety are pretty well defined, so it is likely that BytesRef (or whatever else) object allocation could be kept to a minimum. Integer-based codecs would have to translate to and from integers, but this shouldn't be a huge overhead.

This could be really beneficial to projects like SolrCloud, allowing them to use the partitioning mechanism of the underlying database (Cassandra, HBase, etc.) rather than maintaining their own partitioning mechanism.

> OK, I better understand what you are trying to achieve. Lucene doc IDs
> are just a convenient way to communicate between internal APIs
> such as the inverted index and stored fields, but they are stored
> nowhere. Conceptually, they are like an index in an array: they
> uniquely identify an element in an array but aren't stored anywhere.
> You can think of the internal APIs as parallel arrays. And they are
> transient: when segments are merged, doc IDs change.
>
> I don't think it is possible to write a reasonably fast Lucene view on
> top of the existing index of your database. You would have to keep on
> translating the database doc IDs to Lucene doc IDs, and this would
> likely kill performance.
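To make the parallel-array picture from the quoted message concrete, here is a minimal sketch in plain Java. It is not Lucene code: `DocIdSketch`, `Segment`, `merge` and `docIdOf` are hypothetical names invented for illustration, and the linear scan in `docIdOf` only stands in for whatever lookup a real translation layer would use. It assumes a merge that drops deleted documents and appends the survivors in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not Lucene code.
// A "segment" is modelled as parallel arrays; the doc ID is just the array index.
final class DocIdSketch {

    static final class Segment {
        final String[] externalKeys; // e.g. a database row key kept per document
        final String[] bodies;       // stand-in for any other per-document data
        final boolean[] deleted;     // deletion markers, also indexed by doc ID

        Segment(String[] externalKeys, String[] bodies, boolean[] deleted) {
            this.externalKeys = externalKeys;
            this.bodies = bodies;
            this.deleted = deleted;
        }

        int maxDoc() {
            return externalKeys.length;
        }
    }

    // Merging drops deleted documents and appends the survivors in order,
    // so surviving documents are renumbered.
    static Segment merge(Segment... segments) {
        List<String> keys = new ArrayList<>();
        List<String> bodies = new ArrayList<>();
        for (Segment s : segments) {
            for (int docId = 0; docId < s.maxDoc(); docId++) {
                if (!s.deleted[docId]) {
                    keys.add(s.externalKeys[docId]);
                    bodies.add(s.bodies[docId]);
                }
            }
        }
        return new Segment(
                keys.toArray(new String[0]),
                bodies.toArray(new String[0]),
                new boolean[keys.size()]);
    }

    // Translating an external key to a doc ID: the step that has to be
    // repeated whenever the two ID spaces differ. Returns -1 if absent.
    static int docIdOf(Segment s, String key) {
        for (int docId = 0; docId < s.maxDoc(); docId++) {
            if (!s.deleted[docId] && s.externalKeys[docId].equals(key)) {
                return docId;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Segment a = new Segment(
                new String[] {"row-a", "row-b"},
                new String[] {"first", "second"},
                new boolean[] {true, false}); // row-a has been deleted
        Segment b = new Segment(
                new String[] {"row-c"},
                new String[] {"third"},
                new boolean[] {false});

        System.out.println("before merge: " + docIdOf(a, "row-b"));
        System.out.println("after merge:  " + docIdOf(merge(a, b), "row-b"));
    }
}
```

In this toy model the document keyed `row-b` has doc ID 1 before the merge and doc ID 0 afterwards, which is why an external system cannot hold on to doc IDs and has to go through a key-to-ID translation step instead.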
