There are common indexing design patterns on top of some of these databases where a deterministic hash is used as the primary key (to avoid locking on an ID incrementer, for instance, and to make insert operations idempotent). Coming up with a meaningful way to derive a 4-byte docID in such a shared-nothing architecture is difficult. In the example I provided, there are 12 more bytes of keyspace, so obvious methods like translating a portion of the keyspace into a shard don't make sense, as there would be 2^96 possible shards.
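To make the pattern concrete, here is a minimal sketch of what I mean. It assumes the 16-byte key is the first 16 bytes of a SHA-256 digest and uses an in-memory dict as a stand-in for the database; `derive_key` and `idempotent_insert` are hypothetical names for illustration, not part of any real store's API.

```python
import hashlib

KEY_BYTES = 16     # width of the hashed primary key (assumed for this sketch)
DOC_ID_BYTES = 4   # width of a Lucene internal doc ID (a Java int)


def derive_key(content: bytes) -> bytes:
    """Return a deterministic 16-byte key: same content, same key."""
    return hashlib.sha256(content).digest()[:KEY_BYTES]


def idempotent_insert(store: dict, content: bytes) -> bytes:
    """Write content under its hashed key; no shared counter is needed."""
    key = derive_key(content)
    store[key] = content  # replaying the same insert overwrites with identical data
    return key


store = {}
k1 = idempotent_insert(store, b"some document")
k2 = idempotent_insert(store, b"some document")  # replayed insert, no new row

# If 4 bytes of the key were used as the doc ID, the remaining bytes would
# have to identify the shard, which is where the 2^96 figure comes from.
remaining_bits = (KEY_BYTES - DOC_ID_BYTES) * 8
possible_shards = 2 ** remaining_bits
```

Any node can compute the key on its own, which is what makes the design shared-nothing, but nothing in the key gives a dense, sequential 4-byte number.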
I haven't thought of a good way to combine the Lucene design with this design pattern; the easiest approach, to me, seems to be allowing variable-width doc IDs. The problem I'm really trying to solve is that I'd like to implement Lucene search over this index storage method. Conforming this index pattern to the codec/reader API seems like a good way to do that, but the 4-byte doc ID access makes it difficult.

On Fri, Jul 5, 2013 at 4:17 PM, Adrien Grand <[email protected]> wrote:
> Hi,
>
> Lucene heavily relies on the fact that the internal doc IDs are dense
> and sequential. This is at the core of Lucene's design and is the key
> to compact postings lists and easily addressable doc values, stored
> fields, etc... Is there a specific reason why you don't want to handle
> these 16-bytes identifiers on top of the Lucene index (as a standard
> field)?
>
> --
> Adrien
