Re: Refactoring Lucene to Variable-Width DocIds

Ed Kohlwey Tue, 09 Jul 2013 13:36:59 -0700

There are common indexing design patterns on top of some of these databases
where a deterministic hash is used as the primary key (in order to avoid
locking on an id incrementer, for instance, and to make insert operations
idempotent). Coming up with a meaningful way to derive a 4-byte docID in
such a shared-nothing architecture is difficult. In the example I provided,
the key is 16 bytes, so 12 more bytes of keyspace remain after taking 4 for
the docID. Obvious methods like translating that remaining portion of the
keyspace into a shard don't make sense, as there would be 2^96 possible
shards.
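
To make the arithmetic concrete, here is a minimal sketch of the pattern.
It assumes MD5 as the hash purely because it yields 16 bytes; the actual
hash function and the helper names (primary_key, split_key) are
hypothetical and not taken from any particular database.

```python
import hashlib

KEY_BYTES = 16    # width of the hashed primary key in the example discussed
DOCID_BYTES = 4   # width of a Lucene docID (a Java int)


def primary_key(record: bytes) -> bytes:
    """Deterministic key: the same record always hashes to the same 16 bytes,
    so re-inserting it is idempotent and no shared counter is needed.
    MD5 is an assumption for illustration only."""
    return hashlib.md5(record).digest()


def split_key(key: bytes):
    """Take the first 4 bytes as a candidate docID and return the leftover
    bytes that would have to be mapped onto shards."""
    candidate = int.from_bytes(key[:DOCID_BYTES], "big")
    remainder = key[DOCID_BYTES:]
    return candidate, remainder


key = primary_key(b"example record")
candidate, remainder = split_key(key)

# 12 leftover bytes = 96 bits, hence 2**96 distinct shard values.
remaining_bits = len(remainder) * 8
possible_shards = 2 ** remaining_bits
```

Note that the 4-byte candidate is neither dense nor sequential, so even
ignoring the shard count it would not behave like a regular docID.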

I haven't thought of a good way to combine the Lucene design with this
design pattern; the easiest method, to me, seems to be allowing
variable-width docIds.

I guess the problem that I'm really trying to solve is that I'd like to
implement Lucene search over this index storage method. Conforming this
index pattern to the codec/reader API seems like a good way to do so, but
the 4-byte docID access makes that difficult.

On Fri, Jul 5, 2013 at 4:17 PM, Adrien Grand <[email protected]> wrote:

> Hi,
>
> Lucene heavily relies on the fact that the internal doc IDs are dense
> and sequential. This is at the core of Lucene's design and is the key
> to compact postings lists and easily addressable doc values, stored
> fields, etc... Is there a specific reason why you don't want to handle
> these 16-bytes identifiers on top of the Lucene index (as a standard
> field)?
>
> --
> Adrien
