Re: Storing an int–int map in Lucene

Adrien Grand Tue, 07 Jan 2025 23:42:20 -0800

Hi Vigya,

Getting vectors for a given doc ID sounds similar to what sorted numeric
doc values do. They store an `addresses` DirectMonotonicReader that stores
the cumulative number of values before the current doc. Reusing your
example, this would give:
    addresses = [ 0, 3, 5, 9 ]
The first ordinal for doc ID `D` is `addresses[D]` and the number of values
of this doc ID is `addresses[D+1] - addresses[D]`. So this should address
your requirements (2) and (3).


To get the doc ID given an ordinal (requirement (1)), you could either
perform a binary search on this `addresses` structure (like stored fields
do to go from doc ID to block ID), or store an explicit ordToDoc mapping
like your current prototype does. The former is more storage-efficient, the
latter is likely faster.

I hope this helps?

On Wed, Jan 8, 2025 at 7:08 AM Vigya Sharma <[email protected]> wrote:

> Hello Lucene Devs,
>
> What is a good way to store an int – int map on disk in Lucene? Keys are
> not guaranteed to be a continuous sequence of integers. I can sort the map
> by keys before writing if that helps, but the values won't always be in
> increasing order.
> I'd like to be able to efficiently do reader.get(key);
>
> I looked into DirectMonotonicReader/Writer, but it'll need keys to be in
> the range [0,n) and values to be monotonically increasing. Are there other
> optimized internal storage structures in Lucene that I can use?
>
> ...
>
> Context: I've been dabbling with different ways to implement multi-valued
> vectors in the flat storage format. In my current prototype (i'm at version
> 4 now), each vector value gets a unique int ordinal in the graph, and a
> single document can map to multiple vector ordinals. I want to be able to
> 1) get the docId for a given ordinal, 2) get the first ordinal (called
> baseOrdinal) for a document, and 3) get the number of vector values for a
> document.
>
> All vectors of a document are written together and get a continuous subset
> of ordinals. If I know the first ordinal for a document, and the number of
> vector values, I can get all ordinals and their vector values from flat
> storage. This would omit the need for a separate child query and
> parent-block join approach we need today for multi-valued vector fields.
>
> In my currently hacky approach – I use a direct monotonic writer and write
> docId, baseOrdinal, and nextBaseOrdinal values for every ordinal, repeating
> them where necessary.
> For e.g., If I had documents {d1 -> 3 vectors, d2 -> 2 vectors, d3 -> 4
> vectors}, I would write something like:
>     ordToDoc = [d1, d1, d1, d2, d2, d3, d3, d3, d3]
>     baseOrd = [0, 0, 0, 3, 3, 5, 5, 5, 5]
>     nextBaseOrd = [3, 3, 3, 5, 5, 9, 9, 9, 9]
>
> DirectMonotonicReader lets me directly do reader.get(ordinal) for docId,
> baseOrd, or nextBaseOrd. And I can compute vector count per doc via
> nextBaseOrd.get(ord) - baseOrd.get(ord).
>
> I think (hope) DirectMonotonicWriter has clever ways to pack these docs
> efficiently. But I'm wary about the potential redundancy here, and
> wondering if there are other efficient storage structures I could leverage.
>
> Thank you,
> -Vigya
>


-- 
Adrien

Re: Storing an int–int map in Lucene

Reply via email to