Storing an int–int map in Lucene

Vigya Sharma Tue, 07 Jan 2025 22:08:10 -0800

Hello Lucene Devs,

What is a good way to store an int-to-int map on disk in Lucene? Keys are
not guaranteed to be a contiguous sequence of integers. I can sort the map
by keys before writing if that helps, but the values won't always be in
increasing order.
I'd like to be able to do reader.get(key) efficiently.


I looked into DirectMonotonicReader/Writer, but it'll need keys to be in
the range [0,n) and values to be monotonically increasing. Are there other
optimized internal storage structures in Lucene that I can use?
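
A minimal in-memory sketch of the lookup being asked for, assuming the keys
are sorted before writing. The class SortedIntIntMap and its get(key,
missingValue) method are hypothetical names used only for illustration; they
are not Lucene APIs, and plain int[] arrays stand in for whatever on-disk
structure would actually hold the data.

```java
import java.util.Arrays;

/** Hypothetical in-memory stand-in for an on-disk int-to-int map; not a Lucene API. */
final class SortedIntIntMap {
  private final int[] keys;   // sorted ascending, gaps between keys allowed
  private final int[] values; // values[i] belongs to keys[i]; values may be in any order

  SortedIntIntMap(int[] sortedKeys, int[] values) {
    if (sortedKeys.length != values.length) {
      throw new IllegalArgumentException("keys and values must have the same length");
    }
    this.keys = sortedKeys;
    this.values = values;
  }

  /** Returns the value stored for key, or missingValue if the key is absent. */
  int get(int key, int missingValue) {
    // Binary search over the sorted keys: O(log n) per lookup.
    int idx = Arrays.binarySearch(keys, key);
    return idx >= 0 ? values[idx] : missingValue;
  }
}
```

Sorting only constrains the keys; the values array can hold arbitrary ints,
which matches the requirement that values need not be increasing.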

...

Context: I've been dabbling with different ways to implement multi-valued
vectors in the flat storage format. In my current prototype (I'm at version
4 now), each vector value gets a unique int ordinal in the graph, and a
single document can map to multiple vector ordinals. I want to be able to
1) get the docId for a given ordinal, 2) get the first ordinal (called
baseOrdinal) for a document, and 3) get the number of vector values for a
document.

All vectors of a document are written together and get a contiguous range
of ordinals. If I know the first ordinal for a document, and the number of
vector values, I can get all ordinals and their vector values from flat
storage. This would remove the need for the separate child query and
parent-block join approach that we need today for multi-valued vector fields.

In my current, hacky approach, I use a DirectMonotonicWriter and write
docId, baseOrdinal, and nextBaseOrdinal values for every ordinal, repeating
them where necessary.
For example, if I had documents {d1 -> 3 vectors, d2 -> 2 vectors, d3 -> 4
vectors}, I would write something like:
    ordToDoc = [d1, d1, d1, d2, d2, d3, d3, d3, d3]
    baseOrd = [0, 0, 0, 3, 3, 5, 5, 5, 5]
    nextBaseOrd = [3, 3, 3, 5, 5, 9, 9, 9, 9]

DirectMonotonicReader lets me directly do reader.get(ordinal) for docId,
baseOrd, or nextBaseOrd. And I can compute vector count per doc via
nextBaseOrd.get(ord) - baseOrd.get(ord).
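
The layout above can be sketched in plain Java as follows. This is only an
illustration under stated assumptions: PerOrdinalLayout and vectorCount are
hypothetical names, and plain int[] arrays stand in for the values that would
be written with DirectMonotonicWriter and read back with
DirectMonotonicReader. It assumes documents are supplied in increasing docId
order.

```java
/** Hypothetical helper that builds the three per-ordinal arrays; not a Lucene API. */
final class PerOrdinalLayout {
  final int[] ordToDoc;    // docId for each vector ordinal
  final int[] baseOrd;     // first ordinal of the owning document
  final int[] nextBaseOrd; // first ordinal of the following document

  /** docIds[i] owns vectorCounts[i] vectors; docs are given in increasing docId order. */
  PerOrdinalLayout(int[] docIds, int[] vectorCounts) {
    int total = 0;
    for (int count : vectorCounts) {
      total += count;
    }
    ordToDoc = new int[total];
    baseOrd = new int[total];
    nextBaseOrd = new int[total];

    int ord = 0;
    for (int i = 0; i < docIds.length; i++) {
      int base = ord;
      int next = base + vectorCounts[i];
      // Repeat the same triple for every ordinal that belongs to this document.
      for (; ord < next; ord++) {
        ordToDoc[ord] = docIds[i];
        baseOrd[ord] = base;
        nextBaseOrd[ord] = next;
      }
    }
  }

  /** Number of vector values in the document that owns the given ordinal. */
  int vectorCount(int ord) {
    return nextBaseOrd[ord] - baseOrd[ord];
  }
}
```

With docIds {1, 2, 3} and counts {3, 2, 4}, this reproduces the three arrays
listed above. Each array is non-decreasing, which is what makes a monotonic
encoding applicable, and the repetition per ordinal is the redundancy in
question.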

I think (hope) DirectMonotonicWriter has clever ways to pack these values
efficiently. But I'm wary of the potential redundancy here, and I'm
wondering if there are other efficient storage structures I could leverage.

Thank you,
-Vigya
