Thanks Adrien! I was struggling to make it work for sparse vector values.
One approach is to use it with a DocIndexIterator: use ordToDoc to get the
docId, advance() the DISI to that docId, and then compute
`addresses[disi.index()+1] - addresses[disi.index()]`. But I guess this is
not great for random access via ordinals, since the DISI only moves
forward?
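A rough sketch of what I mean, with plain int[] arrays standing in for the
`addresses` DirectMonotonicReader and a linear scan standing in for the
DISI's advance()/index() (illustrative names only, not real Lucene API):

```java
// Illustrative sketch only: arrays stand in for DirectMonotonicReader,
// and indexOf() stands in for disi.advance(docId) + disi.index().
class DisiCountSketch {
    // docs that have vectors, in increasing docId order (the DISI's stream)
    static final int[] docsWithVectors = {1, 4, 7};
    // cumulative vector counts: addresses[i+1] - addresses[i] is the
    // number of vectors of the i-th doc in the stream
    static final int[] addresses = {0, 3, 5, 9};

    static int indexOf(int docId) {
        for (int i = 0; i < docsWithVectors.length; i++) {
            if (docsWithVectors[i] == docId) {
                return i;
            }
        }
        throw new IllegalArgumentException("doc has no vectors: " + docId);
    }

    static int vectorCount(int docId) {
        int idx = indexOf(docId); // disi.advance(docId); disi.index()
        return addresses[idx + 1] - addresses[idx];
    }

    public static void main(String[] args) {
        System.out.println(vectorCount(4)); // prints 2
    }
}
```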

I'm not entirely sure I need random access when evaluating all vector
values of a document. It would be used for scoring a query, e.g. by taking
an aggregate score across all of a document's vectors. But VectorScorer
seems to work off of a DISI anyway.

It'll get clearer as I implement more pieces. But this is good to get me
going for now. Thank you for your help.

On Tue, Jan 7, 2025 at 11:42 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hi Vigya,
>
> Getting vectors for a given doc ID sounds similar to what sorted numeric
> doc values do. They use an `addresses` DirectMonotonicReader that records
> the cumulative number of values before each doc. Reusing your
> example, this would give:
>     addresses = [ 0, 3, 5, 9 ]
> The first ordinal for doc ID `D` is `addresses[D]` and the number of
> values of this doc ID is `addresses[D+1] - addresses[D]`. So this should
> address your requirements (2) and (3).
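>
> A quick sketch of that lookup, with a plain int[] standing in for the
> `addresses` DirectMonotonicReader (and docIds assumed dense for
> simplicity; this is illustrative, not real Lucene API):

```java
// Illustrative sketch: a plain int[] stands in for the addresses
// DirectMonotonicReader; docIds are assumed dense here for simplicity.
class AddressesSketch {
    // cumulative number of values before each doc, from the example
    static final int[] addresses = {0, 3, 5, 9};

    static int firstOrd(int d) {
        return addresses[d];
    }

    static int valueCount(int d) {
        return addresses[d + 1] - addresses[d];
    }

    public static void main(String[] args) {
        System.out.println(firstOrd(1) + " " + valueCount(1)); // prints 3 2
    }
}
```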
>
> To get the doc ID given an ordinal (requirement (1)), you could either
> perform a binary search on this `addresses` structure (like stored fields
> do to go from doc ID to block ID), or store an explicit ordToDoc mapping
> like your current prototype does. The former is more storage-efficient, the
> latter is likely faster.
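>
> The binary-search option would look roughly like this (plain int[] again,
> illustrative only):

```java
// Illustrative sketch: binary search for the doc D such that
// addresses[D] <= ord < addresses[D + 1].
class OrdToDocSketch {
    static final int[] addresses = {0, 3, 5, 9};

    static int docForOrd(int ord) {
        int lo = 0;
        int hi = addresses.length - 2; // last valid doc index
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (addresses[mid] <= ord) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        System.out.println(docForOrd(4)); // prints 1
    }
}
```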
>
> I hope this helps?
>
> On Wed, Jan 8, 2025 at 7:08 AM Vigya Sharma <vigya.w...@gmail.com> wrote:
>
>> Hello Lucene Devs,
>>
>> What is a good way to store an int -> int map on disk in Lucene? Keys are
>> not guaranteed to be a contiguous sequence of integers. I can sort the map
>> by keys before writing if that helps, but the values won't always be in
>> increasing order.
>> I'd like to be able to efficiently do reader.get(key);
>>
>> I looked into DirectMonotonicReader/Writer, but it'll need keys to be in
>> the range [0,n) and values to be monotonically increasing. Are there other
>> optimized internal storage structures in Lucene that I can use?
>>
>> ...
>>
>> Context: I've been dabbling with different ways to implement multi-valued
>> vectors in the flat storage format. In my current prototype (I'm at version
>> 4 now), each vector value gets a unique int ordinal in the graph, and a
>> single document can map to multiple vector ordinals. I want to be able to
>> 1) get the docId for a given ordinal, 2) get the first ordinal (called
>> baseOrdinal) for a document, and 3) get the number of vector values for a
>> document.
>>
>> All vectors of a document are written together and get a contiguous
>> range of ordinals. If I know the first ordinal for a document, and the
>> number of vector values, I can get all ordinals and their vector values
>> from flat storage. This would remove the need for the separate child-query
>> and parent-block-join approach we need today for multi-valued vector fields.
>>
>> In my current (admittedly hacky) approach, I use a direct monotonic writer
>> and write docId, baseOrdinal, and nextBaseOrdinal values for every ordinal,
>> repeating them where necessary.
>> For example, if I had documents {d1 -> 3 vectors, d2 -> 2 vectors, d3 -> 4
>> vectors}, I would write something like:
>>     ordToDoc = [d1, d1, d1, d2, d2, d3, d3, d3, d3]
>>     baseOrd = [0, 0, 0, 3, 3, 5, 5, 5, 5]
>>     nextBaseOrd = [3, 3, 3, 5, 5, 9, 9, 9, 9]
>>
>> DirectMonotonicReader lets me directly do reader.get(ordinal) for docId,
>> baseOrd, or nextBaseOrd. And I can compute vector count per doc via
>> nextBaseOrd.get(ord) - baseOrd.get(ord).
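>>
>> As a rough sketch, with plain arrays standing in for the three
>> DirectMonotonicReaders (docIds 1, 2, 3 standing in for d1, d2, d3):

```java
// Illustrative sketch: plain arrays stand in for the three
// DirectMonotonicReaders of the prototype; docIds 1, 2, 3 stand in
// for d1, d2, d3.
class MultiVectorSketch {
    static final int[] ordToDoc    = {1, 1, 1, 2, 2, 3, 3, 3, 3};
    static final int[] baseOrd     = {0, 0, 0, 3, 3, 5, 5, 5, 5};
    static final int[] nextBaseOrd = {3, 3, 3, 5, 5, 9, 9, 9, 9};

    static int docId(int ord) {
        return ordToDoc[ord];
    }

    static int vectorCount(int ord) {
        return nextBaseOrd[ord] - baseOrd[ord];
    }

    public static void main(String[] args) {
        System.out.println(docId(4) + " " + vectorCount(4)); // prints 2 2
    }
}
```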
>>
>> I think (hope) DirectMonotonicWriter has clever ways to pack these repeated
>> values efficiently. But I'm wary about the potential redundancy here, and
>> wondering if there are other efficient storage structures I could leverage.
>>
>> Thank you,
>> -Vigya
>>
>
>
> --
> Adrien
>


-- 
- Vigya
