Thanks Adrien! I was struggling to make it work for sparse vector values. One approach is to use it with a DocIndexIterator: I can use ordToDoc to get the docId, advance() my DISI to that docId, and then compute the count as `addresses[disi.index()+1] - addresses[disi.index()]`. But I guess this isn't great for random access by ordinal, since the DISI only moves forward?
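For what it's worth, random access by ordinal doesn't strictly need a DISI: a binary search over the cumulative `addresses` array recovers the owning doc directly. Here is a minimal sketch, with a plain long[] standing in for Lucene's DirectMonotonicReader; the class and method names are illustrative, not Lucene APIs, and it assumes dense doc IDs where every doc has at least one vector.

```java
import java.util.Arrays;

public class AddressesSketch {
    // addresses[d] = cumulative number of vector values before doc d.
    // For {d0 -> 3 vectors, d1 -> 2, d2 -> 4}: [0, 3, 5, 9].
    final long[] addresses;

    AddressesSketch(long[] addresses) {
        this.addresses = addresses;
    }

    long firstOrdinal(int docId) {   // requirement (2): baseOrdinal
        return addresses[docId];
    }

    long valueCount(int docId) {     // requirement (3): number of vectors
        return addresses[docId + 1] - addresses[docId];
    }

    // Requirement (1): find the doc whose ordinal range contains `ord`,
    // i.e. the greatest d with addresses[d] <= ord. No iterator needed.
    int docForOrdinal(long ord) {
        int idx = Arrays.binarySearch(addresses, ord);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the owning doc is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

This is the "binary search on addresses" option from Adrien's reply; storing an explicit ordToDoc array instead trades extra space for an O(1) lookup.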
I'm not entirely sure that I need random access when evaluating all vector values of a document. It would be used for scoring a query, e.g. by taking an aggregate score across all of a document's vectors. But VectorScorer seems to work off of a DISI anyway. It'll become clearer as I implement more pieces. This is good to get me going for now. Thank you for your help.

On Tue, Jan 7, 2025 at 11:42 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hi Vigya,
>
> Getting vectors for a given doc ID sounds similar to what sorted numeric
> doc values do. They store an `addresses` DirectMonotonicReader that stores
> the cumulative number of values before the current doc. Reusing your
> example, this would give:
> addresses = [ 0, 3, 5, 9 ]
> The first ordinal for doc ID `D` is `addresses[D]` and the number of
> values of this doc ID is `addresses[D+1] - addresses[D]`. So this should
> address your requirements (2) and (3).
>
> To get the doc ID given an ordinal (requirement (1)), you could either
> perform a binary search on this `addresses` structure (like stored fields
> do to go from doc ID to block ID), or store an explicit ordToDoc mapping
> like your current prototype does. The former is more storage-efficient,
> the latter is likely faster.
>
> I hope this helps?
>
> On Wed, Jan 8, 2025 at 7:08 AM Vigya Sharma <vigya.w...@gmail.com> wrote:
>
>> Hello Lucene Devs,
>>
>> What is a good way to store an int-to-int map on disk in Lucene? Keys
>> are not guaranteed to be a contiguous sequence of integers. I can sort
>> the map by keys before writing if that helps, but the values won't
>> always be in increasing order.
>> I'd like to be able to efficiently do reader.get(key);
>>
>> I looked into DirectMonotonicReader/Writer, but it needs keys to be in
>> the range [0, n) and values to be monotonically increasing. Are there
>> other optimized internal storage structures in Lucene that I can use?
>>
>> ...
>>
>> Context: I've been experimenting with different ways to implement
>> multi-valued vectors in the flat storage format. In my current prototype
>> (I'm at version 4 now), each vector value gets a unique int ordinal in
>> the graph, and a single document can map to multiple vector ordinals. I
>> want to be able to 1) get the docId for a given ordinal, 2) get the
>> first ordinal (called baseOrdinal) for a document, and 3) get the number
>> of vector values for a document.
>>
>> All vectors of a document are written together and get a contiguous
>> range of ordinals. If I know the first ordinal for a document and the
>> number of vector values, I can get all of its ordinals and their vector
>> values from flat storage. This would remove the need for the separate
>> child query and parent-block join approach we need today for
>> multi-valued vector fields.
>>
>> In my current (hacky) approach, I use a DirectMonotonicWriter and write
>> docId, baseOrdinal, and nextBaseOrdinal values for every ordinal,
>> repeating them where necessary.
>> For example, if I had documents {d1 -> 3 vectors, d2 -> 2 vectors,
>> d3 -> 4 vectors}, I would write something like:
>> ordToDoc = [d1, d1, d1, d2, d2, d3, d3, d3, d3]
>> baseOrd = [0, 0, 0, 3, 3, 5, 5, 5, 5]
>> nextBaseOrd = [3, 3, 3, 5, 5, 9, 9, 9, 9]
>>
>> DirectMonotonicReader lets me directly do reader.get(ordinal) for docId,
>> baseOrd, or nextBaseOrd, and I can compute the vector count per doc via
>> nextBaseOrd.get(ord) - baseOrd.get(ord).
>>
>> I think (hope) DirectMonotonicWriter has clever ways to pack these
>> values efficiently. But I'm wary of the potential redundancy here, and
>> wondering if there are other efficient storage structures I could
>> leverage.
>>
>> Thank you,
>> -Vigya
>>
>
> --
> Adrien

--
- Vigya
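To make the redundancy concrete, here is a small sketch (plain Java arrays rather than Lucene's DirectMonotonic classes; all names are illustrative) showing that the three per-ordinal arrays in the prototype can be derived entirely from the single cumulative `addresses` array Adrien describes, which needs one entry per doc instead of three per ordinal:

```java
public class LayoutComparison {

    // Cumulative count of vector values before each doc:
    // {3, 2, 4} vectors per doc -> addresses = [0, 3, 5, 9].
    static long[] buildAddresses(int[] vectorsPerDoc) {
        long[] addresses = new long[vectorsPerDoc.length + 1];
        for (int d = 0; d < vectorsPerDoc.length; d++) {
            addresses[d + 1] = addresses[d] + vectorsPerDoc[d];
        }
        return addresses;
    }

    // The prototype's per-ordinal arrays {ordToDoc, baseOrd, nextBaseOrd},
    // derived from addresses: each doc's values repeat once per ordinal,
    // which is exactly the redundancy in question.
    static long[][] buildPerOrdinal(long[] addresses) {
        int numOrds = (int) addresses[addresses.length - 1];
        long[] ordToDoc = new long[numOrds];
        long[] baseOrd = new long[numOrds];
        long[] nextBaseOrd = new long[numOrds];
        for (int d = 0; d + 1 < addresses.length; d++) {
            for (long ord = addresses[d]; ord < addresses[d + 1]; ord++) {
                ordToDoc[(int) ord] = d;
                baseOrd[(int) ord] = addresses[d];
                nextBaseOrd[(int) ord] = addresses[d + 1];
            }
        }
        return new long[][] {ordToDoc, baseOrd, nextBaseOrd};
    }
}
```

Both layouts answer the same three questions; e.g. the vector count for the doc owning ordinal `ord` is `nextBaseOrd[ord] - baseOrd[ord]` in the prototype and `addresses[doc + 1] - addresses[doc]` in the cumulative form, but the latter stores roughly numDocs entries instead of 3 * numOrds.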