Hi Vigya, Getting vectors for a given doc ID sounds similar to what sorted numeric doc values do. They store an `addresses` DirectMonotonicReader that stores the cumulative number of values before the current doc. Reusing your example, this would give: addresses = [ 0, 3, 5, 9 ] The first ordinal for doc ID `D` is `addresses[D]` and the number of values of this doc ID is `addresses[D+1] - addresses[D]`. So this should address your requirements (2) and (3).
To get the doc ID given an ordinal (requirement (1)), you could either perform a binary search on this `addresses` structure (like stored fields do to go from doc ID to block ID), or store an explicit ordToDoc mapping like your current prototype does. The former is more storage-efficient, the latter is likely faster. I hope this helps? On Wed, Jan 8, 2025 at 7:08 AM Vigya Sharma <vigya.w...@gmail.com> wrote: > Hello Lucene Devs, > > What is a good way to store an int – int map on disk in Lucene? Keys are > not guaranteed to be a continuous sequence of integers. I can sort the map > by keys before writing if that helps, but the values won't always be in > increasing order. > I'd like to be able to efficiently do reader.get(key); > > I looked into DirectMonotonicReader/Writer, but it'll need keys to be in > the range [0,n) and values to be monotonically increasing. Are there other > optimized internal storage structures in Lucene that I can use? > > ... > > Context: I've been dabbling with different ways to implement multi-valued > vectors in the flat storage format. In my current prototype (i'm at version > 4 now), each vector value gets a unique int ordinal in the graph, and a > single document can map to multiple vector ordinals. I want to be able to > 1) get the docId for a given ordinal, 2) get the first ordinal (called > baseOrdinal) for a document, and 3) get the number of vector values for a > document. > > All vectors of a document are written together and get a continuous subset > of ordinals. If I know the first ordinal for a document, and the number of > vector values, I can get all ordinals and their vector values from flat > storage. This would omit the need for a separate child query and > parent-block join approach we need today for multi-valued vector fields. > > In my currently hacky approach – I use a direct monotonic writer and write > docId, baseOrdinal, and nextBaseOrdinal values for every ordinal, repeating > them where necessary. > For e.g., If I had documents {d1 -> 3 vectors, d2 -> 2 vectors, d3 -> 4 > vectors}, I would write something like: > ordToDoc = [d1, d1, d1, d2, d2, d3, d3, d3, d3] > baseOrd = [0, 0, 0, 3, 3, 5, 5, 5, 5] > nextBaseOrd = [3, 3, 3, 5, 5, 9, 9, 9, 9] > > DirectMonotonicReader lets me directly do reader.get(ordinal) for docId, > baseOrd, or nextBaseOrd. And I can compute vector count per doc via > nextBaseOrd.get(ord) - baseOrd.get(ord). > > I think (hope) DirectMonotonicWriter has clever ways to pack these docs > efficiently. But I'm wary about the potential redundancy here, and > wondering if there are other efficient storage structures I could leverage. > > Thank you, > -Vigya > -- Adrien