Hello Lucene Devs, What is a good way to store an int – int map on disk in Lucene? Keys are not guaranteed to be a continuous sequence of integers. I can sort the map by keys before writing if that helps, but the values won't always be in increasing order. I'd like to be able to efficiently do reader.get(key);
I looked into DirectMonotonicReader/Writer, but it'll need keys to be in the range [0,n) and values to be monotonically increasing. Are there other optimized internal storage structures in Lucene that I can use? ... Context: I've been dabbling with different ways to implement multi-valued vectors in the flat storage format. In my current prototype (i'm at version 4 now), each vector value gets a unique int ordinal in the graph, and a single document can map to multiple vector ordinals. I want to be able to 1) get the docId for a given ordinal, 2) get the first ordinal (called baseOrdinal) for a document, and 3) get the number of vector values for a document. All vectors of a document are written together and get a continuous subset of ordinals. If I know the first ordinal for a document, and the number of vector values, I can get all ordinals and their vector values from flat storage. This would omit the need for a separate child query and parent-block join approach we need today for multi-valued vector fields. In my currently hacky approach – I use a direct monotonic writer and write docId, baseOrdinal, and nextBaseOrdinal values for every ordinal, repeating them where necessary. For e.g., If I had documents {d1 -> 3 vectors, d2 -> 2 vectors, d3 -> 4 vectors}, I would write something like: ordToDoc = [d1, d1, d1, d2, d2, d3, d3, d3, d3] baseOrd = [0, 0, 0, 3, 3, 5, 5, 5, 5] nextBaseOrd = [3, 3, 3, 5, 5, 9, 9, 9, 9] DirectMonotonicReader lets me directly do reader.get(ordinal) for docId, baseOrd, or nextBaseOrd. And I can compute vector count per doc via nextBaseOrd.get(ord) - baseOrd.get(ord). I think (hope) DirectMonotonicWriter has clever ways to pack these docs efficiently. But I'm wary about the potential redundancy here, and wondering if there are other efficient storage structures I could leverage. Thank you, -Vigya