[PR] Add DictionaryColumn for SORTED / SORTED_SET dvs [lucene]

via GitHub Thu, 21 May 2026 20:34:27 -0700


Tim-Brooks opened a new pull request, #16101:
URL: https://github.com/apache/lucene/pull/16101


   Adds a new Column subtype for columnar addBatch ingestion where the
   term universe is known upfront (Arrow/Parquet-style). Callers supply
   a List<BytesRef> dictionary plus per-doc ordinals via OrdinalsCursor
   (dense) or OrdinalsTupleCursor (sparse / multi-valued). The writer
   maintains a per-batch ord→hash translation table so each distinct
   dictionary entry pays one BytesRefHash probe instead of one per doc.
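
The per-batch translation idea above can be sketched without any Lucene dependency. This is a minimal illustration under stated assumptions, not the PR's implementation: `OrdTranslationSketch`, `TermTable`, and `translateBatch` are hypothetical names, and a `HashMap` over strings stands in for `BytesRefHash`. It shows only that the number of hash probes tracks the number of distinct dictionary ordinals seen in a batch rather than the number of documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the PR or from Lucene.
class OrdTranslationSketch {

  /** Stand-in for a term hash: assigns an id per distinct term and counts probes. */
  static final class TermTable {
    private final Map<String, Integer> ids = new HashMap<>();
    int probes = 0;

    int add(String term) {
      probes++;
      Integer id = ids.get(term);
      if (id == null) {
        id = ids.size();
        ids.put(term, id);
      }
      return id;
    }
  }

  /** Translates per-doc dictionary ordinals to term ids, probing once per distinct ordinal. */
  static int[] translateBatch(List<String> dictionary, int[] docOrds, TermTable table) {
    int[] ordToId = new int[dictionary.size()];
    Arrays.fill(ordToId, -1); // -1 marks an ordinal not yet resolved in this batch
    int[] out = new int[docOrds.length];
    for (int doc = 0; doc < docOrds.length; doc++) {
      int ord = docOrds[doc];
      if (ordToId[ord] == -1) {
        ordToId[ord] = table.add(dictionary.get(ord)); // the only place the hash is probed
      }
      out[doc] = ordToId[ord];
    }
    return out;
  }

  public static void main(String[] args) {
    // The dictionary deliberately contains a duplicate entry ("apple" at ordinals 0 and 2).
    List<String> dictionary = List.of("apple", "banana", "apple");
    int[] docOrds = {0, 1, 0, 2, 1, 1};
    TermTable table = new TermTable();
    int[] ids = translateBatch(dictionary, docOrds, table);
    System.out.println(Arrays.toString(ids)); // [0, 1, 0, 0, 1, 1]
    System.out.println(table.probes);         // 3 probes for 6 docs
  }
}
```

In this sketch the duplicate dictionary entry costs one extra probe but resolves to the same id, so both ordinals map to a single term.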
   
   Wires the new column through SortedDocValuesWriter.addOrdinalTuples /
   addDenseOrdinalValues, SortedSetDocValuesWriter.addOrdinalTuples,
   IndexingChain.processDictionaryColumn, ColumnValidation, and
   ColumnFieldAdapter, so that stored fields and term inversion reuse
   the existing BinaryColumnAdapter via a dictionary-resolving cursor
   wrapper.
   
   Tests in TestDictionaryColumn cover SORTED dense/sparse, SORTED_SET
   (with intra-doc dedup, duplicate dict entries, multi-batch merges),
   stored binary/string, inverted tokenized/untokenized fields, and
   validation rejections. Two randomized cross-check tests are
   included pending investigation of a divergence vs the plain-field
   oracle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


