Tim-Brooks opened a new pull request, #16101: URL: https://github.com/apache/lucene/pull/16101
Adds a new Column subtype for columnar addBatch ingestion where the term universe is known upfront (Arrow/Parquet-style). Callers supply a List<BytesRef> dictionary plus per-doc ordinals via OrdinalsCursor (dense) or OrdinalsTupleCursor (sparse / multi-valued). The writer maintains a per-batch ord→hash translation table so each distinct dictionary entry pays one BytesRefHash probe instead of one per doc.

Wires through SortedDocValuesWriter.addOrdinalTuples / addDenseOrdinalValues, SortedSetDocValuesWriter.addOrdinalTuples, IndexingChain.processDictionaryColumn, ColumnValidation, and ColumnFieldAdapter (so stored fields and term inversion work via the existing BinaryColumnAdapter with a dictionary-resolving cursor wrapper).

Tests in TestDictionaryColumn cover SORTED dense/sparse, SORTED_SET (with intra-doc dedup, duplicate dict entries, multi-batch merges), stored binary/string, inverted tokenized/untokenized fields, and validation rejections. Two randomized cross-check tests are included pending investigation of a divergence vs the plain-field oracle.
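
The per-batch ord→hash translation idea can be illustrated with a small standalone sketch. This is not the code from the PR: DictionaryBatchSketch, TermTable, and addDenseBatch are hypothetical names, a HashMap<String, Integer> stands in for BytesRefHash, and resolving dictionary entries lazily (only when a doc first references them) is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryBatchSketch {

  /** Hypothetical stand-in for BytesRefHash: assigns stable ids to terms and counts probes. */
  static final class TermTable {
    private final Map<String, Integer> ids = new HashMap<>();
    int probes = 0;

    int add(String term) {
      probes++; // every call is one hash probe, whether or not the term is new
      Integer existing = ids.get(term);
      if (existing != null) {
        return existing;
      }
      int id = ids.size();
      ids.put(term, id);
      return id;
    }
  }

  /**
   * Ingests one dense batch: docOrds[doc] is an index into this batch's dictionary.
   * Returns the writer-level term id per doc. The translation table lives only for
   * the duration of the batch, so each dictionary entry is probed at most once.
   */
  static int[] addDenseBatch(TermTable table, List<String> dictionary, int[] docOrds) {
    int[] ordToTermId = new int[dictionary.size()];
    Arrays.fill(ordToTermId, -1); // -1 marks "not yet resolved in this batch"
    int[] perDoc = new int[docOrds.length];
    for (int doc = 0; doc < docOrds.length; doc++) {
      int ord = docOrds[doc];
      if (ord < 0 || ord >= ordToTermId.length) {
        throw new IllegalArgumentException("ordinal out of range: " + ord);
      }
      if (ordToTermId[ord] == -1) {
        ordToTermId[ord] = table.add(dictionary.get(ord)); // the only probe for this entry
      }
      perDoc[doc] = ordToTermId[ord];
    }
    return perDoc;
  }

  public static void main(String[] args) {
    TermTable table = new TermTable();
    // The dictionary deliberately contains a duplicate entry ("apple" at ords 0 and 2).
    int[] first = addDenseBatch(table, List.of("apple", "banana", "apple"), new int[] {0, 1, 2, 1, 0});
    System.out.println(Arrays.toString(first)); // prints [0, 1, 0, 1, 0]
    System.out.println(table.probes);           // prints 3 (one per dictionary entry, not one per doc)
  }
}
```

In this sketch, duplicate dictionary entries collapse to the same term id, and a second batch with a different dictionary reuses ids already assigned by earlier batches, since only the translation table is per-batch.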
