I like the idea of a pluggable skipper system that makes it easy for
anyone to introduce additional metadata, rather than relying on one
monolithic DocValuesSkipper. I also have some ideas for storing additional
metadata to improve doc values performance.
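
To make the typed lookup from Alan's proposal (quoted below) a bit more
concrete, here is a rough, self-contained sketch. Everything in it is
hypothetical: Skipper stands in for DocValuesSkipper, and CountSkipper,
SkipType and SkipperRegistry are names I made up for illustration, not
existing Lucene classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Lucene's DocValuesSkipper; only the shape matters here.
interface Skipper {}

// Hypothetical richer skipper that a codec may or may not support.
interface CountSkipper extends Skipper {
  long blockCount();
}

// Hypothetical typed key: the type parameter ties the key to the skipper it yields.
final class SkipType<T extends Skipper> {
  final String name;
  final Class<T> skipperClass;

  SkipType(String name, Class<T> skipperClass) {
    this.name = name;
    this.skipperClass = skipperClass;
  }
}

// Hypothetical per-field lookup, standing in for what a codec reader would do.
final class SkipperRegistry {
  private final Map<SkipType<?>, Skipper> skippers = new HashMap<>();

  <T extends Skipper> void register(SkipType<T> type, T skipper) {
    skippers.put(type, skipper);
  }

  // Returns null when this kind of metadata was not written for the field.
  <T extends Skipper> T getSkipper(SkipType<T> type) {
    Skipper skipper = skippers.get(type);
    return skipper == null ? null : type.skipperClass.cast(skipper);
  }
}
```

Callers get a correctly typed skipper without casting, and a null result
tells them the metadata is unavailable, so no sentinel values are needed.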

I also wonder whether we can make per-block operations pluggable. Today we
fetch metadata, classify the block, then read and decode values separately
through virtual calls. With a per-block evaluator (e.g.,
skipper.getBlockEvaluator().evaluateRange()), we could push predicate
evaluation down to the codec and combine decoding and comparison in a single
pass, probably eliminating per-element virtual dispatch along the way. This
might also unlock more SIMD benefits.
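
As a rough illustration of what I mean by a per-block evaluator, here is a
minimal sketch. It assumes a block that stores values as deltas from the
block minimum; BlockEvaluator and DeltaBlock are hypothetical names, not
existing Lucene APIs, and a real codec would work on packed data rather
than an int[].

```java
import java.util.BitSet;

// Hypothetical per-block evaluator; not an existing Lucene interface.
interface BlockEvaluator {
  // Sets bit i in matches for each position i whose value lies in [lower, upper];
  // returns the number of matching positions.
  int evaluateRange(long lower, long upper, BitSet matches);
}

// Hypothetical block that stores values as non-negative deltas from the block minimum.
final class DeltaBlock implements BlockEvaluator {
  private final long min;
  private final long max;
  private final int[] deltas;

  DeltaBlock(long[] values) {
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long v : values) {
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    this.min = lo;
    this.max = hi;
    this.deltas = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      // Throws ArithmeticException if the block's value range does not fit in an int.
      deltas[i] = Math.toIntExact(Math.subtractExact(values[i], lo));
    }
  }

  @Override
  public int evaluateRange(long lower, long upper, BitSet matches) {
    if (upper < min || lower > max) {
      return 0; // the whole block is outside the range
    }
    if (lower <= min && upper >= max) {
      matches.set(0, deltas.length); // the whole block is inside the range
      return deltas.length;
    }
    // Translate the bounds into delta space once, then compare the stored
    // deltas directly instead of decoding each value first.
    long lo = Math.max(lower, min) - min;
    long hi = Math.min(upper, max) - min;
    int count = 0;
    for (int i = 0; i < deltas.length; i++) {
      int d = deltas[i];
      if (d >= lo && d <= hi) {
        matches.set(i);
        count++;
      }
    }
    return count;
  }
}
```

The caller makes one virtual call per block, and the inner loop is a plain
comparison over an int array, which is the kind of loop I would hope the
JIT can auto-vectorize.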

On Mon, May 18, 2026 at 4:52 AM Ignacio Vera <[email protected]> wrote:

> +1 I like the idea of having a proper extensibility mechanism. I feel
> adding sentinel values to signal if a value is present or not is fragile.
>
> I do think we should follow the Points and Terms design and have an
> intermediate object that allows accessing the static metadata of an index
> without having to create any search data structures.
>
> Cheers,
>
> Ignacio
>
> On Fri, May 15, 2026 at 11:00 AM Alan Woodward <[email protected]>
> wrote:
>
>> Hi folks,
>>
>> We have a few open PRs adding new data to the DocValuesSkipper interface
>> (e.g. https://github.com/apache/lucene/pull/15993,
>> https://github.com/apache/lucene/pull/15737), and other open issues
>> discussing adding more (https://github.com/apache/lucene/issues/15884).
>> We also have some ideas here at Elastic for other bits of information that
>> would be useful in highly specific circumstances but not really in the
>> general case.  These all run into issues with backwards compatibility, and
>> questions of how to reliably signal to clients what data is available for a
>> given field and segment.
>>
>> One idea I had that would make this a bit more pluggable, and allow
>> Codecs to add additional block-based data without having to alter the base
>> API too much, is to add a SkipType object which would be passed to the
>> LeafReader like so:
>>
>> <T extends DocValuesSkipper> T getDocValuesSkipper(SkipType<T> type)
>>
>> The codec would check the class of the SkipType and see if it knows how
>> to return that information.  If yes, it returns an instance of T, if not it
>> returns null.  The default type would be a Range<DocValuesSkipper>, which
>> would return the basic DocValuesSkipper that we have now, but we can extend
>> things with a Count or Cardinality type.  On the indexing side, the
>> FieldInfo could record the SkipType so that the codec knows what metadata
>> to generate.
>>
>> Some of these bits of information are useful both as global metadata and
>> as part of a skip block; some are only really relevant at the global
>> level.  Tying into the work that Ignacio is doing in
>> https://github.com/apache/lucene/issues/16052, the global metadata tends
>> to be loaded at segment open time and so can be accessed cheaply without
>> doing any IO, but because it is part of the general DocValuesSkipper object
>> it can only be accessed by calling LeafReader.getDocValuesSkipper() which
>> loads a bunch of extra data (and declares that it does IO via its throws
>> clause).
>>
>> We could add an intermediate object here, analogous to Points or Terms,
>> called DocValues (or something similar, I know this is already a class with
>> static helper methods on it); this would make the global min, max and
>> docCount (and maybe cardinality) available without having to do any further
>> IO, and the getSkipper() method could optionally be moved onto the
>> intermediate object.
>>
>> What do people think?
>>
>> - Alan
