I also like the idea of having a pluggable skipper system where it is easy for anyone to introduce additional metadata, instead of using one monolithic DocValuesSkipper. I also have some ideas to store additional metadata to improve the performance for doc values.
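
As a rough illustration of the pluggable lookup, here is a minimal sketch of the type-keyed approach from Alan's proposal quoted below. Everything in it is a hypothetical stand-in written for this example (Skipper, RangeSkipper, CountSkipper, SkipperSource, and this shape of SkipType); none of it is an existing Lucene class or signature. It only shows the "return an instance of T, or null if the codec does not know that type" behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the base skipper type; not the Lucene class. */
interface Skipper {}

/** Basic range metadata, roughly the kind of data a skipper exposes today. */
final class RangeSkipper implements Skipper {
  final long minValue;
  final long maxValue;

  RangeSkipper(long minValue, long maxValue) {
    this.minValue = minValue;
    this.maxValue = maxValue;
  }
}

/** Optional extra metadata that only some codecs would choose to write. */
final class CountSkipper implements Skipper {
  final int docCount;

  CountSkipper(int docCount) {
    this.docCount = docCount;
  }
}

/** Typed key: the caller names the kind of metadata it wants. */
final class SkipType<T extends Skipper> {
  static final SkipType<RangeSkipper> RANGE = new SkipType<>(RangeSkipper.class);
  static final SkipType<CountSkipper> COUNT = new SkipType<>(CountSkipper.class);

  final Class<T> skipperClass;

  private SkipType(Class<T> skipperClass) {
    this.skipperClass = skipperClass;
  }
}

/** Stand-in for the codec-side producer for one field in one segment. */
final class SkipperSource {
  private final Map<SkipType<?>, Skipper> available = new HashMap<>();

  /** Called by the (hypothetical) codec for each kind of metadata it wrote. */
  <T extends Skipper> void register(SkipType<T> type, T skipper) {
    available.put(type, skipper);
  }

  /** Returns the requested skipper, or null if this codec did not write it. */
  <T extends Skipper> T getSkipper(SkipType<T> type) {
    // Class.cast(null) returns null, so an unknown type falls through cleanly.
    return type.skipperClass.cast(available.get(type));
  }
}
```

A caller would ask for SkipType.COUNT, check for null, and fall back to the plain range skipper when the codec has nothing to offer, so no sentinel values are needed to signal missing data.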
Also, I wonder if we can make per-block operations pluggable too. Today we fetch metadata, classify the block, then read and decode values separately through virtual calls. With a per-block evaluator (e.g., skipper.getBlockEvaluator().evaluateRange()), we could push predicate evaluation down to the codec, combine decoding and comparison in a single pass, probably eliminating per-element virtual dispatch along the way. This might also unlock more SIMD benefits.

On Mon, May 18, 2026 at 4:52 AM Ignacio Vera <[email protected]> wrote:

> +1 I like the idea of having a proper extensibility mechanism. I feel
> adding sentinel values to signal if a value is present or not is fragile.
>
> I do think we should follow the Points and Terms design and have an
> intermediate object that allows accessing the static metadata of an index
> without having to create any search data structures.
>
> Cheers,
>
> Ignacio
>
> On Fri, May 15, 2026 at 11:00 AM Alan Woodward <[email protected]>
> wrote:
>
>> Hi folks,
>>
>> We have a few open PRs adding new data to the DocValuesSkipper interface
>> (eg https://github.com/apache/lucene/pull/15993,
>> https://github.com/apache/lucene/pull/15737), and other open issues
>> discussing adding more (https://github.com/apache/lucene/issues/15884).
>> We also have some ideas here at elastic for other bits of information that
>> would be useful in highly specific circumstances but not really in the
>> general case. These all run into issues with backwards compatibility, and
>> questions of how to reliably signal to clients what data is available for a
>> given field and segment.
>>
>> One idea I had that would make this a bit more pluggable, and allow
>> Codecs to add additional block-based data without having to alter the base
>> API too much, is to add a SkipType object which would be passed to the
>> LeafReader like so:
>>
>> T getDocValuesSkipper(SkipType<T extends DocValuesSkipper> type)
>>
>> The codec would check the class of the SkipType and see if it knows how
>> to return that information. If yes, it returns an instance of T, if not it
>> returns null. The default type would be a Range<DocValuesSkipper>, which
>> would return the basic DocValuesSkipper that we have now, but we can extend
>> things with a Count or Cardinality type. On the indexing side, the
>> FieldInfo could record the SkipType so that the codec knows what metadata
>> to generate.
>>
>> Some of these bits of information are useful both as global metadata and
>> as part of a skip block; some are only really relevant at the global
>> level. Tying into the work that Ignacio is doing in
>> https://github.com/apache/lucene/issues/16052, the global metadata tends
>> to be loaded at segment open time and so can be accessed cheaply without
>> doing any IO, but because it is part of the general DocValuesSkipper object
>> it can only be accessed by calling LeafReader.getDocValuesSkipper() which
>> loads a bunch of extra data (and declares that it does IO via its throws
>> clause).
>>
>> We could add an intermediate object here, analogous to Points or Terms,
>> called DocValues (or something similar, I know this is already a class with
>> static helper methods on it); this would make the global min, max and
>> docCount (and maybe cardinality) available without having to do any further
>> IO, and the getSkipper() method could optionally be moved onto the
>> intermediate object.
>>
>> What do people think?
>>
>> - Alan
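
To make the per-block evaluator idea from the top of this message a little more concrete, here is a minimal sketch under stated assumptions. BlockEvaluator and DeltaBlock are hypothetical names, and the toy encoding (a per-block minimum plus non-negative int offsets) is chosen only for illustration; it is not how any Lucene codec stores doc values. The point is that the range bounds are translated into the encoded space once per block, so the loop compares raw encoded ints and there is one interface call per block rather than one per value.

```java
import java.util.BitSet;

/** Hypothetical per-block evaluator; the method name follows the message, the shape is assumed. */
interface BlockEvaluator {
  /** Sets bit (docBase + i) for every value i in the block with lower <= value <= upper. */
  void evaluateRange(long lower, long upper, int docBase, BitSet matches);
}

/** Toy block: each value is stored as a non-negative offset from the block minimum. */
final class DeltaBlock implements BlockEvaluator {
  private final long min;
  private final int[] deltas;

  DeltaBlock(long min, int[] deltas) {
    this.min = min;
    this.deltas = deltas;
  }

  @Override
  public void evaluateRange(long lower, long upper, int docBase, BitSet matches) {
    if (upper < min) {
      return; // the whole block lies above the requested range
    }
    // Translate the bounds into delta space once per block.
    long lo = lower <= min ? 0L : lower - min;
    long hi = upper - min;
    if (lo < 0) {
      return; // lower - min overflowed, so lower exceeds every value in the block
    }
    if (hi < 0) {
      hi = Long.MAX_VALUE; // upper - min overflowed, so upper exceeds every value
    }
    // Decode and compare in one pass: the decoded long is never materialised.
    for (int i = 0; i < deltas.length; i++) {
      int d = deltas[i];
      if (d >= lo && d <= hi) {
        matches.set(docBase + i);
      }
    }
  }
}
```

Whether a loop like this gets vectorised depends on the JIT and on how matches are recorded, so any SIMD gain here is an expectation to measure rather than something the sketch demonstrates.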
