+1, I like the idea of having a proper extensibility mechanism. Adding
sentinel values to signal whether a value is present feels fragile to me.
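
To make the contrast with sentinels concrete, here is a rough sketch of the
typed lookup from Alan's proposal (quoted below). Only the name SkipType and
the "return null if unsupported" contract come from the proposal. Skipper,
RangeSkipper, CountSkipper and SketchLeafReader are hypothetical stand-ins,
not Lucene classes, and the map lookup is just one assumed way a codec could
decide what it supports.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DocValuesSkipper; not the real Lucene class.
interface Skipper {}

// Hypothetical "basic" skipper carrying a value range.
final class RangeSkipper implements Skipper {
  final long min;
  final long max;
  RangeSkipper(long min, long max) { this.min = min; this.max = max; }
}

// Hypothetical extension that only some codecs would write.
final class CountSkipper implements Skipper {
  final int docCount;
  CountSkipper(int docCount) { this.docCount = docCount; }
}

// Typed key: the type parameter ties each key to the skipper it yields.
final class SkipType<T extends Skipper> {
  static final SkipType<RangeSkipper> RANGE = new SkipType<>(RangeSkipper.class);
  static final SkipType<CountSkipper> COUNT = new SkipType<>(CountSkipper.class);

  final Class<T> skipperClass;
  private SkipType(Class<T> skipperClass) { this.skipperClass = skipperClass; }
}

// Hypothetical reader for a single field of a single segment.
final class SketchLeafReader {
  private final Map<SkipType<?>, Skipper> supported = new HashMap<>();

  // Registers the data this (assumed) codec wrote for the field.
  <T extends Skipper> SketchLeafReader with(SkipType<T> type, T skipper) {
    supported.put(type, skipper);
    return this;
  }

  // Returns null when the codec has nothing for this type, so absence is
  // explicit rather than encoded as a magic value inside the skipper.
  <T extends Skipper> T getDocValuesSkipper(SkipType<T> type) {
    return type.skipperClass.cast(supported.get(type));
  }
}
```

A caller asking for SkipType.COUNT on a segment whose codec only wrote range
data gets null back and can fall back, instead of having to interpret
something like a docCount of -1.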

I do think we should follow the Points and Terms design and have an
intermediate object that allows accessing the static metadata of an index
without having to create any search data structures.
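
Something along these lines is what I have in mind, as a sketch only. The
names FieldDocValuesMetadata and BlockSkipper are made up for illustration,
and the skipperLoads counter exists only to show that the metadata accessors
never touch the skipper.

```java
import java.io.IOException;

// Hypothetical stand-in for the block-level skipper; not a Lucene class.
interface BlockSkipper {
  int numLevels();
}

// Hypothetical intermediate object, analogous in spirit to Points or Terms:
// static per-field metadata is held in memory, the skipper is built on demand.
final class FieldDocValuesMetadata {
  private final long min;
  private final long max;
  private final int docCount;

  // Counts skipper creations; present only to make the laziness observable.
  int skipperLoads = 0;

  FieldDocValuesMetadata(long min, long max, int docCount) {
    this.min = min;
    this.max = max;
    this.docCount = docCount;
  }

  // Metadata accessors: no IO, so no throws clause.
  long minValue() { return min; }
  long maxValue() { return max; }
  int docCount() { return docCount; }

  // Only this method declares IO. A real implementation would read skip
  // data from the index here; the sketch just returns a trivial skipper.
  BlockSkipper getSkipper() throws IOException {
    skipperLoads++;
    return () -> 1;
  }
}
```

The point of the split is that a caller who only needs min, max or docCount
never calls getSkipper() and so never pays for the search data structures.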

Cheers,

Ignacio

On Fri, May 15, 2026 at 11:00 AM Alan Woodward <[email protected]> wrote:

> Hi folks,
>
> We have a few open PRs adding new data to the DocValuesSkipper interface
> (e.g. https://github.com/apache/lucene/pull/15993,
> https://github.com/apache/lucene/pull/15737), and other open issues
> discussing adding more (https://github.com/apache/lucene/issues/15884).
> We also have some ideas here at Elastic for other bits of information that
> would be useful in highly specific circumstances but not really in the
> general case.  These all run into issues with backwards compatibility, and
> questions of how to reliably signal to clients what data is available for a
> given field and segment.
>
> One idea I had that would make this a bit more pluggable, and allow Codecs
> to add additional block-based data without having to alter the base API too
> much, is to add a SkipType object which would be passed to the LeafReader
> like so:
>
> <T extends DocValuesSkipper> T getDocValuesSkipper(SkipType<T> type)
>
> The codec would check the class of the SkipType and see if it knows how to
> return that information.  If yes, it returns an instance of T, if not it
> returns null.  The default type would be a Range<DocValuesSkipper>, which
> would return the basic DocValuesSkipper that we have now, but we can extend
> things with a Count or Cardinality type.  On the indexing side, the
> FieldInfo could record the SkipType so that the codec knows what metadata
> to generate.
>
> Some of these bits of information are useful both as global metadata and
> as part of a skip block; some are only really relevant at the global
> level.  Tying into the work that Ignacio is doing in
> https://github.com/apache/lucene/issues/16052, the global metadata tends
> to be loaded at segment open time and so can be accessed cheaply without
> doing any IO, but because it is part of the general DocValuesSkipper object
> it can only be accessed by calling LeafReader.getDocValuesSkipper() which
> loads a bunch of extra data (and declares that it does IO via its throws
> clause).
>
> We could add an intermediate object here, analogous to Points or Terms,
> called DocValues (or something similar, I know this is already a class with
> static helper methods on it); this would make the global min, max and
> docCount (and maybe cardinality) available without having to do any further
> IO, and the getSkipper() method could optionally be moved onto the
> intermediate object.
>
> What do people think?
>
> - Alan
