Re: Require consistency between different data-structures sharing the same field name as of 9.0?

Michael McCandless Tue, 21 Apr 2020 09:55:25 -0700

+1

We already have a minimal enforcement of consistent schema in IndexWriter.
It checks that you are not trying to change the doc values type for a given
field, e.g. from NUMERIC to SORTED.


Maybe we could add these new checks there?  E.g., if this field was
previously indexed with both doc values and points, then require that
future documents also include both doc values (of same type) and points (of
same dimensionality), or neither (that doc is missing the field)?
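
A minimal sketch of the check described above, written as a plain-Java toy model rather than as actual IndexWriter code. The names FieldSchemaCheck, DocValuesKind and check are hypothetical and exist only for illustration. The sketch assumes the "shape" of a field is just its doc values type plus its point dimension count:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a per-field consistency check. Not Lucene code; all names are hypothetical. */
final class FieldSchemaCheck {

  /** Stand-in for the doc values types a field can use (NONE = no doc values). */
  enum DocValuesKind { NONE, NUMERIC, SORTED_NUMERIC, SORTED }

  /** The shape first seen for a field: its doc values type and point dimension count. */
  private static final class Shape {
    final DocValuesKind docValues;
    final int pointDimensions;

    Shape(DocValuesKind docValues, int pointDimensions) {
      this.docValues = docValues;
      this.pointDimensions = pointDimensions;
    }
  }

  private final Map<String, Shape> shapes = new HashMap<>();

  /**
   * Records the shape the first time a field is seen, and rejects later documents
   * whose shape differs. A document that omits the field entirely is always accepted.
   */
  void check(String field, DocValuesKind docValues, int pointDimensions) {
    if (docValues == DocValuesKind.NONE && pointDimensions == 0) {
      return; // this document is simply missing the field
    }
    Shape previous = shapes.putIfAbsent(field, new Shape(docValues, pointDimensions));
    if (previous == null) {
      return; // the first document to use this field defines its shape
    }
    if (previous.docValues != docValues || previous.pointDimensions != pointDimensions) {
      throw new IllegalArgumentException(
          "field \"" + field + "\" was indexed with docValues=" + previous.docValues
              + ", pointDimensions=" + previous.pointDimensions
              + " but this document has docValues=" + docValues
              + ", pointDimensions=" + pointDimensions);
    }
  }
}
```

In this model a document that supplies NUMERIC doc values and one point dimension fixes the field's shape. A later document that omits the field passes, while one that supplies only doc values (zero point dimensions) is rejected.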

And also add a "sugar" oal.document.FieldXXX class that enables both doc
values and points, and is the "obvious" class to use to index numeric values
that you want to range filter and sort on.

I agree it's tricky to confirm that the same original numeric value is in
the points data (which is already a packed byte[] by the time IW sees it)
and the doc values data.
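
To illustrate why this is tricky, here is a small plain-Java sketch of the kind of encoding a long point goes through. It is modeled on the usual sortable-bytes scheme (flip the sign bit, then write big-endian) and is an assumption for illustration, not a copy of Lucene's code; the class name SortableLongBytes is hypothetical. The doc values side keeps the raw long, so comparing the two would require knowing which encoding, if any, produced the bytes:

```java
/** Toy model of a sortable byte encoding for longs. Not Lucene code; names are hypothetical. */
final class SortableLongBytes {

  /** Encodes a long so that unsigned byte-wise order matches signed numeric order. */
  static byte[] encode(long value) {
    long flipped = value ^ 0x8000000000000000L; // flip the sign bit so negatives sort first
    byte[] out = new byte[Long.BYTES];
    for (int i = 0; i < Long.BYTES; i++) {
      out[i] = (byte) (flipped >>> (56 - 8 * i)); // big-endian: most significant byte first
    }
    return out;
  }

  /** Reverses encode(); only valid if the bytes really came from this encoding. */
  static long decode(byte[] bytes) {
    long flipped = 0;
    for (int i = 0; i < Long.BYTES; i++) {
      flipped = (flipped << 8) | (bytes[i] & 0xFFL);
    }
    return flipped ^ 0x8000000000000000L;
  }
}
```

A check that only sees the byte[] cannot tell whether it holds one encoded long, an encoded double, or several packed dimensions, which is why validating values is much harder than validating which data-structures are present.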

We would need some BWC to handle pre-8.x indices that did not record that
they in fact had indexed the same values as points and doc values.  Maybe
they just don't get this optimization, and the caller must construct the
full query themselves, as is required today?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 21, 2020 at 6:50 AM Adrien Grand <jpou...@gmail.com> wrote:

> Thanks all, the initial feedback sounds positive so I opened
> https://issues.apache.org/jira/browse/LUCENE-9334. As Mike pointed out,
> this could be a first step towards having something that looks more like a
> proper schema.
>
> I like the idea of exposing higher-level abstractions, like a field that
> would index as points, add doc values (but which ones? Numeric or
> SortedNumeric?) and maybe stored fields all at once.
>
> On Mon, Apr 20, 2020 at 6:12 PM David Smiley <david.w.smi...@gmail.com>
> wrote:
>
>> What Alan says makes sense to me -- simple and sufficiently addresses the
>> pain point Adrien raises.  I don't think we need a long class name which is
>> also some extra class in addition to the current one (confusing); I'd
>> prefer one simple numeric class name (per int/long/float/double) with a
>> constructor that indicates indexed/docValues.
>>
>> I also agree with Michael Sokolov; it's crazy Lucene FieldInfos doesn't
>> have some basic numeric type metadata.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Mon, Apr 20, 2020 at 10:15 AM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> Could we use this as a stepping stone towards a schema? Just a very
>>> lightweight schema that only enforces what we can easily enforce
>>> today, but put some minimal abstraction in place where we can hang
>>> future consistency checks.
>>>
>>> Re: value consistency; could we do a best-effort enforcement in
>>> DefaultIndexingChain where we have the values unencoded?
>>>
>>> On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <romseyg...@gmail.com>
>>> wrote:
>>> >
>>> > One way of doing this might be to add an additional field type that
>>> adds both point and docvalues, and then have factory methods for queries
>>> and sorts on the field type.  So for example a LongPointAndValue field
>>> would automatically index its value into both BKD and NumericDocValues, and
>>> then LongPointAndValue#newRangeQuery() would build the relevant
>>> IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a
>>> sort field that can use the shortcuts.
>>> >
>>> > On 20 Apr 2020, at 14:10, Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > Hello,
>>> >
>>> > Lucene currently doesn't require consistency across data-structures.
>>> For instance it is possible to have different values in points and doc
>>> values under the same field name. Until now, we worked around it either by
>>> making features use a single data-structure, e.g. facets only use doc
>>> values, or by pushing the responsibility of having consistent data across
>>> data-structures to the user, e.g. IndexOrDocValuesQuery requires that the
>>> point query and the doc-value query match the same documents and it's the
>>> responsibility of the user to ensure this.
>>> >
>>> > I'm unhappy that it makes Lucene very hard to use. Creating an
>>> efficient range query should be a one-liner, but due to this limitation,
>>> users have to first learn about LongPoint#newRangeQuery,
>>> NumericDocValuesField#newSlowRangeQuery and then combine them with
>>> IndexOrDocValuesQuery or maybe even
>>> IndexSortSortedNumericDocValuesRangeQuery. If Lucene had a requirement that
>>> if a field both has points and numeric doc values then both data-structures
>>> contain the same content, then we could automatically use the
>>> IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing
>>> that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
>>> >
>>> > This question is being raised again as we are working on dynamically
>>> pruning uncompetitive hits when sorting by field by leveraging the points
>>> index.[1] This can produce very significant speedups but again requires
>>> that the same data be indexed in points and doc values.
>>> >
>>> > [1] https://github.com/apache/lucene-solr/pull/1351
>>> >
>>> > We had discussions about adding a notion of schema to Lucene in the
>>> past, see e.g. [2]. This seems desirable to me but also a high hanging
>>> fruit and possibly controversial, so my short term proposal would instead
>>> be to:
>>> >  - Require documents to be consistent in the data-structures that they
>>> use: you can't have one document using only points for a field and
>>> another document using only doc values for the same field. Of course it
>>> would still be possible to index documents that have neither points nor doc
>>> values indexed even if previous documents had either enabled in order to
>>> handle documents with missing values properly.
>>> >  - Don't hesitate to rely on consistency across fields when
>>> implementing new functionality, i.e. LongPoint#newRangeQuery would check
>>> whether the FieldInfo has numeric doc values, and if so would automatically
>>> enable the IndexOrDocValuesQuery and
>>> IndexSortSortedNumericDocValuesRangeQuery optimizations.
>>> >
>>> > [2] https://issues.apache.org/jira/browse/LUCENE-6005
>>> >
>>> > Checking that documents have the same values sounds desirable to me
>>> but also challenging due to how we sometimes encode data on top of the
>>> Lucene APIs, e.g. longs become byte[] in the points index, geo points
>>> become a single long in doc values, and we have a few use-cases where we
>>> encode multiple values into a single BinaryDocValuesField in Elasticsearch to
>>> work around the absence of multi-value binary doc values support. I think
>>> it'd be acceptable to not validate values but still expect consistency in
>>> our search APIs?
>>> >
>>> > What do you think?
>>> >
>>> > --
>>> > Adrien
>>> >
>>> >
>>>
>
> --
> Adrien
>
