Re: Require consistency between different data-structures sharing the same field name as of 9.0?

Adrien Grand Tue, 21 Apr 2020 03:51:26 -0700

Thanks all, the initial feedback sounds positive so I opened
https://issues.apache.org/jira/browse/LUCENE-9334. As Mike pointed out,
this could be a first step towards having something that looks more like a
proper schema.
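
To make the rule concrete, here is a minimal sketch in plain Java of the
per-field check described in my original message quoted below. It is not
Lucene code and not what the JIRA issue implements: FieldConsistencyChecker,
Structure, check() and hasPointsAndDocValues() are hypothetical names used
only for illustration. It assumes the rule is "every document that has a
value for a field must use the same set of data-structures as the first
document that used that field", and it only looks at which structures are
used, not at the values.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the proposed rule; none of these names exist in Lucene. */
class FieldConsistencyChecker {

    /** The data-structures a field may be indexed into (illustrative subset). */
    enum Structure { POINTS, NUMERIC_DOC_VALUES, SORTED_NUMERIC_DOC_VALUES }

    // Structures recorded for each field name by the first document that used it.
    private final Map<String, Set<Structure>> seen = new HashMap<>();

    /**
     * Records the structures one document uses for a field.
     * An empty set models a document with a missing value and is always accepted.
     */
    void check(String field, Set<Structure> used) {
        if (used.isEmpty()) {
            return;
        }
        Set<Structure> previous = seen.putIfAbsent(field, EnumSet.copyOf(used));
        if (previous != null && !previous.equals(used)) {
            throw new IllegalArgumentException(
                "Field '" + field + "' was indexed with " + previous
                    + " but this document uses " + used);
        }
    }

    /** True if a query factory could assume points and doc values hold the same data. */
    boolean hasPointsAndDocValues(String field) {
        Set<Structure> s = seen.get(field);
        return s != null
            && s.contains(Structure.POINTS)
            && (s.contains(Structure.NUMERIC_DOC_VALUES)
                || s.contains(Structure.SORTED_NUMERIC_DOC_VALUES));
    }

    public static void main(String[] args) {
        FieldConsistencyChecker checker = new FieldConsistencyChecker();
        checker.check("price", EnumSet.of(Structure.POINTS, Structure.NUMERIC_DOC_VALUES));
        checker.check("price", EnumSet.noneOf(Structure.class)); // missing value: accepted
        System.out.println(checker.hasPointsAndDocValues("price"));
        try {
            checker.check("price", EnumSet.of(Structure.POINTS)); // points only: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch, a range-query factory would consult something like
hasPointsAndDocValues() before deciding whether it can combine the point
query and the doc-value query on the user's behalf.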


I like the idea of exposing higher-level abstractions, like a field that
would index as points, add doc values (but which ones? Numeric or
SortedNumeric?) and maybe stored fields all at once.

On Mon, Apr 20, 2020 at 6:12 PM David Smiley <david.w.smi...@gmail.com>
wrote:

> What Alan says makes sense to me -- simple and sufficiently addresses the
> pain point Adrien raises.  I don't think we need a long class name which is
> also some extra class in addition to the current one (confusing); I'd
> prefer one simple numeric class name (per int/long/float/double) with a
> constructor that indicates indexed/docValues.
>
> I also agree with Michael Sokolov; it's crazy Lucene FieldInfos doesn't
> have some basic numeric type metadata.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Apr 20, 2020 at 10:15 AM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> Could we use this as a stepping stone towards a schema? Just a very
>> lightweight schema that only enforces what we can easily enforce
>> today, but put some minimal abstraction in place where we can hang
>> future consistency checks.
>>
>> Re: value consistency; could we do a best-effort enforcement in
>> DefaultIndexingChain where we have the values unencoded?
>>
>> On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <romseyg...@gmail.com>
>> wrote:
>> >
>> > One way of doing this might be to add an additional field type that
>> adds both point and docvalues, and then have factory methods for queries
>> and sorts on the field type.  So for example a LongPointAndValue field
>> would automatically index its value into both BKD and NumericDocValues, and
>> then LongPointAndValue#newRangeQuery() would build the relevant
>> IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a
>> sort field that can use the shortcuts.
>> >
>> > On 20 Apr 2020, at 14:10, Adrien Grand <jpou...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > Lucene currently doesn't require consistency across data-structures.
>> For instance it is possible to have different values in points and doc
>> values under the same field name. Until now, we worked around it either by
>> making features use a single data-structure, e.g. facets only use doc
>> values, or by pushing the responsibility of having consistent data across
>> data-structures to the user, e.g. IndexOrDocValuesQuery requires that the
>> point query and the doc-value query match the same documents and it's the
>> responsibility of the user to ensure this.
>> >
>> > I'm unhappy that it makes Lucene very hard to use. Creating an
>> efficient range query should be a one-liner, but due to this limitation,
>> users have to first learn about LongPoint#newRangeQuery,
>> NumericDocValuesField#newSlowRangeQuery and then combine them with
>> IndexOrDocValuesQuery or maybe even
>> IndexSortSortedNumericDocValuesRangeQuery. If Lucene had a requirement that
>> if a field has both points and numeric doc values then both data-structures
>> contain the same content, then we could automatically use the
>> IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing
>> that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
>> >
>> > This question is being raised again as we are working on dynamically
>> pruning uncompetitive hits when sorting by field by leveraging the points
>> index.[1] This can produce very significant speedups but again requires
>> that the same data be indexed in points and doc values.
>> >
>> > [1] https://github.com/apache/lucene-solr/pull/1351
>> >
>> > We had discussions about adding a notion of schema to Lucene in the
>> past, see e.g. [2]. This seems desirable to me but also a high hanging
>> fruit and possibly controversial, so my short term proposal would instead
>> be to:
>> >  - Require documents to be consistent in the data-structures that they
>> use: you can't have one document using only points for a field and
>> another document using only doc values for the same field. Of course it
>> would still be possible to index documents that have neither points nor doc
>> values for that field, even if previous documents had either enabled, in
>> order to handle documents with missing values properly.
>> >  - Don't hesitate to rely on consistency across fields when
>> implementing new functionality, i.e. LongPoint#newRangeQuery would check
>> whether the FieldInfo has numeric doc values, and if so would automatically
>> enable the IndexOrDocValuesQuery and
>> IndexSortSortedNumericDocValuesRangeQuery optimizations.
>> >
>> > [2] https://issues.apache.org/jira/browse/LUCENE-6005
>> >
>> > Checking that documents have the same values sounds desirable to me but
>> also challenging due to how we sometimes encode data on top of the Lucene
>> APIs, e.g. longs become byte[] in the points index, geo points become a
>> single long in doc values, and we have a few use-cases where we encode
>> multiple values into a single BinaryDocValuesField in Elasticsearch to work
>> around the absence of multi-value binary doc values support. I think it'd
>> be acceptable to not validate values but still expect consistency in our
>> search APIs?
>> >
>> > What do you think?
>> >
>> > --
>> > Adrien
>> >
>> >
>>

-- 
Adrien
