Re: Require consistency between different data-structures sharing the same field name as of 9.0?

David Smiley Mon, 20 Apr 2020 09:12:34 -0700

What Alan says makes sense to me -- simple and sufficiently addresses the
pain point Adrien raises.  I don't think we need a long class name which is
also some extra class in addition to the current one (confusing); I'd
prefer one simple numeric class name (per int/long/float/double) with a
constructor that indicates indexed/docValues.
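
A rough sketch of that shape, for illustration only. The class name
SketchLongField, the enum, and the two boolean flags are all hypothetical;
none of them is an existing Lucene API, and the code only models which
data structures a field would be written to.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Data structures a numeric value could be written to (illustrative names). */
enum NumericStructure { POINTS, DOC_VALUES }

/** Hypothetical single field class for long values; not an existing Lucene class. */
final class SketchLongField {
    final String name;
    final long value;
    final Set<NumericStructure> structures;

    /**
     * @param indexed   whether the value would go into the points index
     * @param docValues whether the value would go into numeric doc values
     */
    SketchLongField(String name, long value, boolean indexed, boolean docValues) {
        if (!indexed && !docValues) {
            throw new IllegalArgumentException(
                "field \"" + name + "\" must be indexed, have doc values, or both");
        }
        this.name = name;
        this.value = value;
        EnumSet<NumericStructure> s = EnumSet.noneOf(NumericStructure.class);
        if (indexed) {
            s.add(NumericStructure.POINTS);
        }
        if (docValues) {
            s.add(NumericStructure.DOC_VALUES);
        }
        this.structures = Collections.unmodifiableSet(s);
    }
}
```

With both flags set, one field instance carries the same value for both
structures, so the two cannot drift apart.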


I also agree with Michael Sokolov; it's crazy Lucene FieldInfos doesn't
have some basic numeric type metadata.
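
To make the idea concrete, here is a minimal sketch of per-field metadata
combined with the lightweight consistency check discussed in the quoted
messages below. Everything in it (SketchSchema, SketchFieldMeta, the enums)
is a hypothetical name chosen for illustration, not how FieldInfos works
today. It assumes the rule Adrien describes: a field must keep using the
same data structures across documents, while documents with a missing
value are always accepted.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative numeric types; not an existing Lucene enum. */
enum SketchNumericType { INT, LONG, FLOAT, DOUBLE }

/** Illustrative data structures; not an existing Lucene enum. */
enum SketchStructure { POINTS, DOC_VALUES }

/** Hypothetical per-field metadata, recorded the first time a field carries a value. */
final class SketchFieldMeta {
    final SketchNumericType type;
    final Set<SketchStructure> structures;

    SketchFieldMeta(SketchNumericType type, Set<SketchStructure> structures) {
        this.type = type;
        // Defensive copy; the caller guarantees the set is non-empty.
        this.structures = EnumSet.copyOf(structures);
    }
}

/** Hypothetical lightweight schema: checks only type and data-structure consistency. */
final class SketchSchema {
    private final Map<String, SketchFieldMeta> fields = new HashMap<>();

    /** Validates one field of one document against what earlier documents used. */
    void check(String field, SketchNumericType type, Set<SketchStructure> used) {
        if (used.isEmpty()) {
            return; // missing value: always allowed
        }
        SketchFieldMeta known = fields.get(field);
        if (known == null) {
            fields.put(field, new SketchFieldMeta(type, used));
            return;
        }
        if (known.type != type) {
            throw new IllegalArgumentException(
                "field \"" + field + "\" was " + known.type + ", got " + type);
        }
        if (!known.structures.equals(used)) {
            throw new IllegalArgumentException(
                "field \"" + field + "\" used " + known.structures + ", got " + used);
        }
    }
}
```

Note that this checks only which structures are used, not that the values
in them agree, which matches the "don't validate values" compromise in the
original proposal.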

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Apr 20, 2020 at 10:15 AM Michael Sokolov <msoko...@gmail.com> wrote:

> Could we use this as a stepping stone towards a schema? Just a very
> lightweight schema that only enforces what we can easily enforce
> today, but put some minimal abstraction in place where we can hang
> future consistency checks.
>
> Re: value consistency; could we do a best-effort enforcement in
> DefaultIndexingChain where we have the values unencoded?
>
> On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <romseyg...@gmail.com>
> wrote:
> >
> > One way of doing this might be to add an additional field type that adds
> > both point and docvalues, and then have factory methods for queries and
> > sorts on the field type.  So for example a LongPointAndValue field would
> > automatically index its value into both BKD and NumericDocValues, and then
> > LongPointAndValue#newRangeQuery() would build the relevant
> > IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a
> > sort field that can use the shortcuts.
> >
> > On 20 Apr 2020, at 14:10, Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Hello,
> >
> > Lucene currently doesn't require consistency across data-structures. For
> > instance it is possible to have different values in points and doc values
> > under the same field name. Until now, we worked around it either by making
> > features use a single data-structure, e.g. facets only use doc values, or
> > by pushing the responsibility of having consistent data across
> > data-structures to the user, e.g. IndexOrDocValuesQuery requires that the
> > point query and the doc-value query match the same documents and it's the
> > responsibility of the user to ensure this.
> >
> > I'm unhappy that it makes Lucene very hard to use. Creating an efficient
> > range query should be a one-liner, but due to this limitation, users have
> > to first learn about LongPoint#newRangeQuery,
> > NumericDocValuesField#newSlowRangeQuery and then combine them with
> > IndexOrDocValuesQuery or maybe even
> > IndexSortSortedNumericDocValuesRangeQuery. If Lucene had a requirement that
> > if a field has both points and numeric doc values then both data-structures
> > contain the same content, then we could automatically use the
> > IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing
> > that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
> >
> > This question is being raised again as we are working on dynamically
> > pruning uncompetitive hits when sorting by field by leveraging the points
> > index.[1] This can produce very significant speedups but again requires
> > that the same data be indexed in points and doc values.
> >
> > [1] https://github.com/apache/lucene-solr/pull/1351
> >
> > We had discussions about adding a notion of schema to Lucene in the
> > past, see e.g. [2]. This seems desirable to me but also high-hanging
> > fruit and possibly controversial, so my short term proposal would instead
> > be to:
> >  - Require documents to be consistent in the data-structures that they
> > use: you can't have one document using only points on a field and
> > another document using only doc values on the same field. Of course it
> > would still be possible to index documents that have neither points nor doc
> > values indexed, even if previous documents had either enabled, in order to
> > handle documents with missing values properly.
> >  - Don't hesitate to rely on consistency across fields when implementing
> > new functionality, i.e. LongPoint#newRangeQuery would check whether the
> > FieldInfo has numeric doc values, and if so would automatically enable the
> > IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery
> > optimizations.
> >
> > [2] https://issues.apache.org/jira/browse/LUCENE-6005
> >
> > Checking that documents have the same values sounds desirable to me but
> > also challenging due to how we sometimes encode data on top of the Lucene
> > APIs, e.g. longs become byte[] in the points index, geo points become a
> > single long in doc values, and we have a few use-cases where we encode
> > multiple values into a single BinaryDocValuesField in Elasticsearch to work
> > around the absence of multi-value binary doc values support. I think it'd
> > be acceptable to not validate values but still expect consistency in our
> > search APIs?
> >
> > What do you think?
> >
> > --
> > Adrien
> >
> >
