What Alan says makes sense to me: it's simple and sufficiently addresses the pain point Adrien raises. I don't think we need a long class name for yet another class on top of the current ones (confusing); I'd prefer one simple numeric class name (per int/long/float/double) with a constructor that indicates indexed/docValues.
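To make that concrete, here is a rough sketch of the shape I have in mind. It is purely hypothetical: LongFieldSketch, targets() and rangeQueryStrategy() are made-up names, not existing Lucene classes or methods, and the strings only stand in for the real data structures and queries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT a Lucene class: one numeric field type whose
// constructor flags say which data structures receive the value.
final class LongFieldSketch {
    private final String name;
    private final long value;
    private final boolean indexed;   // value would go into the points (BKD) index
    private final boolean docValues; // value would go into numeric doc values

    LongFieldSketch(String name, long value, boolean indexed, boolean docValues) {
        if (!indexed && !docValues) {
            throw new IllegalArgumentException(
                "field \"" + name + "\" must be indexed, have doc values, or both");
        }
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.docValues = docValues;
    }

    String name() { return name; }

    long value() { return value; }

    // Lists the structures this field would populate. Both are fed from the
    // same single value, so they cannot disagree for a given document.
    List<String> targets() {
        List<String> out = new ArrayList<>();
        if (indexed) out.add("points");
        if (docValues) out.add("docValues");
        return out;
    }

    // Names the range-query strategy a factory method could choose from the
    // flags alone (assumed behaviour, for illustration only).
    String rangeQueryStrategy() {
        if (indexed && docValues) return "IndexOrDocValuesQuery";
        return indexed ? "points" : "docValues";
    }
}
```

With something like this, new LongFieldSketch("price", 42L, true, true) would feed both structures from one value, and the caller never has to pick a second class.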
I also agree with Michael Sokolov; it's crazy that Lucene's FieldInfos doesn't have some basic numeric type metadata.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Apr 20, 2020 at 10:15 AM Michael Sokolov <msoko...@gmail.com> wrote:

> Could we use this as a stepping stone towards a schema? Just a very lightweight schema that only enforces what we can easily enforce today, but put some minimal abstraction in place where we can hang future consistency checks.
>
> Re: value consistency; could we do a best-effort enforcement in DefaultIndexingChain where we have the values unencoded?
>
> On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <romseyg...@gmail.com> wrote:
> >
> > One way of doing this might be to add an additional field type that adds both point and docvalues, and then have factory methods for queries and sorts on the field type. So for example a LongPointAndValue field would automatically index its value into both BKD and NumericDocValues, and then LongPointAndValue#newRangeQuery() would build the relevant IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a sort field that can use the shortcuts.
> >
> > On 20 Apr 2020, at 14:10, Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Hello,
> >
> > Lucene currently doesn't require consistency across data-structures. For instance it is possible to have different values in points and doc values under the same field name. Until now, we worked around it either by making features use a single data-structure, e.g. facets only use doc values, or by pushing the responsibility of having consistent data across data-structures to the user, e.g. IndexOrDocValuesQuery requires that the point query and the doc-value query match the same documents and it's the responsibility of the user to ensure this.
> >
> > I'm unhappy that it makes Lucene very hard to use. Creating an efficient range query should be a one-liner, but due to this limitation, users have to first learn about LongPoint#newRangeQuery, NumericDocValuesField#newSlowRangeQuery and then combine them with IndexOrDocValuesQuery or maybe even IndexSortSortedNumericDocValuesRangeQuery. If Lucene had a requirement that if a field has both points and numeric doc values then both data-structures contain the same content, then we could automatically use the IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
> >
> > This question is being raised again as we are working on dynamically pruning uncompetitive hits when sorting by field by leveraging the points index.[1] This can produce very significant speedups but again requires that the same data be indexed in points and doc values.
> >
> > [1] https://github.com/apache/lucene-solr/pull/1351
> >
> > We had discussions about adding a notion of schema to Lucene in the past, see e.g. [2]. This seems desirable to me but also a high hanging fruit and possibly controversial, so my short term proposal would instead be to:
> > - Require documents to be consistent in the data-structures that they use: you can't have one document using only points for a field and another document using only doc values for the same field. Of course it would still be possible to index documents that have neither points nor doc values indexed, even if previous documents had either enabled, in order to handle documents with missing values properly.
> > - Don't hesitate to rely on consistency across fields when implementing new functionality, i.e. LongPoint#newRangeQuery would check whether the FieldInfo has numeric doc values, and if so would automatically enable the IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery optimizations.
> >
> > [2] https://issues.apache.org/jira/browse/LUCENE-6005
> >
> > Checking that documents have the same values sounds desirable to me but also challenging due to how we sometimes encode data on top of the Lucene APIs, e.g. longs become byte[] in the points index, geo points become a single long in doc values, and we have a few use-cases where we encode multiple values into a single BinaryDocValuesField in Elasticsearch to work around the absence of multi-value binary doc values support. I think it'd be acceptable to not validate values but still expect consistency in our search APIs?
> >
> > What do you think?
> >
> > --
> > Adrien