Hello,

Lucene currently doesn't require consistency across data structures. For instance, it is possible to have different values in points and doc values under the same field name. Until now, we have worked around this either by making features use a single data structure, e.g. facets only use doc values, or by pushing the responsibility for keeping data consistent across data structures onto the user, e.g. IndexOrDocValuesQuery requires that the point query and the doc-value query match the same documents, and it is the user's responsibility to ensure this.

I'm unhappy that this makes Lucene very hard to use. Creating an efficient range query should be a one-liner, but due to this limitation, users first have to learn about LongPoint#newRangeQuery and NumericDocValuesField#newSlowRangeQuery, and then combine them with IndexOrDocValuesQuery or maybe even IndexSortSortedNumericDocValuesRangeQuery. If Lucene required that a field that has both points and numeric doc values contains the same content in both data structures, then LongPoint#newRangeQuery could automatically apply the IndexOrDocValuesQuery optimization when it notices that the field also has doc values of type NUMERIC or SORTED_NUMERIC.

This question is being raised again as we are working on dynamically pruning uncompetitive hits when sorting by field by leveraging the points index [1]. This can produce very significant speedups but again requires that the same data be indexed in points and doc values.

[1] https://github.com/apache/lucene-solr/pull/1351

We had discussions about adding a notion of schema to Lucene in the past, see e.g. [2]. This seems desirable to me but also high-hanging fruit and possibly controversial, so my short-term proposal would instead be to:

 - Require documents to be consistent in the data structures that they use: you can't have one document using only points for a given field and another document using only doc values for that same field. Of course it would still be possible to index documents that have neither points nor doc values indexed, even if previous documents had either enabled, in order to handle documents with missing values properly.
 - Don't hesitate to rely on consistency across data structures when implementing new functionality, i.e. LongPoint#newRangeQuery would check whether the FieldInfo has numeric doc values, and if so would automatically enable the IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery optimizations.

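To make the two points of the proposal concrete, here is a rough, self-contained sketch of the kind of check I have in mind. It is only a toy model: FieldConsistencySketch, Shape, addField and rangeQueryStrategy are hypothetical names made up for illustration, not Lucene classes or APIs, and the real check would presumably live wherever field infos are updated at index time.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the proposal. None of these types exist in Lucene. */
final class FieldConsistencySketch {

    /** Which data structures a document uses for a field (hypothetical). */
    enum Shape { NONE, POINTS_ONLY, DOC_VALUES_ONLY, POINTS_AND_DOC_VALUES }

    // First non-empty shape seen for each field name.
    private final Map<String, Shape> shapes = new HashMap<>();

    /** Records how one document uses a field; rejects a conflicting shape. */
    void addField(String field, Shape shape) {
        if (shape == Shape.NONE) {
            return; // missing value: always allowed
        }
        Shape previous = shapes.putIfAbsent(field, shape);
        if (previous != null && previous != shape) {
            throw new IllegalArgumentException(
                "field \"" + field + "\" was indexed as " + previous
                    + ", cannot also index it as " + shape);
        }
    }

    /** Names the strategy a range-query factory could pick for the field. */
    String rangeQueryStrategy(String field) {
        Shape shape = shapes.get(field);
        if (shape == null) {
            throw new IllegalStateException("unknown field: " + field);
        }
        switch (shape) {
            case POINTS_AND_DOC_VALUES:
                return "points-or-doc-values"; // both structures can be trusted
            case POINTS_ONLY:
                return "points-only";
            default:
                return "doc-values-only";
        }
    }

    public static void main(String[] args) {
        FieldConsistencySketch schema = new FieldConsistencySketch();
        schema.addField("price", Shape.POINTS_AND_DOC_VALUES);
        schema.addField("price", Shape.NONE); // document without a price
        System.out.println(schema.rangeQueryStrategy("price"));
        try {
            schema.addField("price", Shape.POINTS_ONLY); // inconsistent
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that the sketch only compares which data structures are used, not the values stored in them.
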
[2] https://issues.apache.org/jira/browse/LUCENE-6005

Checking that documents have the same values sounds desirable to me but also challenging due to how we sometimes encode data on top of the Lucene APIs, e.g. longs become byte[] in the points index, geo points become a single long in doc values, and we have a few use cases where we encode multiple values into a single BinaryDocValuesField in Elasticsearch to work around the absence of multi-valued binary doc values support. I think it would be acceptable to not validate values but still expect consistency in our search APIs?

What do you think?

-- Adrien