One way of doing this might be to add a field type that indexes both points 
and doc values, and then provide factory methods for queries and sorts on that 
field type.  So for example a LongPointAndValue field would automatically index 
its value into both the BKD tree and NumericDocValues, 
LongPointAndValue#newRangeQuery() would build the relevant 
IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a sort 
field that can use the shortcuts.
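
Very roughly, something along these lines (just a sketch: nothing called 
LongPointAndValue exists in Lucene today, and the names and signatures are 
only illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SortField;

// Hypothetical helper: pairs the existing LongPoint and NumericDocValuesField
// APIs so that both data-structures always hold the same content.
public final class LongPointAndValue {

  private LongPointAndValue() {}

  // Index the value into both the points index (BKD) and NumericDocValues.
  public static void add(Document doc, String field, long value) {
    doc.add(new LongPoint(field, value));
    doc.add(new NumericDocValuesField(field, value));
  }

  // Build the point query and the doc-values query over the same data and
  // let IndexOrDocValuesQuery pick the cheaper one per segment.
  public static Query newRangeQuery(String field, long lower, long upper) {
    return new IndexOrDocValuesQuery(
        LongPoint.newRangeQuery(field, lower, upper),
        NumericDocValuesField.newSlowRangeQuery(field, lower, upper));
  }

  // Sort on the doc values; because the points index is guaranteed to hold
  // the same data, sort optimizations that lean on points remain safe.
  public static SortField newSortField(String field, boolean reverse) {
    return new SortField(field, SortField.Type.LONG, reverse);
  }
}

The nice property is that users only ever see one field type and its factory 
methods, and consistency between the two data-structures falls out for free.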

> On 20 Apr 2020, at 14:10, Adrien Grand <jpou...@gmail.com> wrote:
> 
> Hello,
> 
> Lucene currently doesn't require consistency across data-structures. For 
> instance it is possible to have different values in points and doc values 
> under the same field name. Until now, we worked around it either by making 
> features use a single data-structure, e.g. facets only use doc values, or by 
> pushing the responsibility for keeping data consistent across data-structures 
> to the user, e.g. IndexOrDocValuesQuery requires that the point query and the 
> doc-values query match the same documents, and it is up to the user to ensure 
> this.
> 
> I'm unhappy that this makes Lucene very hard to use. Creating an efficient 
> range query should be a one-liner, but due to this limitation, users have to 
> first learn about LongPoint#newRangeQuery and 
> NumericDocValuesField#newSlowRangeQuery, and then combine them with 
> IndexOrDocValuesQuery or maybe even 
> IndexSortSortedNumericDocValuesRangeQuery. If Lucene required that, when a 
> field has both points and numeric doc values, both data-structures contain 
> the same content, then we could automatically use the IndexOrDocValuesQuery 
> optimization in LongPoint#newRangeQuery when noticing that the field also 
> has doc values of type NUMERIC or SORTED_NUMERIC.
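
For reference, the combination described above currently looks roughly like 
this (a sketch, assuming a hypothetical long field "price" that was indexed 
both as a LongPoint and as a NumericDocValuesField with the same value):

  // field name and bounds are made up for illustration
  Query query = new IndexOrDocValuesQuery(
      LongPoint.newRangeQuery("price", 10L, 100L),
      NumericDocValuesField.newSlowRangeQuery("price", 10L, 100L));

This is exactly the boilerplate that a combined field type could hide.
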
> 
> This question is being raised again as we are working on dynamically pruning 
> uncompetitive hits when sorting by field by leveraging the points index.[1] 
> This can produce very significant speedups but again requires that the same 
> data be indexed in points and doc values.
> 
> [1] https://github.com/apache/lucene-solr/pull/1351
> 
> We had discussions about adding a notion of schema to Lucene in the past, see 
> e.g. [2]. This seems desirable to me but also high-hanging fruit and possibly 
> controversial, so my short-term proposal would instead be to:
>  - Require documents to be consistent in the data-structures that they use: 
> you can't have one document using only points and another document using 
> only doc values for the same field. Of course it would still be possible to 
> index documents that have neither points nor doc values, even if previous 
> documents had either enabled, in order to handle documents with missing 
> values properly.
>  - Don't hesitate to rely on this consistency across data-structures when 
> implementing new functionality, i.e. LongPoint#newRangeQuery would check 
> whether the FieldInfo has numeric doc values, and if so would automatically 
> enable the IndexOrDocValuesQuery and 
> IndexSortSortedNumericDocValuesRangeQuery optimizations.
> 
> [2] https://issues.apache.org/jira/browse/LUCENE-6005
> 
> Checking that documents have the same values sounds desirable to me but also 
> challenging due to how we sometimes encode data on top of the Lucene APIs, 
> e.g. longs become byte[] in the points index, geo points become a single long 
> in doc values, and we have a few use cases where we encode multiple values 
> into a single BinaryDocValuesField in Elasticsearch to work around the 
> absence of multi-valued binary doc values support. I think it would be 
> acceptable not to validate values but still expect consistency in our search 
> APIs?
> 
> What do you think?
> 
> -- 
> Adrien
