Thanks all, the initial feedback sounds positive so I opened https://issues.apache.org/jira/browse/LUCENE-9334. As Mike pointed out, this could be a first step towards having something that looks more like a proper schema.
I like the idea of exposing higher-level abstractions, like a field that would index as points, add doc values (but which ones? Numeric or SortedNumeric?) and maybe stored fields all at once.

On Mon, Apr 20, 2020 at 6:12 PM David Smiley <david.w.smi...@gmail.com> wrote:

> What Alan says makes sense to me -- simple and sufficiently addresses the
> pain point Adrien raises. I don't think we need a long class name which is
> also some extra class in addition to the current one (confusing); I'd
> prefer one simple numeric class name (per int/long/float/double) with a
> constructor that indicates indexed/docValues.
>
> I also agree with Michael Sokolov; it's crazy Lucene FieldInfos doesn't
> have some basic numeric type metadata.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Apr 20, 2020 at 10:15 AM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> Could we use this as a stepping stone towards a schema? Just a very
>> lightweight schema that only enforces what we can easily enforce
>> today, but put some minimal abstraction in place where we can hang
>> future consistency checks.
>>
>> Re: value consistency; could we do a best-effort enforcement in
>> DefaultIndexingChain where we have the values unencoded?
>>
>> On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <romseyg...@gmail.com>
>> wrote:
>> >
>> > One way of doing this might be to add an additional field type that adds both point and docvalues, and then have factory methods for queries and sorts on the field type. So for example a LongPointAndValue field would automatically index its value into both BKD and NumericDocValues, and then LongPointAndValue#newRangeQuery() would build the relevant IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a sort field that can use the shortcuts.
>> >
>> > On 20 Apr 2020, at 14:10, Adrien Grand <jpou...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > Lucene currently doesn't require consistency across data-structures. For instance, it is possible to have different values in points and doc values under the same field name. Until now, we worked around it either by making features use a single data-structure, e.g. facets only use doc values, or by pushing the responsibility of having consistent data across data-structures to the user, e.g. IndexOrDocValuesQuery requires that the point query and the doc-value query match the same documents and it's the responsibility of the user to ensure this.
>> >
>> > I'm unhappy that it makes Lucene very hard to use. Creating an efficient range query should be a one-liner, but due to this limitation, users have to first learn about LongPoint#newRangeQuery and NumericDocValuesField#newSlowRangeQuery, and then combine them with IndexOrDocValuesQuery or maybe even IndexSortSortedNumericDocValuesRangeQuery. If Lucene required that, whenever a field has both points and numeric doc values, both data-structures contain the same content, then we could automatically use the IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
>> >
>> > This question is being raised again as we are working on dynamically pruning uncompetitive hits when sorting by field by leveraging the points index.[1] This can produce very significant speedups but again requires that the same data be indexed in points and doc values.
>> >
>> > [1] https://github.com/apache/lucene-solr/pull/1351
>> >
>> > We had discussions about adding a notion of schema to Lucene in the past, see e.g. [2].
>> > This seems desirable to me but also a high-hanging fruit and possibly controversial, so my short-term proposal would instead be to:
>> >  - Require documents to be consistent in the data-structures that they use: you can't have one document using only points on a field and another document using only doc values on the same field. Of course it would still be possible to index documents that have neither points nor doc values indexed, even if previous documents had either enabled, in order to handle documents with missing values properly.
>> >  - Don't hesitate to rely on consistency across fields when implementing new functionality, i.e. LongPoint#newRangeQuery would check whether the FieldInfo has numeric doc values, and if so would automatically enable the IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery optimizations.
>> >
>> > [2] https://issues.apache.org/jira/browse/LUCENE-6005
>> >
>> > Checking that documents have the same values sounds desirable to me but also challenging due to how we sometimes encode data on top of the Lucene APIs, e.g. longs become byte[] in the points index, geo points become a single long in doc values, and we have a few use-cases where we encode multiple values into a single BinaryDocValuesField in Elasticsearch to work around the absence of multi-value binary doc values support. I think it'd be acceptable to not validate values but still expect consistency in our search APIs?
>> >
>> > What do you think?
>> >
>> > --
>> > Adrien
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

--
Adrien
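
To make the first bullet of the short-term proposal concrete, here is a rough, self-contained sketch of what "consistent in the data-structures that they use" could mean per field. This is only a toy model: FieldConsistencyChecker, Structure, check and hasPointsAndDocValues are hypothetical names and not Lucene APIs, and nothing here reflects how LUCENE-9334 is or will be implemented.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: every name below is hypothetical and NOT a Lucene API.
final class FieldConsistencyChecker {

    // The data-structures a document may use for a given field.
    enum Structure { POINTS, NUMERIC_DOC_VALUES, SORTED_NUMERIC_DOC_VALUES }

    // First non-empty set of data-structures seen for each field name.
    private final Map<String, Set<Structure>> seen = new HashMap<>();

    // Records how one document uses a field and rejects it if a previous
    // document used a different set of data-structures for that field.
    void check(String field, Set<Structure> used) {
        if (used.isEmpty()) {
            return; // a document with a missing value is always accepted
        }
        Set<Structure> expected = seen.putIfAbsent(field, EnumSet.copyOf(used));
        if (expected != null && !expected.equals(used)) {
            throw new IllegalArgumentException("field \"" + field + "\" was indexed with "
                + expected + " but this document uses " + used);
        }
    }

    // True if the field has both points and some flavor of numeric doc values,
    // i.e. the case where a range query could pick its execution path itself.
    boolean hasPointsAndDocValues(String field) {
        Set<Structure> s = seen.get(field);
        return s != null
            && s.contains(Structure.POINTS)
            && (s.contains(Structure.NUMERIC_DOC_VALUES)
                || s.contains(Structure.SORTED_NUMERIC_DOC_VALUES));
    }

    public static void main(String[] args) {
        FieldConsistencyChecker checker = new FieldConsistencyChecker();
        checker.check("timestamp",
            EnumSet.of(Structure.POINTS, Structure.SORTED_NUMERIC_DOC_VALUES));
        checker.check("timestamp", EnumSet.noneOf(Structure.class)); // missing value
        System.out.println(checker.hasPointsAndDocValues("timestamp"));
        try {
            checker.check("timestamp", EnumSet.of(Structure.POINTS)); // points only
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that the sketch only tracks which data-structures a field uses, not the values stored in them, which matches the suggestion above to not validate values but still expect consistency in the search APIs.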