Oh, and also, I like the idea of making index sorting parent/child aware!

On Thu, Sep 2, 2021 at 7:45 AM Michael Sokolov <[email protected]> wrote:
>
> Yes, I am also supportive of the idea of having a schema that is
> enforced, and I like what it enables us to do. I just wonder if we
> could relax the enforcement around IndexOptions.NONE (and
> DocValuesType.NONE). Would it make sense to enable NONE to be "equal
> to" any other IndexOptions, so that, e.g., if you index a field with
> IndexOptions.DOCS_AND_TERMS then every document must have either
> DOCS_AND_TERMS or NONE? In the case where a field is *only* indexed
> as terms, and has no docvalues, this is already allowed. But if you
> index a field as both docvalues and terms, then it is not (currently),
> which seems weird. I guess the same is true of a field that has no
> docvalues on some docs, and has them on others, but is also indexed as
> terms everywhere. I think that ought to be allowed (since you can have
> a sparse docvalues field that is not indexed with terms).
>
> On Wed, Sep 1, 2021 at 12:24 PM Adrien Grand <[email protected]> wrote:
> >
> > This additional validation that we introduced in Lucene 9 feels like a
> > natural extension of the validation that we already had before, such as the
> > fact that you cannot have some docs that use SORTED doc values and other
> > docs that use NUMERIC doc values on the same field. Actually I would have
> > liked to go further by enforcing that all data structures record the exact
> > same information, but this is challenging because IndexingChain
> > only has access to the encoded data, e.g. with IntPoint it only sees a
> > byte[] rather than the original integer, so we'd have to make assumptions
> > about how the data is encoded, which doesn't feel right.
> >
> > I do like this additional validation very much because I suspect that in
> > most cases where users get this error, it is because they made a mistake in
> > their indexing code. And this also helps make Lucene work better
> > out-of-the-box. For instance, thanks to this additional validation we
> > enabled dynamic pruning when sorting on numeric fields by default - this is
> > opt-in on 8.x since this optimization needs to look at both points and doc
> > values, so it's broken if not all documents have the same schema. And there
> > are other things we could do in the near future like rewriting
> > DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report
> > that docCount == maxDoc.
> >
> > In my opinion the correct solution for the problem you are facing would be
> > to have a way to make index sorting aware of the parent/child relationship
> > so that index sorting would read the sort key of the parent document
> > whenever it is on a child document, e.g. as done on LUCENE-5312. This way
> > you wouldn't have to duplicate this sort key from your parent documents to
> > your child documents, so you wouldn't have any schema issues.
> >
> > On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <[email protected]> wrote:
> >>
> >> While upgrading I ran afoul of some inconsistencies in our schema
> >> usage, and to fix them I've ended up having to add data to our index
> >> that I'd rather not. Let me give a little context: We have a
> >> parent/child document structure. Some fields are shared across parent
> >> and child docs, others are not. Our index has a sort key, and in order
> >> for all the parent/child docs to sort together correctly, we add the
> >> same (docvalues) fields that are part of the sort key to both parent
> >> and child docs. Some of these fields are *also* indexed as postings
> >> (StringField) of the same name, but we only index the postings field
> >> on the parent document, since child documents are never searched for
> >> on their own - always in conjunction with a parent.
> >>
> >> The schema-checking code we added in Lucene 9 does not allow this: it
> >> enforces that all documents having a field should have the same "index
> >> options", and failing to index the postings gets interpreted as having
> >> index options = NONE (because of the presence of the doc values field
> >> of the same name, I think?)
> >>
> >> Our current solution is to also index the postings for the child
> >> document (but just with an empty string value). This seems gross, and
> >> creates postings in the index that we will never use.
> >>
> >> Another possibility would be to rename the fields so that the postings
> >> and docvalues fields have different names. But in this case our
> >> application-level schema diverges from our Lucene schema, adding a
> >> layer of complexity we'd rather not introduce.
> >>
> >> Finally, could we relax this constraint, always allowing index
> >> options=NONE regardless of how other docs are indexed? Would it cause
> >> problems?
> >>
> >> -Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
> >
> > --
> > Adrien
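
To make the proposed relaxation concrete, here is a rough toy model of the two rules discussed in this thread. It is plain Python and deliberately does not use Lucene; the names `strict_compatible`, `relaxed_compatible`, and `first_violation` are made up for illustration, and the enum lists only a few index-option values. It only sketches the idea that NONE could be treated as compatible with any other setting, under the assumption that the check is applied per field, one document at a time.

```python
from enum import Enum


class IndexOptions(Enum):
    # Toy stand-in for a field's index options; not Lucene's actual class.
    NONE = 0
    DOCS = 1
    DOCS_AND_FREQS = 2


def strict_compatible(existing, incoming):
    """Strict rule: every document must use exactly the same options."""
    return existing == incoming


def relaxed_compatible(existing, incoming):
    """Proposed relaxation (hypothetical): NONE matches any other option."""
    return existing == incoming or IndexOptions.NONE in (existing, incoming)


def first_violation(per_doc_options, compatible):
    """Return the position of the first document whose options conflict
    with the field's schema so far, or None if all documents are accepted."""
    schema = None
    for position, options in enumerate(per_doc_options):
        if schema is None:
            schema = options
            continue
        if not compatible(schema, options):
            return position
        if schema is IndexOptions.NONE:
            # Under the relaxed rule, adopt the first non-NONE setting seen.
            schema = options
    return None


# Parent indexes postings for the field; child only carries doc values,
# which this model treats as index options NONE for that field.
parent_then_child = [IndexOptions.DOCS, IndexOptions.NONE]

print(first_violation(parent_then_child, strict_compatible))   # child rejected
print(first_violation(parent_then_child, relaxed_compatible))  # both accepted
```

Note that even under the relaxed rule, two different non-NONE settings on the same field would still conflict, so the stricter guarantees Adrien describes would only be weakened for documents that omit the postings entirely.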
