A slightly different but related topic is how to manage lots of fields. I agree that sub-fields are a pain and that mashing everything together in an all-field is a mess, but for best performance with a large number of fields/sub-fields, it is the only workable option I can see. Expanding a query over numerous fields grows combinatorially in the number of fields (if I want my query to match when all terms match in *some* field), doesn't it?

I would like to see a mechanism for defining sub-fields using positions. Together with an absolute positional query, this would enable both match-any-field and field-specific matching with each token indexed only once (multi-values are possible within this, with boundary tokens or big enough position ranges, as Alan suggested). It does mean that the sub-field boundaries have to be managed somehow. Without index support, you can set an arbitrarily large size for your sub-field and insert position gaps at the boundaries, but maybe we could detect the largest sub-field at flush time and write that metadata somewhere in the index to enable smaller gaps?

Another issue is differing analysis for the sub-fields, and properly updating the positions during analysis: at the boundaries you don't want to insert a gap but rather advance to a fixed position, and you have to index sub-fields in order. Maybe we could make it less horrible by adding better support for it.

Re: query parsing: wasn't there at one time an interval query parser? It had operators like w() and n(), IIRC.

On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
> > Ok so the more general question is whether we need an interval query parser
>
> Oh, to this I'd say: yes, yes, yes.
>
> I didn't have much prior experience writing frontend apps on top of
> Solr/Lucene but once I did have to go that route it quickly turns out
> that several things that are readily available from code-level are so
> darn difficult to achieve and integrate from the outside. Specifically:
>
> - Field expansion in query parsers is a must (so that unqualified
> terms are expanded over multiple fields). Any query parser that
> doesn't support this is in my opinion of zero use. The "default"
> copy-to sink field known from Solr brings more problems than it solves.
>
> - Exact match-region hit highlighting is a strong expectation. I
> solved this with matches API (see LUCENE-9461) and flexible query
> parser's multifield expansion. Works like a charm.
>
> - Multivalued fields are common and sub-document handling is a pain.
> The problem I raised here is a result of direct user feedback. In real
> life multivalued fields are omnipresent and searches over those fields
> can be complex. Users see hits that just should not be there and are
> confused.
>
> - People do use complex queries. Maybe not all people but there are
> people out there who do... Just recently I extended flexible query
> parser with a handcrafted min-should-match operator because it is
> otherwise not accessible in any Lucene query parser (!). I can make
> this code available (it's not terribly complex), although, since you
> asked, I think a query parser that exposes all sorts of "higher level"
> functionality of intervals would be very, very useful.
>
> It may end up that I'll have to write something for intervals anyway
> so we can work on this together if you like. Especially the syntax is
> an open question - should it be operator-based (like the current boost
> or fuzzy operators) or meta-function-based (so that pseudo-functions
> would be available). Or maybe a mix of both? I don't know, really. :)
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
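
To make the position-based sub-field idea a bit more concrete, here is a rough sketch in plain Python. It is a toy model, not Lucene code: the names SLOT_SIZE, SUBFIELDS, index_document, matches_any_field and matches_in_field are all made up for illustration, and whitespace splitting stands in for real (per-sub-field) analysis. It assumes a fixed slot width that no sub-field ever exceeds, which is exactly the "arbitrarily large size" workaround you need without index support.

```python
from collections import defaultdict

# Hypothetical slot width: every sub-field owns this many positions.
# Assumption: no sub-field value ever produces more tokens than this.
SLOT_SIZE = 1000

# Hypothetical sub-field order; sub-fields have to be indexed in this order.
SUBFIELDS = ["title", "author", "body"]


def index_document(doc):
    """Return a toy postings map (term -> absolute positions) for one document."""
    postings = defaultdict(list)
    for slot, name in enumerate(SUBFIELDS):
        tokens = doc.get(name, "").lower().split()
        if len(tokens) > SLOT_SIZE:
            raise ValueError(f"sub-field {name!r} overflows its slot")
        # Advance to a fixed start position instead of adding a relative gap.
        base = slot * SLOT_SIZE
        for offset, token in enumerate(tokens):
            postings[token].append(base + offset)
    return dict(postings)


def matches_any_field(postings, term):
    """Match-any-field: the term occurs at any position at all."""
    return term in postings


def matches_in_field(postings, term, field):
    """Field-specific match: the term occurs inside that sub-field's slot."""
    slot = SUBFIELDS.index(field)
    lo, hi = slot * SLOT_SIZE, (slot + 1) * SLOT_SIZE
    return any(lo <= pos < hi for pos in postings.get(term, []))


doc = {
    "title": "Interval queries",
    "author": "Alan",
    "body": "positions and intervals in Lucene",
}
postings = index_document(doc)
```

Each token is stored once, and the same postings answer both kinds of question: matches_any_field(postings, "lucene") ignores positions, while matches_in_field(postings, "lucene", "body") restricts to the absolute range owned by "body". Recording the largest sub-field at flush time would, in this picture, just mean choosing a smaller SLOT_SIZE per segment.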