Re: Avoiding false-positives in multivalued field search with intervals?

Michael Sokolov Thu, 10 Sep 2020 13:41:18 -0700

A slightly different but related topic is how to manage lots of fields.

I agree that sub-fields are a pain and that mashing everything
together in an all-field is a mess, but for best performance with a
large number of fields/sub-fields, it is the only workable option I
can see. Expanding a query over numerous fields grows combinatorially
in the number of fields (if I want my query to match when all terms
match in *some* field), doesn't it?
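One possible way to put numbers on that question is the counting sketch below. It is not how any Lucene query parser expands queries; the function names and the sample terms and fields are made up for illustration. It compares enumerating every term-to-field assignment with a term-centric rewrite (one disjunction over fields per term).

```python
from itertools import product

def per_field_combinations(terms, fields):
    """Every way to assign each term to one field (one conjunction per assignment)."""
    return [tuple(zip(terms, choice)) for choice in product(fields, repeat=len(terms))]

def term_centric_clauses(terms, fields):
    """One disjunction over fields per term: AND over terms of OR over fields."""
    return [[(term, field) for field in fields] for term in terms]

# Hypothetical sample input.
terms = ["interval", "query", "parser"]
fields = ["title", "author", "body", "tags"]

combos = per_field_combinations(terms, fields)
clauses = term_centric_clauses(terms, fields)

print(len(combos))                   # 4 ** 3 = 64 conjunctions
print(sum(len(c) for c in clauses))  # 3 * 4 = 12 leaf clauses
```

Enumerating assignments grows as fields ** terms, while the term-centric form grows as terms * fields; which of the two applies depends on how the expansion is actually written.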

I would like to see a mechanism for defining sub-fields using
positions. Together with an absolute positional query this would
enable both match-any-field as well as field-specific matching with
each token indexed only once (multi-values are possible within this
with boundary tokens or big enough position ranges, as Alan
suggested). It does mean that the sub-field boundaries have to be
managed somehow. Without index support, you can set an arbitrarily large
size for your sub-field and insert position gaps at the boundaries,
but maybe we could detect the largest sub-field at flush time and
write that metadata somewhere in the index to enable smaller gaps?
Another issue is differing analysis for the sub-fields, and properly
updating the positions during analysis: at the boundaries you don't
want to insert a gap but rather advance to a fixed position, and you
have to index sub-fields in order. Maybe we could make it less horrible
by adding better support for it.
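To make the idea concrete, here is a toy sketch in plain Python (no Lucene involved) of sub-fields laid out as fixed position ranges inside a single field. Everything in it is an assumption for illustration: the names index_document, positions_in and smallest_safe_size, the range width of 1000, and the sample document.

```python
from collections import defaultdict

# Hypothetical fixed width of each sub-field's position range.
SUBFIELD_SIZE = 1000

def index_document(subfields, order, size=SUBFIELD_SIZE):
    """Index each token once, at an absolute position inside its sub-field's range."""
    postings = defaultdict(list)  # term -> ascending absolute positions
    for slot, name in enumerate(order):  # sub-fields are indexed in a fixed order
        tokens = subfields.get(name, [])
        if len(tokens) > size:
            raise ValueError(f"sub-field {name!r} overflows its position range")
        start = slot * size  # advance to a fixed position, not by a relative gap
        for offset, term in enumerate(tokens):
            postings[term].append(start + offset)
    return postings

def positions_in(postings, term, order, name=None, size=SUBFIELD_SIZE):
    """Positions of term anywhere (name=None) or only inside one sub-field."""
    hits = postings.get(term, [])
    if name is None:
        return list(hits)
    start = order.index(name) * size
    return [p for p in hits if start <= p < start + size]

def smallest_safe_size(docs):
    """Longest sub-field seen, i.e. the smallest range width that still fits."""
    return max((len(tokens) for doc in docs for tokens in doc.values()), default=0)

ORDER = ["title", "author", "body"]
doc = {
    "title": ["interval", "queries"],
    "body": ["lucene", "interval", "matching"],
}
postings = index_document(doc, ORDER)

print(positions_in(postings, "interval", ORDER))           # match in any sub-field
print(positions_in(postings, "interval", ORDER, "title"))  # match in title only
print(smallest_safe_size([doc]))                           # width needed for this doc
```

Match-any-field is a plain term lookup, while a field-specific match is the same lookup restricted to an absolute position range. The smallest_safe_size helper stands in for the "detect the largest sub-field at flush time" idea; a real implementation would have to store that value in the index.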

Re: query parsing: wasn't there at one time an interval query parser?
It had operators like w() and n(), IIRC.

On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <[email protected]> wrote:
>
> > Ok so the more general question is whether we need an interval query parser
>
> Oh, to this I'd say: yes, yes, yes.
>
> I didn't have much prior experience writing frontend apps on top of
> Solr/Lucene but once I did have
> to go that route it quickly turns out that several things that are
> readily available from code-level
> are so darn difficult to achieve and integrate from the outside. Specifically:
>
> - Field expansion in query parsers is a must (so that unqualified
> terms are expanded over multiple fields).
> Any query parser that doesn't support this is in my opinion of zero
> use. The "default" copy-to sink field known
> from Solr brings more problems than it solves.
>
> - Exact match-region hit highlighting is a strong expectation. I
> solved this with matches API (see LUCENE-9461)
> and flexible query parser's multifield expansion. Works like a charm.
>
> - Multivalued fields are common and sub-document handling is a pain.
> The problem I raised here is a result of
> direct user feedback. In real life multivalued fields are omnipresent
> and searches over those fields can be complex.
> Users see hits that just should not be there and are confused.
>
> - People do use complex queries. Maybe not all people but there are
> people out there who do... Just recently I extended
> flexible query parser with a handcrafted min-should-match operator
> because it is otherwise not accessible in any Lucene
> query parser (!). I can make this code available (it's not terribly
> complex), although, since you asked, I think a query parser that
> exposes all sorts of "higher level" functionality of intervals would
> be very, very useful.
>
> It may end up that I'll have to write something for intervals anyway
> so we can work on this together if you like.
> Especially the syntax is an open question - should it be
> operator-based (like the current boost or fuzzy operators) or
> meta-function-based (so that pseudo-functions would be available). Or
> maybe a mix of both? I don't know, really. :)
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

