Re: Avoiding false-positives in multivalued field search with intervals?
Hi Chris,

> Because if you can adjust your parser syntax, this literally just
> becomes: ' field:"foo bar"~N ' ... where N is the positionIncrementGap
> on your analyzer ... OR ... ' field:"foo bar" ' ... if you call
> setPhraseSlop on your QueryParser.

Yes - correct. This would be equivalent to what others suggested with intervals (search for a fixed-length phrase and filter out false positives by leveraging position increments between values). I think the second solution is somewhat more flexible - index a sentinel token between values and ensure it's not part of the hit range. This allows you to use any type of interval query underneath, which is nice.

> So maybe the "solution" (at least for the flexible parser ... IIUC, I
> haven't used it much) would be for BooleanQueryNode to carry some metadata
> [...]

I haven't reached the phase of modifying the flexible query parser for my use case but it's definitely going to work something like you suggest. I think I'm going to rewrite the node tree from the syntax parser into interval queries though (either in full or in part). I'll see what can be done there.

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
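The sentinel-token idea mentioned in this message can be illustrated with a small plain-Java model. This is only a sketch of the logic, not Lucene code: the sentinel string `#sep#` and the class and method names are made up for illustration, and whitespace splitting stands in for a real analyzer. A hit is accepted only if some window covering both terms contains no sentinel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the sentinel-token idea; none of this is Lucene API. */
final class SentinelModel {
    /** Hypothetical boundary token indexed between consecutive values. */
    static final String SENTINEL = "#sep#";

    /** Flattens the values into one token stream, the way a multivalued field is indexed. */
    static List<String> flatten(List<String> values) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) tokens.add(SENTINEL);
            tokens.addAll(Arrays.asList(values.get(i).trim().split("\\s+")));
        }
        return tokens;
    }

    /** True if both terms occur with no sentinel between them, i.e. inside one value. */
    static boolean sameValueMatch(List<String> values, String a, String b) {
        boolean seenA = false, seenB = false; // reset whenever a sentinel is crossed
        for (String t : flatten(values)) {
            if (t.equals(SENTINEL)) { seenA = false; seenB = false; continue; }
            if (t.equals(a)) seenA = true;
            if (t.equals(b)) seenB = true;
            if (seenA && seenB) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(sameValueMatch(Arrays.asList("foo", "bar"), "foo", "bar"));
        System.out.println(sameValueMatch(Arrays.asList("foo bar", "baz"), "foo", "bar"));
    }
}
```

In this model the check does not depend on how long the individual values are, which is the flexibility argued for above.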
Re: Avoiding false-positives in multivalued field search with intervals?
(caveat: i don't ever really understand what Intervals are at the lucene feature set stage)

: Yup - similar to what Alan suggested. I'd have to rewrite the (general
: text-to-query) query parser to only use intervals though. Still
: thinking about possible approaches to this.
...
: > You could set a very high position increment gap for multi-valued
: > fields (Analyzer#getPositionIncrementGap) and perform something
: > like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?

I'm assuming from your response that the issue here is really that you want to *directly* support the syntax you mentioned...

: >> > > doc: field=["foo", "bar"]
: >> > > query: field:(foo AND bar)

...and identify *when* the parser encounters a "boolean" expression preceded by the "fieldName:" syntax, and *then* treat that specially.

ie: this seems 100% like a query parser question, and not at all like a "what does the query structure look like after parsing" question.

Because if you can adjust your parser syntax, this literally just becomes:

  ' field:"foo bar"~N ' ... where N is the positionIncrementGap on your analyzer

... OR ...

  ' field:"foo bar" ' ... if you call setPhraseSlop on your QueryParser.

i *THINK* the crux of your question/problem is that -- from the point of view of the QueryParserBase/BooleanQueryNodeBuilder, these 2 input strings are treated identically by the time any "subclass" has a chance to do anything interesting with them...

  field:(foo AND bar)
  field:foo AND field:bar

...so you can't, for instance, build an Interval / sloppy Phrase query from the first, while building a 2 clause boolean query from the second.

So maybe the "solution" (at least for the flexible parser ... IIUC, I haven't used it much) would be for BooleanQueryNode to carry some metadata indicating that there was a "fieldName:" prefix on it, so that the BooleanQueryNodeBuilder can choose to use that information to do something "special" if the "List clauses" are all simple TermNodes (in the same field) ?

-Hoss
http://www.lucidworks.com/
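The position arithmetic behind the gap-based suggestion quoted in this message can be sketched in plain Java. This is a model, not Lucene code. It assumes the first token of each later value lands at the previous value's last position + gap + 1, and that a window's width is end - start + 1. Both are assumptions of this sketch, as are the class and method names. The acceptance rule mirrors the quoted `maxWidth(unordered(...), pos_gap-1)` idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of position increment gaps; not Lucene code. */
final class GapModel {
    /** Positions of term, assuming each later value starts at lastPosition + gap + 1. */
    static List<Integer> positions(List<String> values, String term, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) pos += gap; // jump over the gap between two values
            for (String t : values.get(i).trim().split("\\s+")) {
                pos++;
                if (t.equals(term)) out.add(pos);
            }
        }
        return out;
    }

    /** Width (end - start + 1) of the narrowest window holding one a and one b, or -1 if none. */
    static int minWidth(List<String> values, String a, String b, int gap) {
        int best = -1;
        for (int pa : positions(values, a, gap)) {
            for (int pb : positions(values, b, gap)) {
                int w = Math.abs(pa - pb) + 1;
                if (best < 0 || w < best) best = w;
            }
        }
        return best;
    }

    /** Accepts only windows of width at most gap - 1. */
    static boolean matches(List<String> values, String a, String b, int gap) {
        int w = minWidth(values, a, b, gap);
        return w >= 0 && w <= gap - 1;
    }

    public static void main(String[] args) {
        System.out.println(matches(Arrays.asList("foo", "bar"), "foo", "bar", 100));
        System.out.println(matches(Arrays.asList("foo bar"), "foo", "bar", 100));
    }
}
```

Under these assumptions two terms in adjacent values are always at least gap + 1 positions apart, so a width limit below the gap rejects them. The same limit also rejects genuine hits inside a value that is longer than the gap, so the gap has to be chosen larger than any single value.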
Re: Avoiding false-positives in multivalued field search with intervals?
Thanks Michael.

The outcome of this discussion seems to be clear that everyone is trying to reinvent the wheel somehow. ;) I think it really should become part of core Lucene functionality. Seems like a corner case people are not aware of until they hit it (and then it's not clear what to do about it).

Dawid

On Mon, Sep 14, 2020 at 4:57 PM Michael Gibney wrote:
> [...]
Re: Avoiding false-positives in multivalued field search with intervals?
This might be a little outside the spirit of this discussion (in that it's not really "off-the-shelf") -- but I implemented a proof-of-concept for a different use case that I think could be adapted here:

For a given doc, for each term in your multivalued field, you could record a bitset representation of the indexes of the individual fields in which that term appears; then in conjunction DISI for different terms, intersect the bitset values for different terms to speed the determination of whether the terms appear in the same field. You could put the bitset representation, e.g., in the Payload for the first position of each term, or for more general-purpose use, in polyField/subfield DocValues, or whatever.

It seems like everyone's on the same page more-or-less, but I'll explicitly note: this feels superficially a little like a "special case", as it addresses only the "conjunction" case ... but for avoiding false-positives in the multivalued-field case, arguably the conjunction case *is* the general case.

Michael

On Mon, Sep 14, 2020 at 3:17 AM Dawid Weiss wrote:
> [...]
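The bitset intersection described in this message could look roughly like the following plain-Java sketch. It models only the per-document logic. Where the masks would live (payloads, DocValues) and how they would plug into a conjunction iterator are not shown. The class and method names are hypothetical, and a single `long` limits the model to 64 values per document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of per-term value bitsets for one document; not Lucene code. */
final class ValueBitsets {
    /** Maps each term to a mask whose bit i is set if the term occurs in value i. */
    static Map<String, Long> build(List<String> values) {
        if (values.size() > 64) {
            throw new IllegalArgumentException("this model supports at most 64 values");
        }
        Map<String, Long> bits = new HashMap<>();
        for (int i = 0; i < values.size(); i++) {
            for (String t : values.get(i).trim().split("\\s+")) {
                bits.merge(t, 1L << i, (x, y) -> x | y);
            }
        }
        return bits;
    }

    /** True if at least one value contains every term: the AND of all masks is non-zero. */
    static boolean allInSameValue(Map<String, Long> bits, String... terms) {
        long acc = -1L; // start with all bits set
        for (String t : terms) {
            acc &= bits.getOrDefault(t, 0L);
        }
        return acc != 0L;
    }

    public static void main(String[] args) {
        Map<String, Long> bits = build(Arrays.asList("foo", "bar"));
        System.out.println(allInSameValue(bits, "foo", "bar"));
    }
}
```

The intersection is a single AND per term, which is why it only speeds up the conjunction case discussed above.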
Re: Avoiding false-positives in multivalued field search with intervals?
bq. Expanding a query over numerous fields grows combinatorically in the number of fields (if I want my query to match when all terms match in *some* field), doesn't it?

I don't think it does? It grows linearly with the number of fields? In my experience the number of fields searchable "by default" is typically limited - it's not *all* fields - it's just a subset that constitutes the "text body" of a document. Of course everyone's experience will vary depending on the application.

> Re: query parsing; wasn't there at one time an interval query parser? It had
> operators like w() and n() IIRC

I've tried that but it's really unusable unless the queries are automated - the syntax is difficult to use; mistakes cause cryptic parse errors and are hard to recover from.

Dawid

On Thu, Sep 10, 2020 at 10:40 PM Michael Sokolov wrote:
> [...]
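The linear-growth claim in this message can be checked with a small plain-Java sketch that expands "all terms must match in the same field" over a list of fields. The field names, class name and the textual query form are illustrative assumptions only. The point is that the expansion has one group per field, so fields times terms clauses in total.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of multi-field expansion; builds query text only. */
final class FieldExpansion {
    /** Builds (f1:t1 AND f1:t2 ...) OR (f2:t1 AND f2:t2 ...) for the given fields and terms. */
    static String expand(List<String> fields, List<String> terms) {
        List<String> perField = new ArrayList<>();
        for (String f : fields) {
            List<String> clauses = new ArrayList<>();
            for (String t : terms) {
                clauses.add(f + ":" + t);
            }
            perField.add("(" + String.join(" AND ", clauses) + ")");
        }
        return String.join(" OR ", perField);
    }

    /** Number of term clauses in the expansion: fields x terms, linear in the field count. */
    static int clauseCount(int fields, int terms) {
        return fields * terms;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("title", "body"), Arrays.asList("foo", "bar")));
    }
}
```

Doubling the number of fields doubles the number of clauses; nothing in this form multiplies fields against each other.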
Re: Avoiding false-positives in multivalued field search with intervals?
You're thinking of SurroundQuery parser for span queries I think...
https://lucene.apache.org/solr/guide/8_6/other-parsers.html#surround-query-parser
and the Advanced Query Parser will have a similar syntax

On Thu, Sep 10, 2020 at 4:40 PM Michael Sokolov wrote:
> [...]

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
Re: Avoiding false-positives in multivalued field search with intervals?
A slightly different but related topic is how to manage lots of fields

I agree that sub-fields are a pain and that mashing everything together in an all-field is a mess, but for best performance with a large number of fields/sub-fields, it is the only workable option I can see? Expanding a query over numerous fields grows combinatorically in the number of fields (if I want my query to match when all terms match in *some* field), doesn't it?

I would like to see a mechanism for defining sub-fields using positions. Together with an absolute positional query this would enable both match-any-field as well as field-specific matching with each token indexed only once (multi-values are possible within this with boundary tokens or big enough position ranges, as Alan suggested). It does mean that the sub-field boundaries have to be managed somehow. Without index support, you can set an arbitrarily large size for your sub-field and insert position gaps at the boundaries, but maybe we could detect the largest sub-field at flush time and write that metadata somewhere in the index to enable smaller gaps? Another issue is differing analysis for the sub-fields, and properly updating the positions during analysis at the boundaries (you don't want to insert a gap, rather advance to a fixed position, and you have to index sub-fields in order). Maybe we could make it less horrible by adding better support for it.

Re: query parsing; wasn't there at one time an interval query parser? It had operators like w() and n() IIRC

On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss wrote:
> [...]
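The "sub-fields using positions" idea from this message can be sketched in plain Java as fixed-size position slots. This is a model under assumptions, not an existing Lucene feature: the slot size, class and method names are hypothetical. Each sub-field advances to a fixed start position instead of adding a gap, which is why sub-fields must be indexed in order and must fit their slot.

```java
/** Plain-Java model of sub-fields laid out in fixed-size position slots; not Lucene code. */
final class SubFieldPositions {
    /** Hypothetical number of positions reserved for every sub-field. */
    final int slotSize;

    SubFieldPositions(int slotSize) {
        if (slotSize <= 0) {
            throw new IllegalArgumentException("slotSize must be positive");
        }
        this.slotSize = slotSize;
    }

    /** Absolute position of the token at the given offset inside the given sub-field. */
    int absolute(int subField, int offset) {
        if (offset < 0 || offset >= slotSize) {
            throw new IllegalArgumentException("token does not fit in its sub-field slot");
        }
        return subField * slotSize + offset;
    }

    /** Sub-field that an absolute position belongs to. */
    int subFieldOf(int position) {
        return position / slotSize;
    }

    /** True if both absolute positions fall inside the same sub-field. */
    boolean sameSubField(int p1, int p2) {
        return subFieldOf(p1) == subFieldOf(p2);
    }

    public static void main(String[] args) {
        SubFieldPositions layout = new SubFieldPositions(1000);
        System.out.println(layout.absolute(2, 5));
    }
}
```

A field-specific query would restrict matches to one slot's position range, while a match-any-field query would ignore the slot and only require `sameSubField` for the matching terms. Picking the slot size is the part that the flush-time metadata suggested above would make less arbitrary.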
Re: Avoiding false-positives in multivalued field search with intervals?
> Ok so the more general question is whether we need an interval query parser

Oh, to this I'd say: yes, yes, yes.

I didn't have much prior experience writing frontend apps on top of Solr/Lucene but once I did have to go that route it quickly turned out that several things that are readily available at the code level are so darn difficult to achieve and integrate from the outside. Specifically:

- Field expansion in query parsers is a must (so that unqualified terms are expanded over multiple fields). Any query parser that doesn't support this is in my opinion of zero use. The "default" copy-to sink field known from Solr brings more problems than it solves.

- Exact match-region hit highlighting is a strong expectation. I solved this with matches API (see LUCENE-9461) and flexible query parser's multifield expansion. Works like a charm.

- Multivalued fields are common and sub-document handling is a pain. The problem I raised here is a result of direct user feedback. In real life multivalued fields are omnipresent and searches over those fields can be complex. Users see hits that just should not be there and are confused.

- People do use complex queries. Maybe not all people but there are people out there who do... Just recently I extended flexible query parser with a handcrafted min-should-match operator because it is otherwise not accessible in any Lucene query parser (!). I can make this code available (it's not terribly complex), although, since you asked, I think a query parser that exposes all sorts of "higher level" functionality of intervals would be very, very useful.

It may end up that I'll have to write something for intervals anyway so we can work on this together if you like. Especially the syntax is an open question - should it be operator-based (like the current boost or fuzzy operators) or meta-function-based (so that pseudo-functions would be available). Or maybe a mix of both? I don't know, really. :)

Dawid
Re: Avoiding false-positives in multivalued field search with intervals?
Ok so the more general question is whether we need an interval query parser

On Thu, Sep 10, 2020 at 17:28, Dawid Weiss wrote:

> I am fine with the boundary token suggestion, actually. What I don't see at the moment is how I can marry it with an output of a general query parser (which returns any Query). I could give an attempt to process the query node tree from standard query parser (which we're using at the moment anyway) but if the tree becomes complex there is no guarantee I can extract subtrees that can be parsed into IntervalSources (and then in turn into IntervalQuery).
>
> Dawid
>
> On Thu, Sep 10, 2020 at 4:28 PM jim ferenczi wrote:
> >
> > Right, I misunderstood Alan's answer. The boundary option is not "impure" in my opinion. It solves this issue nicely but maybe it needs something more packaged to add the boundaries and build queries easily.
> >
> > On Thu, Sep 10, 2020 at 16:16, Dawid Weiss wrote:
> >>
> >> Yup - similar to what Alan suggested. I'd have to rewrite the (general text-to-query) query parser to only use intervals though. Still thinking about possible approaches to this.
> >>
> >> D.
> >>
> >> On Thu, Sep 10, 2020 at 3:58 PM jim ferenczi wrote:
> >> >
> >> > You could set a very high position increment gap for multi-valued fields (Analyzer#getPositionIncrementGap) and perform something
> >> > like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?
> >> >
> >> > On Thu, Sep 10, 2020 at 12:32, Dawid Weiss wrote:
> >> >>
> >> >> Yeah... I was thinking about adding synthetic boundaries but this seems... impure. :) Another quick reflection is that I'd have to somehow translate the original query (which can be arbitrarily complex) into an interval query. Tough.
> >> >>
> >> >> D.
> >> >>
> >> >> On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward wrote:
> >> >> >
> >> >> > I’ve solved this sort of thing in the past by indexing boundary tokens, and wrapping the queries with the equivalent of Intervals.notContaining(query, boundary-query); you could also put a very large position increment gap and use a width filter, but that’s a bit more error prone if you could conceivably have lots of text in the individual field entries.
> >> >> >
> >> >> > > On 10 Sep 2020, at 10:38, Dawid Weiss wrote:
> >> >> > >
> >> >> > > Hi Alan,
> >> >> > >
> >> >> > > You're the expert here so I thought I'd ask before I jump in deep. Do you think it's feasible to solve the following multivalued-field problem:
> >> >> > >
> >> >> > > doc: field=["foo", "bar"]
> >> >> > > query: field:(foo AND bar)
> >> >> > >
> >> >> > > I'd like the above to return zero hits (no single value contains both foo and bar), but since multi-valued fields are logically indexed as a single field, it returns doc. I recognize this as a well known problem but subdocuments are not fun to deal with so I'd like to avoid them at all costs.
> >> >> > >
> >> >> > > Would it be possible to solve the above with intervals? Say, something like this:
> >> >> > >
> >> >> > > Intervals.containing(valuePositionRanges(), query).
> >> >> > >
> >> >> > > I assume the containment relationship would get rid of false-positives crossing value boundary here. The problem is in how to construct those value position ranges... Store them at index-construction time somehow? Compute them on the fly for anything that has a chance to match query? Your thoughts would be very appreciated.
> >> >> > >
> >> >> > > Dawid
Re: Avoiding false-positives in multivalued field search with intervals?
I am fine with the boundary token suggestion, actually. What I don't see at the moment is how I can marry it with the output of a general query parser (which returns any Query). I could attempt to process the query node tree from the standard query parser (which we're using at the moment anyway), but if the tree becomes complex there is no guarantee I can extract subtrees that can be parsed into IntervalsSource instances (and then in turn into an IntervalQuery).

Dawid

On Thu, Sep 10, 2020 at 4:28 PM jim ferenczi wrote:
> Right, I misunderstood Alan's answer. [...]
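To make the "no guarantee I can extract subtrees" concern a bit more concrete, here is a rough sketch of a converter that walks a query tree and gives up on any subtree it cannot express as intervals. Everything in it is an assumption made for illustration: the Node, Term, And, Or and Opaque types are hypothetical stand-ins (not the flexible parser's QueryNode classes), the result is just text rather than a real IntervalsSource, and mapping AND to unordered(...) is only one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical mini query tree; illustrative only, not Lucene's QueryNode classes. */
final class IntervalRewriteSketch {
    interface Node {}

    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class And implements Node {
        final List<Node> clauses;
        And(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    /** Stands for any node this converter has no interval equivalent for. */
    static final class Opaque implements Node {}

    /** Returns an interval expression as text, or empty if any part of the subtree is unsupported. */
    static Optional<String> toIntervals(Node node) {
        if (node instanceof Term) return Optional.of("term(" + ((Term) node).text + ")");
        if (node instanceof And) return combine("unordered", ((And) node).clauses);
        if (node instanceof Or) return combine("or", ((Or) node).clauses);
        return Optional.empty(); // unknown node type: the caller must fall back to a regular Query
    }

    private static Optional<String> combine(String op, List<Node> clauses) {
        List<String> parts = new ArrayList<>();
        for (Node clause : clauses) {
            Optional<String> part = toIntervals(clause);
            if (!part.isPresent()) return Optional.empty(); // one bad clause poisons the whole subtree
            parts.add(part.get());
        }
        return Optional.of(op + "(" + String.join(", ", parts) + ")");
    }
}
```

The point of returning Optional is that the caller can keep the parser's original query for any subtree that comes back empty, which is exactly the mixed "in full or in part" situation that makes this awkward.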
Re: Avoiding false-positives in multivalued field search with intervals?
Right, I misunderstood Alan's answer. The boundary option is not "impure" in my opinion. It solves this issue nicely but maybe it needs something more packaged to add the boundaries and build queries easily.

On Thu, Sep 10, 2020 at 16:16, Dawid Weiss wrote:
> Yup - similar to what Alan suggested. [...]
Re: Avoiding false-positives in multivalued field search with intervals?
Yup - similar to what Alan suggested. I'd have to rewrite the (general text-to-query) query parser to only use intervals though. Still thinking about possible approaches to this.

D.

On Thu, Sep 10, 2020 at 3:58 PM jim ferenczi wrote:
> You could set a very high position increment gap for multi-valued fields [...]
Re: Avoiding false-positives in multivalued field search with intervals?
You could set a very high position increment gap for multi-valued fields (Analyzer#getPositionIncrementGap) and perform something like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?

On Thu, Sep 10, 2020 at 12:32, Dawid Weiss wrote:
> Yeah... I was thinking about adding synthetic boundaries but this seems... impure. [...]
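A rough illustration of the gap-plus-width idea, written as a plain-Java model of token positions rather than against Lucene itself. The class and method names are made up for the example, and the exact position arithmetic Lucene uses between values may differ by one from this model. Note also that, as far as I recall, the actual Lucene method is spelled Intervals.maxwidth(int, IntervalsSource) with the width first; please check the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of token positions in a multi-valued field; not Lucene code. */
final class PositionGapSketch {

    /** Assigns positions roughly as a position increment gap would: +1 per token, +gap between values. */
    static Map<String, List<Integer>> index(int gap, String... values) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = -1;
        for (int i = 0; i < values.length; i++) {
            if (i > 0) pos += gap; // jump ahead before the next value starts
            for (String token : values[i].split("\\s+")) {
                pos++;
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
            }
        }
        return positions;
    }

    /** True if some occurrence of a and some occurrence of b fit in a window of at most maxWidth positions. */
    static boolean matchesWithin(Map<String, List<Integer>> positions, String a, String b, int maxWidth) {
        for (int pa : positions.getOrDefault(a, Collections.emptyList())) {
            for (int pb : positions.getOrDefault(b, Collections.emptyList())) {
                int width = Math.abs(pa - pb) + 1; // number of positions the pair spans
                if (width <= maxWidth) return true;
            }
        }
        return false;
    }
}
```

With a gap of 100, index(100, "foo", "bar") puts the two terms more than 100 positions apart, so matchesWithin(..., "foo", "bar", 99) rejects the pair, while index(100, "foo bar") keeps them adjacent and the same call accepts it.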
Re: Avoiding false-positives in multivalued field search with intervals?
Yeah... I was thinking about adding synthetic boundaries but this seems... impure. :) Another quick reflection is that I'd have to somehow translate the original query (which can be arbitrarily complex) into an interval query. Tough.

D.

On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward wrote:
> I’ve solved this sort of thing in the past by indexing boundary tokens [...]
Re: Avoiding false-positives in multivalued field search with intervals?
I’ve solved this sort of thing in the past by indexing boundary tokens, and wrapping the queries with the equivalent of Intervals.notContaining(query, boundary-query); you could also put a very large position increment gap and use a width filter, but that’s a bit more error prone if you could conceivably have lots of text in the individual field entries.

> On 10 Sep 2020, at 10:38, Dawid Weiss wrote:
>
> Hi Alan,
>
> You're the expert here so I thought I'd ask before I jump in deep. Do you think it's feasible to solve the following multivalued-field problem:
>
> doc: field=["foo", "bar"]
> query: field:(foo AND bar)
>
> I'd like the above to return zero hits (no single value contains both foo and bar), but since multi-valued fields are logically indexed as a single field, it returns doc. I recognize this as a well known problem but subdocuments are not fun to deal with so I'd like to avoid them at all costs.
>
> Would it be possible to solve the above with intervals? Say, something like this:
>
> Intervals.containing(valuePositionRanges(), query).
>
> I assume the containment relationship would get rid of false-positives crossing value boundary here. The problem is in how to construct those value position ranges... Store them at index-construction time somehow? Compute them on the fly for anything that has a chance to match query? Your thoughts would be very appreciated.
>
> Dawid
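For illustration, the boundary-token approach can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Lucene code: the BOUNDARY sentinel, the class and the method names are all hypothetical, and a real implementation would emit the sentinel from the analysis chain and use interval sources instead of scanning a token list.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a multi-valued field indexed with a sentinel token between values. */
final class BoundaryTokenSketch {
    /** Hypothetical sentinel; it must never occur in real field text. */
    static final String BOUNDARY = "__BOUNDARY__";

    /** Flattens the values into one token stream, as a multi-valued field is indexed. */
    static List<String> tokens(String... values) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) out.add(BOUNDARY);
            for (String token : values[i].split("\\s+")) out.add(token);
        }
        return out;
    }

    /** Plain boolean AND: both terms occur somewhere in the field, regardless of value. */
    static boolean booleanAnd(List<String> tokens, String a, String b) {
        return tokens.contains(a) && tokens.contains(b);
    }

    /** Mimics notContaining(unordered(a, b), boundary): the span covering a and b must not cross a boundary. */
    static boolean sameValue(List<String> tokens, String a, String b) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(b)) continue;
                int from = Math.min(i, j);
                int to = Math.max(i, j);
                if (!tokens.subList(from, to + 1).contains(BOUNDARY)) return true;
            }
        }
        return false;
    }
}
```

For field=["foo", "bar"], booleanAnd reports a hit (the false positive from the original question) while sameValue does not. Unlike a width filter, the check does not depend on how long an individual value is.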