Re: Proximity Search with Phrases

Mikhail Khludnev Thu, 02 Jul 2026 15:42:54 -0700

Working on Intervals QP: https://github.com/apache/solr/pull/4582
Stay tuned. Feedback is welcome!


On Fri, Sep 12, 2025 at 10:43 PM Mikhail Khludnev <[email protected]> wrote:

> I've checked the surround parser. It turns out that it lacks support for
> braces. I've also added a reproducer for the nested spans issue, which
> intervals are able to handle:
>
> https://github.com/mkhludnev/solr-flexible-qparser/blob/860e17c16153b1d3ef337f099b0d9f572620e9b1/src/test/java/org/apache/solr/flexibleqp/TestCompeteWithSpans.java#L49
>
>
> On Tue, Sep 9, 2025 at 1:12 PM Mikhail Khludnev <[email protected]> wrote:
>
>> Right. complexphrase is not an option for nesting.
>> I'm wondering whether you have encountered
>> https://issues.apache.org/jira/browse/LUCENE-7398
>> Please let us know if you have.
>> I'm interested in whether intervals are an option for such cases.
>>
>> On Mon, Sep 8, 2025 at 6:31 PM Matt Kuiper <[email protected]> wrote:
>>
>>> Thanks for the feedback!
>>>
>>> Mikhail - I did not see the complex phrase query parser supporting
>>> proximity between two phrases; however, the XmlQParser might, via spans.
>>> Thanks for the tip!
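
A minimal sketch of how the nested-span query quoted later in this thread might be expressed for the XML query parser, built and URL-encoded with Python's standard library. The element and attribute names (SpanNear, SpanTerm, fieldName, slop, inOrder) and the {!xmlparser} prefix are assumptions based on the parser documentation linked in this thread and should be checked against your Solr version. The helper names add_phrase and phrase_proximity_xml are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def add_phrase(parent, words):
    # Hypothetical helper: an ordered SpanNear with slop 0 models an exact phrase.
    phrase = ET.SubElement(parent, "SpanNear", slop="0", inOrder="true")
    for word in words:
        ET.SubElement(phrase, "SpanTerm").text = word
    return phrase


def phrase_proximity_xml(field, phrase_a, phrase_b, slop):
    # Outer unordered SpanNear wrapping two phrase spans.
    # Assumption: fieldName on the outer element is inherited by the children.
    outer = ET.Element("SpanNear", fieldName=field, slop=str(slop), inOrder="false")
    add_phrase(outer, phrase_a.split())
    add_phrase(outer, phrase_b.split())
    return ET.tostring(outer, encoding="unicode")


xml_query = phrase_proximity_xml("body", "separate email", "will be", 10)
# urlencode percent-escapes the braces, angle brackets and quotes for the URL.
params = urlencode({"q": "{!xmlparser}" + xml_query, "debug": "true"})
print(xml_query)
print("/select?" + params)
```

Running with debug=true and inspecting parsedquery would show whether the server actually built a span query from this input.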
>>>
>>> Gus - we currently use the Surround query parser for proximity between two
>>> terms. Do you know of a means to use it for proximity between phrases?
>>> This would be ideal as we have a search client tool already using this
>>> syntax.
>>>
>>> Dave - This type of approach might work for us (possibly like the complex
>>> phrase query parser): it does not exactly find proximity between two
>>> phrases, but verifies that all the words within the two phrases are within
>>> a proximity range. As you say, this could keep stop words that may still
>>> be in the index from blocking a match.
>>>
>>> Matt
>>>
>>> On Mon, Sep 8, 2025 at 7:29 AM Dave <[email protected]> wrote:
>>>
>>> > There are other clever ways to do it too, using the within parameter and
>>> > other things I don’t remember off the top of my head, but I gave a
>>> > presentation a few years ago that used it. It uses more raw Solr
>>> > parameters: you take in a phrase, tokenize it, and find documents that
>>> > contain all the words of the phrase, possibly with other words in
>>> > between. You restrict the results to documents that have all the words
>>> > within the phrase length plus 2 or 3, to take care of stop words that
>>> > may show up. For example, “red house hill” would still find “red house
>>> > on top of the hill”, with the words within a proximity of about 7 of
>>> > each other.
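
The windowing idea described above can be sketched outside Solr as follows. This is only an illustration of the logic (all words of the phrase must fall inside a window of limited width), not the Solr parameters the message refers to. The function names and the max_width parameter are hypothetical.

```python
def min_window_width(tokens, terms):
    """Width (in tokens) of the smallest window containing every term,
    or None if some term never occurs."""
    needed = set(terms)
    counts = {}
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            ltok = tokens[left]
            if ltok in counts:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    del counts[ltok]
            left += 1
    return best


def all_words_within(tokens, phrase, max_width):
    """True when every word of the phrase occurs inside one window of
    at most max_width tokens (word order is ignored)."""
    width = min_window_width(tokens, phrase.split())
    return width is not None and width <= max_width


doc = "red house on top of the hill".split()
print(min_window_width(doc, "red house hill".split()))
print(all_words_within(doc, "red house hill", 7))
```

With the example from the message, the three words span seven tokens, so a window of 7 accepts the document while a tighter window such as 5 rejects it.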
>>> >
>>> > > On Sep 7, 2025, at 7:15 PM, Gus Heck <[email protected]> wrote:
>>> > >
>>> > > Or
>>> > >
>>> >
>>> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#surround-query-parser
>>> > >
>>> > >> On Sun, Sep 7, 2025 at 4:32 PM Mikhail Khludnev <[email protected]> wrote:
>>> > >>
>>> > >> Hi
>>> > >> I might be missing a point. But the way to create spans in Solr are:
>>> > >>
>>> > >>
>>> >
>>> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#xml-query-parser
>>> > >>
>>> > >>
>>> >
>>> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#complex-phrase-query-parser
>>> > >>
>>> > >>
>>> > >>> On Fri, Sep 5, 2025 at 6:32 PM mtn search <[email protected]> wrote:
>>> > >>>
>>> > >>> I may have found what I am running up against, if ChatGPT is correct
>>> > >>> in its diagnosis.
>>> > >>>
>>> > >>> *My sample query*
>>> > >>> /select?debug=true&indent=true&q={!lucene}spanNear(
>>> > >>>  spanNear(spanTerm(body:separate),spanTerm(body:email),0,true),
>>> > >>>  spanNear(spanTerm(body:will),spanTerm(body:be),0,true),
>>> > >>>  10,false)
>>> > >>>
>>> > >>> *Text from the body field of a message that is returned by the
>>> > >>> spanNear query above (I believe incorrectly)*
>>> > >>>       "separate device there will not be any load on the email servers"
>>> > >>>
>>> > >>> *Same text through analyzer*
>>> > >>> text      raw_bytes                  start  end
>>> > >>> separate  [73 65 70 61 72 61 74 65]  5      13
>>> > >>> device    [64 65 76 69 63 65]        14     20
>>> > >>> there     [74 68 65 72 65]           21     26
>>> > >>> will      [77 69 6c 6c]              27     31
>>> > >>> not       [6e 6f 74]                 32     35
>>> > >>> be        [62 65]                    36     38
>>> > >>> any       [61 6e 79]                 39     42
>>> > >>> load      [6c 6f 61 64]              43     47
>>> > >>> on        [6f 6e]                    48     50
>>> > >>> the       [74 68 65]                 51     54
>>> > >>> email     [65 6d 61 69 6c]           55     60
>>> > >>> server    [73 65 72 76 65 72]        61     68
>>> > >>>
>>> > >>> *Chatgpt assessment*
>>> > >>>
>>> > >>>    Now, let’s check the spans:
>>> > >>>
>>> > >>>   - Inner spanNear(separate, email, 0, true) is *not* going to match
>>> > >>>     directly, because email isn’t right after separate.
>>> > >>>   - But Lucene is allowed to *reposition* the spans when used as
>>> > >>>     children of the outer spanNear. Each child span doesn’t need to be
>>> > >>>     contiguous unless it resolves to a valid match somewhere in the text.
>>> > >>>
>>> > >>> *Conclusion: *This last line may explain why the message above was
>>> > >>> returned by the query above, although the match appears to be incorrect.
>>> > >>> While the words/tokens in the query are in the message, they do not
>>> > >>> honor the proximity specified. But apparently child spans do not have to
>>> > >>> honor the proximity rules specified. AI suggested this query for
>>> > >>> proximity; I am now concluding it is not a valid approach.
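
As a cross-check on the token positions listed above, here is a small brute-force model of nested "near" matching. It is a deliberately simplified sketch and not Lucene's span implementation. It assumes each analyzed token occupies consecutive positions, and the helper names term and near are hypothetical. Under this model neither inner phrase matches the sample text, so the outer proximity cannot match either. The parsed-query debug output quoted elsewhere in this thread, where SpanNearQuery(...) became plain body: term clauses, hints that the expression may not have been parsed as a span query at all.

```python
from itertools import product

# Tokens in analyzer order; positions are assumed to be 0, 1, 2, ...
TOKENS = ["separate", "device", "there", "will", "not", "be",
          "any", "load", "on", "the", "email", "server"]


def term(tokens, word):
    """Spans (start, end), end exclusive, for each occurrence of word."""
    return [(i, i + 1) for i, t in enumerate(tokens) if t == word]


def near(children, slop, in_order):
    """Simplified model: one span per child, no overlaps, and the total
    number of positions between consecutive spans must not exceed slop."""
    matches = []
    for combo in product(*children):
        spans = list(combo) if in_order else sorted(combo)
        pairs = list(zip(spans, spans[1:]))
        # Reject combinations that overlap or violate the required order.
        if any(a[1] > b[0] for a, b in pairs):
            continue
        if sum(b[0] - a[1] for a, b in pairs) <= slop:
            matches.append((spans[0][0], spans[-1][1]))
    return matches


def phrases_near(tokens):
    # Mirrors the shape of the sample query: two exact phrases within 10.
    first = near([term(tokens, "separate"), term(tokens, "email")], 0, True)
    second = near([term(tokens, "will"), term(tokens, "be")], 0, True)
    return near([first, second], 10, False)


doc_matches = phrases_near(TOKENS)
other_matches = phrases_near("the separate email will be sent".split())
print(doc_matches, other_matches)
```

The second call uses a made-up sentence in which both phrases do occur contiguously, to show that the model can produce a match.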
>>> > >>>
>>> > >>> I am not seeing a Solr/Lucene HTTP query approach for a proximity
>>> > >>> search between phrases, other than possibly using the Lucene Java API
>>> > >>> for more control.
>>> > >>>
>>> > >>> If others have found a workable solution, please let me know.
>>> > >>>
>>> > >>> Thanks,
>>> > >>> Matt
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>> On Thu, Sep 4, 2025 at 3:26 PM mtn search <[email protected]> wrote:
>>> > >>>
>>> > >>>> Also, I am using the Solr Admin Analysis UI to verify how Solr is
>>> > >>>> tokenizing the messages, and manually verifying the positions of the
>>> > >>>> tokens.
>>> > >>>>
>>> > >>>> Debug view of the query side:
>>> > >>>> For query:
>>> > >>>> "*params*":{
>>> > >>>>      "q":"{!lucene}SpanNearQuery(body,(money question),5,true)",
>>> > >>>>      "df":"body",
>>> > >>>>      "debug":"true",
>>> > >>>>      "indent":"true",
>>> > >>>>      "q.op":"OR",
>>> > >>>>      "wt":"json"}},
>>> > >>>>
>>> > >>>> It seems odd that in the parsed query the "body" field name is
>>> > >>>> prepended to the value 5 and the text true.
>>> > >>>>  "*debug*":{
>>> > >>>>    "rawquerystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
>>> > >>>>    "querystring":"{!lucene}SpanNearQuery(body,(money question),5,true)",
>>> > >>>>    "parsedquery":"body:spannearquery (body:body (body:money body:question) (body:5 body:true))",
>>> > >>>>    "*parsedquery_toString*":*"body:spannearquery *(body:body (body:money body:question)* (body:5 body:true*))",
>>> > >>>>    "explain":{
>>> > >>>>
>>> > >>>> On Thu, Sep 4, 2025 at 12:04 PM mtn search <[email protected]> wrote:
>>> > >>>>
>>> > >>>>> Thanks Tim! Yes, I have tried a variety of values and am aware of
>>> > >>>>> ordering vs. non-ordering. I am getting more results than expected,
>>> > >>>>> and some that do not match the proximity criteria. So when I set it
>>> > >>>>> to a small value like 2, I expected to see the result count drop
>>> > >>>>> significantly, as many would not match the criteria. Unfortunately,
>>> > >>>>> the count does not drop. Looks like a fundamental problem with how I
>>> > >>>>> am using the syntax. Still researching, and open to suggestions.
>>> > >>>>>
>>> > >>>>> Matt
>>> > >>>>>
>>> > >>>>> On Thu, Sep 4, 2025 at 11:54 AM Tim Casey <[email protected]> wrote:
>>> > >>>>>
>>> > >>>>>> Usually the span and proximity problems are off-by-one issues.
>>> > >>>>>> Specifically, the order of the tokens will change the distance
>>> > >>>>>> calculation. I do not have an example off the top of my head. But
>>> > >>>>>> when I was doing this, I usually started with a larger span and
>>> > >>>>>> brought it down by looking at the results.
>>> > >>>>>>
>>> > >>>>>> This is the case for the old "phrase words"~5 syntax.
>>> > >>>>>>
>>> > >>>>>> As an aside, I take "not working" to mean you are not getting
>>> > >>>>>> results but the query passes parsing. "Not working" could mean a lot
>>> > >>>>>> more in this context. So I am suggesting, instead of 2, try 10.
>>> > >>>>>>
>>> > >>>>>> On Thu, Sep 4, 2025 at 10:43 AM mtn search <[email protected]> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hello,
>>> > >>>>>>>
>>> > >>>>>>> Looking for guidance on approaches to implement a proximity search
>>> > >>>>>>> between phrases.
>>> > >>>>>>>
>>> > >>>>>>> Initially tried:
>>> > >>>>>>>
>>> > >>>>>>> "q":"{!lucene}spanNear(spanNear(spanNear(spanTerm(body:off),spanTerm(body:the),0,true),
>>> > >>>>>>> spanTerm(body: record),0,true), spanNear(spanTerm(body:new),spanTerm(body:
>>> > >>>>>>> information),0,true) , 2N,false)",
>>> > >>>>>>>      "defType":"lucene",
>>> > >>>>>>>      "df":"body",
>>> > >>>>>>>
>>> > >>>>>>> However then simplified to just two terms:
>>> > >>>>>>>
>>> > >>>>>>> "q":"{!lucene}spanNear(spanTerm(body:off),spanTerm(body:call),2,true)",
>>> > >>>>>>>      "defType":"lucene",
>>> > >>>>>>>      "df":"body",
>>> > >>>>>>>
>>> > >>>>>>> Neither is working. Any tips? Currently on Solr 9.4, but will
>>> > >>>>>>> likely need to run for some time on a Solr 6 instance.
>>> > >>>>>>>
>>> > >>>>>>> Thanks,
>>> > >>>>>>> Matt
>>> > >>>>>>>
>>> > >>>>>>
>>> > >>>>>
>>> > >>>
>>> > >>
>>> > >>
>>> > >> --
>>> > >> Sincerely yours
>>> > >> Mikhail Khludnev
>>> > >>
>>> > >
>>> > >
>>> > > --
>>> > > http://www.needhamsoftware.com (work)
>>> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
>>> >
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Sincerely yours
Mikhail Khludnev
