RE: priority of query results with text alternates

Trevor Nicholls Sun, 06 Oct 2024 02:01:02 -0700

Hi


You’ve broken through a blind spot – I’ve been very comfortable with putting 
different content into different fields (with different analyzers etc.) but 
completely missed the idea of putting the same content into two different 
fields.

 

I’m quite confident that this is the way to do it.

 

Thanks

T

 

From: Ralf Heyde <ralf.he...@gmx.de.INVALID> 
Sent: Sunday, 6 October 2024 19:27
To: java-user@lucene.apache.org
Subject: Re: priority of query results with text alternates

 

Hey,

 

In case I have such an issue, I usually use more than one field with 
different analyzer setups and weight (multiply) the score of each field 
individually (index / query).

 



 
BoostQuery (Lucene 9.11.0 core API):
<https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/BoostQuery.html>

 

That may solve it. 
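
To make the idea concrete, here is a minimal plain-Java sketch of scoring the 
same content through two "fields" with different weights. It is a toy model 
and not Lucene code: the class TwoFieldRanking, the weights 3.0 and 1.0, and 
the all-or-nothing scoring rule are assumptions made up for illustration. In 
Lucene the per-field weight would presumably be applied by wrapping each 
field's query in the BoostQuery linked above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of two-field scoring; the weights and scoring rule are illustrative, not Lucene's. */
final class TwoFieldRanking {
    // Hypothetical weights: a verbatim match counts for more than a normalized match.
    static final double EXACT_BOOST = 3.0;
    static final double LOOSE_BOOST = 1.0;

    /** "Loose" field value: the text with the conjunction characters . - _ removed. */
    static String normalize(String text) {
        return text.replaceAll("[._-]", "");
    }

    /** Sum of the per-field scores, each multiplied by its field's boost. */
    static double score(String query, String document) {
        double exact = document.equals(query) ? EXACT_BOOST : 0.0;
        double loose = normalize(document).equals(normalize(query)) ? LOOSE_BOOST : 0.0;
        return exact + loose;
    }

    /** Returns the matching documents, best score first (ties keep input order). */
    static List<String> search(String query, List<String> documents) {
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            if (score(query, doc) > 0) {
                hits.add(doc);
            }
        }
        hits.sort(Comparator.comparingDouble((String doc) -> score(query, doc)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("a-b", List.of("ab", "a_b", "a-b", "xy")));
        // prints [a-b, ab, a_b]
    }
}
```

All three variants still match through the loose field, but only the verbatim 
"a-b" also collects the exact-field boost, so it sorts first.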

 

Cheers

 

 

Sent from my phone; I cannot rule out spelling mistakes

 

Phoning shortens the email back-and-forth





On 06.10.2024 at 06:28, Trevor Nicholls <tre...@castingthevoid.com> wrote:

(Currently using Lucene 8.6.3, although not averse to moving to a later
release if there's a recent feature I need for this)



My application searches technical documents, a mix of normal text, source
code and expressions involving more than letters and digits.



The users want to be able to search for "compound" terms and find any of the
ways the terms may be joined. As an example let's use
"app.server-file_name";  this should also find "app_server_file_name",
"app-server_file-name" and "app-serverfilename", etc.



I have implemented this via filters in the analyzer which duplicate compound
terms by splitting them at the conjunction character and outputting copies
with and without the conjunction.



Thus given the input "app.server-file_name" we first obtain the tokens
[app.] [server-] [file_] [name], then replicate them so that the token
stream output by the analyzer contains both



   [app.] [server-] [file_] [name]



and



   [app] [server] [file] [name]



with all the correct offsets.
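
The duplication step can be sketched in plain Java as follows. This is only 
an illustration of the splitting logic, not the actual analyzer filters: the 
class CompoundTokens is hypothetical, the conjunction set ". - _" is assumed 
from the examples in this message, and offsets are not tracked.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of the token duplication (not a Lucene TokenFilter). */
final class CompoundTokens {
    // Assumed set of conjunction characters, taken from the examples in this message.
    private static final String CONJUNCTIONS = ".-_";

    /** Splits after each conjunction character, keeping the character on the left-hand part. */
    static List<String> split(String compound) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : compound.toCharArray()) {
            current.append(c);
            if (CONJUNCTIONS.indexOf(c) >= 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    /** Returns a copy of the parts with any trailing conjunction character removed. */
    static List<String> strip(List<String> parts) {
        List<String> stripped = new ArrayList<>();
        for (String part : parts) {
            int last = part.length() - 1;
            boolean endsWithConjunction = last >= 0 && CONJUNCTIONS.indexOf(part.charAt(last)) >= 0;
            String bare = endsWithConjunction ? part.substring(0, last) : part;
            if (!bare.isEmpty()) {
                stripped.add(bare);
            }
        }
        return stripped;
    }

    public static void main(String[] args) {
        List<String> withConjunctions = split("app.server-file_name");
        System.out.println(withConjunctions);        // [app., server-, file_, name]
        System.out.println(strip(withConjunctions)); // [app, server, file, name]
    }
}
```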



The same analyzer is applied both to the indexed content and to the search
terms.



This works beautifully for compound terms; the query results are
conjunction-character-agnostic and all the possible ways of finding the
compound are matched.



However there's a flaw here, because a couple of the possible conjunction
characters (specifically hyphen and full stop) have other uses as well, e.g.
as a minus sign in an expression or a decimal point in a value.



Because the analyzer is treating input a-b, ab, a.b identically, the results
of a search for e.g. "a-b" do not put "a-b" matches ahead of "ab" (or
"a_b"). If I could somehow fix this issue I'd be completely happy. Is there
a better way of doing what I am trying to do here?



cheers

T
