Hey,
When I run into this kind of issue, I usually index the text into more than one field, each with a different analyzer setup, and weight (boost) the score of each field individually, either at index time or at query time.
That may solve it.
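Roughly what I mean, as a library-free sketch rather than real Lucene code: one "exact" view of the text that keeps the conjunction characters, one "loose" view that drops them, and a higher weight on the exact view. The class name, the method names and the boost values are made up for illustration. In Lucene itself I would expect this to map onto two fields analyzed differently (for example via PerFieldAnalyzerWrapper) and a boosted clause per field (for example BoostQuery), but treat that mapping as an assumption to check against your version.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TwoFieldScoreSketch {
    // Hypothetical weights: a match on the exact form counts more than a loose match.
    static final double EXACT_BOOST = 3.0;
    static final double LOOSE_BOOST = 1.0;

    // "Loose" analysis: drop the conjunction characters '.', '_' and '-'.
    static String loose(String text) {
        return text.replaceAll("[._-]", "");
    }

    // Sum of the per-field contributions, one per analyzer setup.
    static double score(String query, String doc) {
        double s = 0.0;
        if (doc.equals(query)) {
            s += EXACT_BOOST;               // exact field matched
        }
        if (loose(doc).equals(loose(query))) {
            s += LOOSE_BOOST;               // loose field matched
        }
        return s;
    }

    // Highest score first; the sort is stable, so ties keep their input order.
    static List<String> rank(String query, List<String> docs) {
        List<String> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((String d) -> score(query, d)).reversed());
        return sorted;
    }
}
```

With these weights a query for "a-b" still finds "ab" and "a_b" through the loose view, but "a-b" itself collects both contributions and sorts first.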
Cheers
Sent from my phone; I cannot rule out spelling mistakes.
Phoning cuts down on the email back-and-forth.
On 06.10.2024 at 06:28, Trevor Nicholls <tre...@castingthevoid.com> wrote:
(Currently using Lucene 8_6_3, although not averse to moving to a later release if there's a recent feature I need for this)
My application searches technical documents, a mix of normal text, source code and expressions involving more than letters and digits.
The users want to be able to search for "compound" terms and find any of the ways the terms may be joined. As an example let's use "app.server-file_name"; this should also find "app_server_file_name", "app-server_file-name" and "app-serverfilename", etc.
I have implemented this via filters in the analyzer which duplicate compound terms by splitting them at the conjunction character and outputting copies with and without the conjunction.
Thus given the input "app.server-file_name" we first obtain the tokens [app.] [server-] [file_] [name], then replicate them so that the token stream output by the analyzer contains both
[app.] [server-] [file_] [name]
and
[app] [server] [file] [name]
with all the correct offsets.
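To make the duplication step concrete, here is a library-free sketch of it. It is an illustration only, not the actual filter code: the class name, the method names and the exact set of conjunction characters are assumptions, and offsets and position increments are left out.

```java
import java.util.ArrayList;
import java.util.List;

class CompoundSplitSketch {
    // Assumed set of conjunction characters; the real filter may use a different set.
    static final String CONJUNCTIONS = "._-";

    // Splits after each conjunction character, keeping it on the preceding part:
    // "app.server-file_name" -> [app.] [server-] [file_] [name]
    static List<String> split(String term) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < term.length(); i++) {
            if (CONJUNCTIONS.indexOf(term.charAt(i)) >= 0) {
                parts.add(term.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < term.length()) {
            parts.add(term.substring(start));   // trailing part without a conjunction
        }
        return parts;
    }

    // Emits each part, followed by a copy with the trailing conjunction removed.
    // Parts that carry no conjunction (such as [name]) are emitted once.
    static List<String> expand(String term) {
        List<String> out = new ArrayList<>();
        for (String part : split(term)) {
            out.add(part);
            char last = part.charAt(part.length() - 1);
            if (part.length() > 1 && CONJUNCTIONS.indexOf(last) >= 0) {
                out.add(part.substring(0, part.length() - 1));
            }
        }
        return out;
    }
}
```

In the sketch the stripped copy simply follows its original in the list; in a real token stream it would share the position and offsets of the original token.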
The same analyzer is applied both to the indexed content and to the search terms.
This works beautifully for compound terms; the query results are conjunction-character-agnostic and all the possible ways of finding the compound are matched.
However, there's a flaw here, because a couple of the possible conjunction characters (specifically hyphen and full stop) have other uses as well, e.g. as a minus sign in an expression or a decimal point in a value.
Because the analyzer is treating input a-b, ab, a.b identically, the results of a search for e.g. "a-b" do not put "a-b" matches ahead of "ab" (or "a_b"). If I could somehow fix this issue I'd be completely happy. Is there a better way of doing what I am trying to do here?
cheers
T