(Currently using Lucene 8.6.3, although I'm not averse to moving to a later
release if there's a recent feature I need for this.)

My application searches technical documents, a mix of normal text, source
code and expressions involving more than letters and digits.

The users want to be able to search for "compound" terms and find any of the
ways the terms may be joined. As an example, take "app.server-file_name";
a search for this should also find "app_server_file_name",
"app-server_file-name", "app-serverfilename", etc.

I have implemented this with filters in the analyzer that duplicate compound
terms: each term is split at its conjunction characters, and copies are
output both with and without the conjunction.

Thus, given the input "app.server-file_name", we first obtain the tokens
[app.] [server-] [file_] [name], then replicate them so that the token
stream output by the analyzer contains both

    [app.] [server-] [file_] [name]

and

    [app] [server] [file] [name]

with all the correct offsets.
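
As a rough illustration of the splitting step only, here is a minimal
plain-Java sketch. It is not the actual Lucene filter code: the class name
CompoundSplitSketch, the method split and the conjunction set "._-" are all
assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundSplitSketch {
    // Assumed conjunction characters; the real filter may recognise a different set.
    private static final String CONJUNCTIONS = "._-";

    // Splits a compound term after each conjunction character.
    // keepConjunction = true  keeps the trailing conjunction on each token,
    // keepConjunction = false drops it.
    public static List<String> split(String term, boolean keepConjunction) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < term.length(); i++) {
            if (CONJUNCTIONS.indexOf(term.charAt(i)) >= 0) {
                int end = keepConjunction ? i + 1 : i;
                if (end > start) {
                    tokens.add(term.substring(start, end));
                }
                start = i + 1;
            }
        }
        // Whatever follows the last conjunction is the final token.
        if (start < term.length()) {
            tokens.add(term.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String term = "app.server-file_name";
        System.out.println(split(term, true));   // [app., server-, file_, name]
        System.out.println(split(term, false));  // [app, server, file, name]
    }
}
```

The sketch leaves out offsets and position increments, which a real
TokenFilter would also have to set on each emitted token.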

The same analyzer is applied both to the indexed content and to the search
terms.

This works beautifully for compound terms; the query results are
conjunction-character-agnostic and every variant of the compound is
matched.

However, there's a flaw here: a couple of the possible conjunction
characters (specifically hyphen and full stop) also have other uses, e.g.
as a minus sign in an expression or a decimal point in a value.

Because the analyzer treats the inputs "a-b", "ab" and "a.b" identically,
the results of a search for e.g. "a-b" do not rank "a-b" matches ahead of
"ab" (or "a_b") matches. If I could somehow fix this issue I'd be completely
happy. Is there a better way of doing what I am trying to do here?

cheers

T