Hi Uwe/Mikhail

(Note: you'll probably need to restore any line breaks your mailer has decided 
to filter out for the following to make sense.)

At the moment the content is analysed twice, so I have a "text" field and an 
"exactText" field.

The text field uses a custom analyzer. This is because none of the standard 
analysers really matches the content well; imagine our content containing 
things like:

"string 1" + "string 2"
price * 1.025/decimalplaces=3
customer:title+" "+first_name/newline
http_proxy_server
license.server.address
act as web server

The custom analyzer splits the input with a PatternTokenizer, and includes a
LowerCaseFilter and a custom filter which I have called ClipWordFilter.
The pattern tokenizer retains all punctuation but splits words at delimiters
like full stops, underscores and hyphens. The tokens it emits include these
trailing delimiters, and the ClipWordFilter then duplicates each such token
without its delimiter.
Thus for example http_proxy_server produces the tokens [http_] [http] [proxy_]
[proxy] and [server].
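
For concreteness, a simplified sketch of the chain (the delimiter pattern
shown here is illustrative, and this ClipWordFilter is a minimal stand-in for
my real one):

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.pattern.PatternTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public class ContentAnalyzer extends Analyzer {
        // A token is a run of non-delimiter characters plus an optional
        // trailing delimiter (full stop, underscore or hyphen).
        private static final Pattern WORD = Pattern.compile("[^\\s._-]+[._-]?");

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new PatternTokenizer(WORD, 0); // group 0 = whole match
            TokenStream sink = new ClipWordFilter(new LowerCaseFilter(source));
            return new TokenStreamComponents(source, sink);
        }
    }

    final class ClipWordFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);
        private State pending; // token awaiting its clipped duplicate

        ClipWordFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (pending != null) {
                // Emit the clipped duplicate at the same position.
                restoreState(pending);
                pending = null;
                termAtt.setLength(termAtt.length() - 1); // drop trailing delimiter
                posIncAtt.setPositionIncrement(0);
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            int len = termAtt.length();
            if (len > 1) {
                char last = termAtt.charAt(len - 1);
                if (last == '.' || last == '_' || last == '-') {
                    pending = captureState(); // duplicate on the next call
                }
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending = null;
        }
    }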

The exactText field is analysed using a TextAnalyzer, which combines a
WhitespaceTokenizer and a LowerCaseFilter. Thus http_proxy_server produces one
token, [http_proxy_server].
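
i.e. in essence:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class TextAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new WhitespaceTokenizer();
            return new TokenStreamComponents(source, new LowerCaseFilter(source));
        }
    }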

The text field means that somebody searching for proxy.server, proxy_server,
or proxy server will find matches for any of those forms. But because my
search method boosts exactText matches, the exact form they search for will be
scored higher.
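
(The boosting is along these lines; the field names are as above, but the
boost factor here is illustrative:)

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // User query "proxy_server": analyzed words against text, the raw
    // lower-cased form against exactText, with exactText boosted.
    Query analyzed = new PhraseQuery("text", "proxy", "server");
    Query exact = new BoostQuery(
        new TermQuery(new Term("exactText", "proxy_server")), 4.0f);
    Query q = new BooleanQuery.Builder()
        .add(analyzed, Occur.SHOULD)
        .add(exact, Occur.SHOULD)
        .build();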

Because the custom analyzer produces word tokens I have incorporated synonyms,
applied at indexing time only. The synonyms are bidirectional.
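
Each pair is registered in both directions when the SynonymMap is built,
roughly like this (loading of the synonym list omitted):

    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    SynonymMap.Builder builder = new SynonymMap.Builder(true); // true = dedup
    CharsRef phrase = new CharsRef("web" + SynonymMap.WORD_SEPARATOR + "server");
    CharsRef joined = new CharsRef("webserver");
    builder.add(phrase, joined, true); // web server => webserver (keep original)
    builder.add(joined, phrase, true); // webserver => web server (keep original)
    SynonymMap map = builder.build();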

OK, that's all background.

The problem I am trying to fix at the moment is that searching for these
compounds works provided all the delimiters are present, but fails if any of
them is omitted. So, as a trivial example, searching for "act as web server"
or for "act_as_web_server" will find all the targets, but searching for "act
as webserver" will not.

I have a list of synonyms, loaded by the analyzer, which adds every possible
breakdown of that phrase. If I log the activity while building the synonym map
I can see the following:

act as = actas
actas = act as
act as web = actasweb
actasweb = act as web
act as web server = actaswebserver
actaswebserver = act as web server
as web = asweb
asweb = as web
as web server = aswebserver
aswebserver = as web server
web server = webserver
webserver = web server
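
(The token listings below were produced with the usual token-stream
consumption loop, i.e. roughly:)

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    try (TokenStream ts = analyzer.tokenStream("text", "act as web server")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println("[" + term + "]");
        }
        ts.end();
    }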

If I analyze a document which contains the single line "act as web server",
these are the tokens generated:

[actaswebserver]
[act]
[as]
[web]
[server]

So I've got one synonym, but not all the possibilities.

If the document contains: act_as_web_server
then the tokens are:

[act]
[act_]
[as]
[as_]
[web]
[web_]
[server]

i.e. no synonyms at all.

In either case a user searching for the phrase but specifying 'webserver' will 
not find it.

Just to demonstrate that this is only a problem with multi-word synonyms, I 
repeated the exercise with a different synonym table:

act = work
as = like
web = internet
server = host

The input: [act as web server] generates:

[work]
[act]
[like]
[as]
[internet]
[web]
[host]
[server]

and the input: [act_as_web_server] generates:

[work]
[act]
[act_]
[like]
[as]
[as_]
[internet]
[web]
[web_]
[host]
[server]

So with single-word synonyms I'm getting everything I would expect, and a user
searching for any particular combination of synonyms will find it.

I had hoped for the same with phrases. But maybe my expectations are too high, 
or maybe I'm just doing it wrong.

cheers
T


-----Original Message-----
From: Uwe Schindler <u...@thetaphi.de> 
Sent: Monday, 10 March 2025 23:38
To: java-user@lucene.apache.org
Subject: Re: Synonyms and searching

Hi,

Another way to do this is to use the Word Delimiter Filter with its "catenate"
options. Be aware that you then need different tokenization (not the standard
tokenizer, but the WhitespaceTokenizer). This approach is common for product
numbers.
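
(For instance, with WordDelimiterGraphFilter, the non-deprecated variant in
8.x; the flag choice here is illustrative:)

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

    Tokenizer source = new WhitespaceTokenizer();
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
              | WordDelimiterGraphFilter.CATENATE_WORDS
              | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
    // http_proxy_server -> http, proxy, server, httpproxyserver,
    // http_proxy_server
    TokenStream sink = new WordDelimiterGraphFilter(source, flags, null);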

To avoid breaking your "normal" analysis, it is often a good idea to have a
separate field with the "alternative" analysis, which also applies other
things like stemming, and to search both fields with a DisMax query, ranking
the field based on the "normal" standard analyzer higher.
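
(A sketch of that DisMax combination; the field names and boost here are
illustrative:)

    import java.util.Arrays;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.DisjunctionMaxQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    Query q = new DisjunctionMaxQuery(
        Arrays.asList(
            new BoostQuery(new TermQuery(new Term("text", "proxyserver")), 2.0f),
            new TermQuery(new Term("textAlt", "proxyserver"))),
        0.1f); // tie-breaker multiplier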

Uwe

On 05.03.2025 at 20:19, Mikhail Khludnev wrote:
> Hello Trevor.
>
> Maintaining such a synonym map is too much of a burden.
> One idea: stick the words together with a "" separator using
> https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
> Another idea, the opposite: break the user's words apart via a dictionary with
> https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> However, this is really a suggester's duty, see
> https://lucene.apache.org/core//8_0_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html
> though that sits outside the main search flow.
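
(A sketch of the ShingleFilter idea above, applied at index time; 'input' is
the upstream token stream and the shingle sizes are illustrative:)

    import org.apache.lucene.analysis.shingle.ShingleFilter;

    // Emits e.g. "webserver" and "aswebserver" alongside the single words.
    ShingleFilter shingles = new ShingleFilter(input, 2, 3);
    shingles.setTokenSeparator(""); // glue shingled words with no separator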
>
>
> On Wed, Mar 5, 2025 at 5:28 PM Trevor Nicholls 
> <tre...@castingthevoid.com>
> wrote:
>
>> I don't know if I have completely the wrong idea or not, hopefully 
>> somebody can point out where I have got this wrong
>>
>>
>>
>> I am indexing technical documentation; the content contains strings
>> like "http_proxy_server". When building the index my analyzer breaks
>> this into the tokens "http", "proxy" and "server". It generates the
>> same tokens for "http.proxy.server"; constructions like this are also
>> common in the documents.
>>
>>
>>
>> At the moment the application is using Lucene 8.6.3.
>>
>>
>>
>> If the document contains "http_proxy_server" the user can search for
>> "http proxy server", "http.proxy.server" or "http_proxy_server" and
>> all will match.
>>
>>
>>
>> However, I am trying to construct the index and the search so that if 
>> the user searches for e.g. "http proxyserver" they also find a match. 
>> I thought it would be sufficient to add an entry to the synonym map 
>> specifying that "http proxy" and "httpproxy" are synonyms, and 
>> likewise "proxy server" and "proxyserver". (When adding multiple-word 
>> phrases the spaces are replaced by SynonymMap.WORD_SEPARATOR).
>>
>>
>>
>> The analyzer incorporates the synonym map when building the index, 
>> but not when searching - the synonyms (both words and phrases) should 
>> already be in the index so a user's search pattern should not need to 
>> be extended by them.
>>
>>
>>
>> Unfortunately this doesn't appear to be working as I expected. If a 
>> user searches for "httpproxy" or "proxyserver" nothing is matched.
>>
>> When I print the tokens in the stream emitted by the analyser, I can
>> see all the word-for-word synonyms in the output (e.g. if the content
>> contains "license", the emerging tokens include both "licence" and
>> "license"), but the phrase substitutions are not there. "http", "proxy"
>> and "server" are there, but none of the conjunctions appear.
>>
>>
>>
>> I don't think synonym replacement should be occurring at search time, 
>> if only for performance reasons, but what have I missed in how this 
>> should work? Am I chasing the impossible dream?
>>
>>
>>
>> cheers
>>
>> T
>>
>>
>>
>>
>>
>>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

