Hello Trevor.

Maintaining such a synonym map is too much of a burden.
One idea: stick adjacent words together with a "" separator, using
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
Another idea is the opposite: break the user's words apart via a dictionary
https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
However, breaking words is really a suggester's job
https://lucene.apache.org/core//8_0_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html
although that sits outside the main search flow.
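
To make the two ideas concrete, here is a rough plain-Java sketch. It does not
use Lucene at all: the class and method names (CompoundSketch, shingles,
breakWord) are made up for illustration, and the real filters differ in detail
(for example, they work on token streams and track positions). With
ShingleFilter the joining behaviour would presumably come from
setTokenSeparator("") plus suitable min/max shingle sizes. The word breaking
below is a simple "cover the whole word with dictionary entries" split, which
is only an approximation of what the dictionary filter or the word-break
spell checker does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; it does not use or reproduce the Lucene API.
final class CompoundSketch {

    // Idea 1: keep each token and also emit every run of minSize..maxSize
    // adjacent tokens joined with an empty separator.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            out.add(tokens.get(start)); // the original single token
            StringBuilder joined = new StringBuilder(tokens.get(start));
            for (int size = 2; size <= maxSize && start + size <= tokens.size(); size++) {
                joined.append(tokens.get(start + size - 1));
                if (size >= minSize) {
                    out.add(joined.toString());
                }
            }
        }
        return out;
    }

    // Idea 2: split a word into dictionary entries that cover it completely.
    // Returns an empty list when no such split exists.
    static List<String> breakWord(String word, Set<String> dictionary) {
        // parts.get(i) is one way to split word.substring(0, i), or null if none
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>());
        for (int end = 1; end <= word.length(); end++) {
            List<String> found = null;
            for (int start = 0; start < end && found == null; start++) {
                String piece = word.substring(start, end);
                if (parts.get(start) != null && dictionary.contains(piece)) {
                    found = new ArrayList<>(parts.get(start));
                    found.add(piece);
                }
            }
            parts.add(found);
        }
        List<String> result = parts.get(word.length());
        return result == null ? List.of() : result;
    }

    public static void main(String[] args) {
        // prints [http, httpproxy, httpproxyserver, proxy, proxyserver, server]
        System.out.println(shingles(List.of("http", "proxy", "server"), 2, 3));
        // prints [proxy, server]
        System.out.println(breakWord("proxyserver", Set.of("http", "proxy", "server")));
    }
}
```

The first approach grows the index (every adjacent pair and triple becomes an
extra term) but needs no dictionary; the second keeps the index as it is but
depends on having a usable word list at query time.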


On Wed, Mar 5, 2025 at 5:28 PM Trevor Nicholls <tre...@castingthevoid.com>
wrote:

> I don't know if I have completely the wrong idea or not; hopefully somebody
> can point out where I have got this wrong.
>
>
>
> I am indexing technical documentation; the content contains strings like
> "http_proxy_server". When building the index my analyzer breaks this into
> the tokens "http", "proxy" and "server". It generates the same tokens for
> "http.proxy.server"; constructions like this are also common in the
> documents.
>
>
>
> At the moment the application is using Lucene 8.6.3.
>
>
>
> If the document contains "http_proxy_server" the user can search for "http
> proxy server", "http.proxy.server" or "http proxy server" and all will
> match.
>
>
>
> However, I am trying to construct the index and the search so that if the
> user searches for e.g. "http proxyserver" they also find a match. I thought
> it would be sufficient to add an entry to the synonym map specifying that
> "http proxy" and "httpproxy" are synonyms, and likewise "proxy server" and
> "proxyserver". (When adding multiple-word phrases the spaces are replaced
> by SynonymMap.WORD_SEPARATOR).
>
>
>
> The analyzer incorporates the synonym map when building the index, but not
> when searching - the synonyms (both words and phrases) should already be in
> the index so a user's search pattern should not need to be extended by
> them.
>
>
>
> Unfortunately this doesn't appear to be working as I expected. If a user
> searches for "httpproxy" or "proxyserver" nothing is matched.
>
> When I print the tokens in the stream emitted by the analyser, I can see
> all the word-for-word synonyms output (e.g. if the content contains
> "license", the emerging tokens include both "licence" and "license"), but
> the phrase substitutions are not. "http", "proxy" and "server" are there,
> but none of the concatenations appear.
>
>
>
> I don't think synonym replacement should be occurring at search time, if
> only for performance reasons, but what have I missed in how this should
> work? Am I chasing the impossible dream?
>
>
>
> cheers
>
> T
>
>
>
>
>
>

-- 
Sincerely yours
Mikhail Khludnev
