Hi,
Another way to do this is to use WordDelimiterFilter with the "catenate"
options enabled. Be aware that you then need different tokenization (not
the standard tokenizer, but WhitespaceTokenizer, so that tokens like
"http_proxy_server" are not split apart before the filter sees them).
This approach is common for product numbers.
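A minimal sketch of such a chain, assuming the graph variant
WordDelimiterGraphFilter (for index-time use you may additionally want a
FlattenGraphFilter after it):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

// WhitespaceTokenizer keeps "http_proxy_server" intact, so the filter can
// split it into "http"/"proxy"/"server" and also emit "httpproxyserver".
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
              | WordDelimiterGraphFilter.CATENATE_WORDS
              | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
    TokenStream sink = new WordDelimiterGraphFilter(source, flags, null);
    return new TokenStreamComponents(source, sink);
  }
};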
To not break your "normal" analysis, it is often a good idea to have a
separate field with the "alternative" analysis (which also applies other
things like stemming) and to search both fields using a DisMax query,
ranking the field with the "normal" standard analysis higher.
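A sketch of such a two-field query (the field names "body" and "body_wdf"
are made up here):

import java.util.Arrays;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// DisMax takes the best score per document across both fields; the boost
// favors the "normal" field.
Query normal = new BoostQuery(new TermQuery(new Term("body", "proxyserver")), 2.0f);
Query alt    = new TermQuery(new Term("body_wdf", "proxyserver"));
Query dismax = new DisjunctionMaxQuery(Arrays.asList(normal, alt), 0.1f);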
Uwe
On 05.03.2025 at 20:19, Mikhail Khludnev wrote:
Hello Trevor.
Maintaining such a synonym map is too much of a burden.
One idea: stick words together with an empty ("") separator using
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
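Roughly like this (assuming "stream" already emits the individual words):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;

// With input tokens "http", "proxy", "server", 2-gram shingles with an
// empty separator add "httpproxy" and "proxyserver" to the stream.
ShingleFilter shingles = new ShingleFilter(stream, 2, 2);
shingles.setTokenSeparator("");   // join shingle parts with no gap
shingles.setOutputUnigrams(true); // keep the single words as well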
Another idea, the opposite: break the user's words via a dictionary with
https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
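For example (with a toy dictionary; "stream" is the incoming token stream):

import java.util.Arrays;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;

// A query token like "proxyserver" is decompounded into "proxy" and
// "server"; the original token is kept as well.
CharArraySet dict = new CharArraySet(Arrays.asList("http", "proxy", "server"), true);
TokenStream decompounded = new DictionaryCompoundWordTokenFilter(stream, dict);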
However, it's actually a suggester's duty:
https://lucene.apache.org/core//8_0_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html
though that runs aside from the main search flow.
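Roughly (the index path and field name are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.search.spell.WordBreakSpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Ask where "httpproxy" could be broken so that the parts exist in the
// index; this runs as a separate step, outside the main query.
Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
IndexReader reader = DirectoryReader.open(dir);
WordBreakSpellChecker wbsp = new WordBreakSpellChecker();
SuggestWord[][] breaks = wbsp.suggestWordBreaks(
    new Term("body", "httpproxy"), 5, reader,
    SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX,
    WordBreakSpellChecker.BreakSuggestionSortMethod.NUM_CHANGES_THEN_MAX_FREQUENCY);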
On Wed, Mar 5, 2025 at 5:28 PM Trevor Nicholls <tre...@castingthevoid.com>
wrote:
I don't know if I have completely the wrong idea or not; hopefully somebody
can point out where I have got this wrong.
I am indexing technical documentation; the content contains strings like
"http_proxy_server". When building the index my analyzer breaks this into
the tokens "http", "proxy" and "server". It generatees the same tokens for
"http.proxy.server"; constructions like this are also common in the
documents.
At the moment the application is using Lucene 8.6.3.
If the document contains "http_proxy_server" the user can search for "http
proxy server", "http.proxy.server" or "http_proxy_server" and all will
match.
However, I am trying to construct the index and the search so that if the
user searches for e.g. "http proxyserver" they also find a match. I thought
it would be sufficient to add an entry to the synonym map specifying that
"http proxy" and "httpproxy" are synonyms, and likewise "proxy server" and
"proxyserver". (When adding multiple-word phrases the spaces are replaced
by
SynonymMap.WORD_SEPARATOR).
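In code, the entries look roughly like this:

import java.io.IOException;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

// WORD_SEPARATOR stands in for the spaces of the multi-word input.
SynonymMap buildMap() throws IOException {
  SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup
  builder.add(new CharsRef("http" + SynonymMap.WORD_SEPARATOR + "proxy"),
              new CharsRef("httpproxy"), true);  // true = keep the original
  builder.add(new CharsRef("proxy" + SynonymMap.WORD_SEPARATOR + "server"),
              new CharsRef("proxyserver"), true);
  return builder.build();
}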
The analyzer incorporates the synonym map when building the index, but not
when searching - the synonyms (both words and phrases) should already be in
the index so a user's search pattern should not need to be extended by
them.
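A minimal sketch of an index-time chain of this kind, assuming
SynonymGraphFilter (whose graph output must be flattened before indexing):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

// Synonyms are applied in this (index-time) analyzer only; the search
// analyzer omits the synonym step.
Analyzer indexAnalyzer(SynonymMap map) {
  return new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new StandardTokenizer();
      TokenStream sink = new SynonymGraphFilter(source, map, true); // ignore case
      sink = new FlattenGraphFilter(sink);
      return new TokenStreamComponents(source, sink);
    }
  };
}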
Unfortunately this doesn't appear to be working as I expected. If a user
searches for "httpproxy" or "proxyserver" nothing is matched.
When I print the tokens in the stream emitted by the analyser, I can see
all the word-for-word synonyms in the output (e.g. if the content contains
"license", the emerging tokens include both "licence" and "license"), but
the phrase substitutions are not there. "http", "proxy" and "server" are
there, but none of the concatenated forms ("httpproxy", "proxyserver")
appear.
I don't think synonym replacement should be occurring at search time, if
only for performance reasons, but what have I missed in how this should
work? Am I chasing the impossible dream?
cheers
T
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de