Hello Trevor.

Maintaining such a synonym map is too much of a burden. One idea: glue adjacent words together with ShingleFilter, using "" as the token separator:
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html

Another idea is the opposite: break the user's words apart with a dictionary:
https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html

Strictly speaking, though, breaking words is a suggester's job:
https://lucene.apache.org/core//8_0_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html
but that sits outside the main search flow.
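To make the two ideas concrete, here is a minimal plain-Java sketch of the token streams they aim for. It deliberately does not use Lucene: the class and method names (CompoundSketch, glueAdjacent, breakWord) are hypothetical, and the word breaker is a simplification that only tries a single split point. In Lucene itself, the first idea roughly corresponds to a ShingleFilter configured with setTokenSeparator(""), and the second to a dictionary-driven filter or the word-break spell checker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain Java, not the Lucene API. All names here are hypothetical.
class CompoundSketch {

    // Idea 1: keep every token and also emit each adjacent pair
    // joined with an empty separator (a two-word "shingle").
    static List<String> glueAdjacent(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Idea 2: split a word into two dictionary words if some split point
    // yields two known words; otherwise return the word unchanged.
    static List<String> breakWord(String word, Set<String> dictionary) {
        for (int i = 1; i < word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return Arrays.asList(head, tail);
            }
        }
        return Arrays.asList(word);
    }

    public static void main(String[] args) {
        // Tokens the analyzer already produces for "http_proxy_server".
        List<String> tokens = Arrays.asList("http", "proxy", "server");
        // prints [http, httpproxy, proxy, proxyserver, server]
        System.out.println(glueAdjacent(tokens));

        Set<String> dictionary = new HashSet<>(tokens);
        // prints [proxy, server]
        System.out.println(breakWord("proxyserver", dictionary));
    }
}
```

With the first approach the glued forms "httpproxy" and "proxyserver" land in the index next to the original tokens, so a query for "proxyserver" can match without any synonym entries. With the second, the index stays as it is and the user's "proxyserver" is split into "proxy" and "server" before searching.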
On Wed, Mar 5, 2025 at 5:28 PM Trevor Nicholls <tre...@castingthevoid.com> wrote:

> I don't know if I have completely the wrong idea or not, hopefully somebody
> can point out where I have got this wrong
>
> I am indexing technical documentation; the content contains strings like
> "http_proxy_server". When building the index my analyzer breaks this into
> the tokens "http", "proxy" and "server". It generates the same tokens for
> "http.proxy.server"; constructions like this are also common in the
> documents.
>
> At the moment the application is using Lucene 8.6.3.
>
> If the document contains "http_proxy_server" the user can search for "http
> proxy server", "http.proxy.server" or "http proxy server" and all will
> match.
>
> However, I am trying to construct the index and the search so that if the
> user searches for e.g. "http proxyserver" they also find a match. I thought
> it would be sufficient to add an entry to the synonym map specifying that
> "http proxy" and "httpproxy" are synonyms, and likewise "proxy server" and
> "proxyserver". (When adding multiple-word phrases the spaces are replaced by
> SynonymMap.WORD_SEPARATOR).
>
> The analyzer incorporates the synonym map when building the index, but not
> when searching - the synonyms (both words and phrases) should already be in
> the index so a user's search pattern should not need to be extended by
> them.
>
> Unfortunately this doesn't appear to be working as I expected. If a user
> searches for "httpproxy" or "proxyserver" nothing is matched.
>
> When I print the tokens in the stream emitted by the analyser, I can see all
> the word for word synonyms output (e.g. if the content contains "license",
> the emerging tokens include both "licence" and "license"), but the phrase
> substitutions are not. "http", "proxy" and "server" are there, but none of
> the conjunctions appear.
>
> I don't think synonym replacement should be occurring at search time, if
> only for performance reasons, but what have I missed in how this should
> work? Am I chasing the impossible dream?
>
> cheers
> T

--
Sincerely yours
Mikhail Khludnev