In my work, I usually use an Automaton to convert "http proxy" or "http-proxy" into "httpproxy", which is more storage-efficient than a synonym. If you also want to search by "http" or "proxy" alone, one way would be to extend CompoundWordTokenFilterBase and break "httpproxy" back into "http proxy" (a Map-based implementation is straightforward and faster than a typical brute-force decomposition algorithm).
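A minimal sketch of that Map-based decompounder, assuming the Lucene 8.x org.apache.lucene.analysis.compound API; the class name and the contents of the splits map are hypothetical:

    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;

    // Map-based decompounder: looks each whole token up in a precomputed
    // map and, on a hit, emits the known parts as extra tokens.
    public final class MapDecompoundTokenFilter extends CompoundWordTokenFilterBase {

      // e.g. "httpproxy" -> {"http", "proxy"}; the parts must be contiguous
      // substrings of the compound so the offsets line up.
      private final Map<String, String[]> splits;

      public MapDecompoundTokenFilter(TokenStream input, Map<String, String[]> splits) {
        super(input, null, false); // no CharArraySet dictionary needed for a map lookup
        this.splits = splits;
      }

      @Override
      protected void decompose() {
        // Only called for tokens of at least minWordSize chars (5 by default).
        final String term = termAtt.toString();
        final String[] parts = splits.get(term);
        if (parts == null) {
          return; // not a known compound; the token passes through unchanged
        }
        int offset = 0;
        for (String part : parts) {
          tokens.add(new CompoundToken(offset, part.length()));
          offset += part.length();
        }
      }
    }

The base class always emits the original compound token first and then the subword tokens, so both "httpproxy" and "http"/"proxy" end up in the index.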
On Tue, Mar 11, 2025 at 3:27 PM Trevor Nicholls <tre...@castingthevoid.com> wrote:

> Hi Uwe/Mikhail
>
> (Note: you'll probably need to restore any line breaks your mailer has
> decided to filter out for the following to make sense.)
>
> At the moment the content is analysed twice, so I have a "text" field and
> an "exactText" field.
>
> The text field uses a custom analyzer. This is because none of the
> standard analysers really matches the content well; imagine our content
> containing things like:
>
>   "string 1" + "string 2"
>   price * 1.025/decimalplaces=3
>   customer:title+" "+first_name/newline
>   http_proxy_server
>   license.server.address
>   act as web server
>
> The custom analyzer splits the input with a PatternTokenizer, and includes
> a LowerCaseFilter and a custom filter which I have called ClipWordFilter.
> The pattern tokenizer basically retains all punctuation but splits words
> at delimiters like full stops, underscores and hyphens. The tokens it
> emits include these trailing delimiters, and the ClipWordFilter then
> duplicates the tokens without them.
> Thus, for example, http_proxy_server produces the tokens [http_] [http]
> [proxy_] [proxy] and [server].
>
> The exactText field is analysed using a TextAnalyzer, which combines a
> WhitespaceTokenizer and a LowerCaseFilter. Thus http_proxy_server produces
> one token, [http_proxy_server].
>
> The text index means that somebody searching for proxy.server,
> proxy_server, or proxy server will find matches for any of those forms.
> But because my search method boosts exactText matches, the exact form
> they search for will be scored higher.
>
> Because the custom analyzer produces word tokens I have incorporated
> synonyms (when indexing). The synonyms are bidirectional.
>
> OK, that's all background.
>
> The problem I am trying to fix at the moment is that searching for these
> compounds works provided there are delimiters, but fails if any of them
> is omitted. So, as a trivial example, searching for "act as web server"
> or for "act_as_web_server" will find all the targets, but searching for
> "act as webserver" will not.
>
> I have a list of synonyms loaded by the analyzer which adds every
> possible breakdown of that phrase. Thus if I log the activity when
> building the synonym map I can see the following:
>
>   act as = actas
>   actas = act as
>   act as web = actasweb
>   actasweb = act as web
>   act as web server = actaswebserver
>   actaswebserver = act as web server
>   as web = asweb
>   asweb = as web
>   as web server = aswebserver
>   aswebserver = as web server
>   web server = webserver
>   webserver = web server
>
> If I analyze a document which contains the single line: act as web server
> these are the tokens generated:
>
>   [actaswebserver]
>   [act]
>   [as]
>   [web]
>   [server]
>
> So I've got one synonym, but not all the possibilities.
>
> If the document contains: act_as_web_server
> then the tokens are:
>
>   [act]
>   [act_]
>   [as]
>   [as_]
>   [web]
>   [web_]
>   [server]
>
> i.e. no synonyms at all.
>
> In either case a user searching for the phrase but specifying 'webserver'
> will not find it.
>
> Just to demonstrate that this is only a problem with multi-word synonyms,
> I repeated the exercise with a different synonym table:
>
>   act = work
>   as = like
>   web = internet
>   server = host
>
> The input: [act as web server] generates:
>
>   [work]
>   [act]
>   [like]
>   [as]
>   [internet]
>   [web]
>   [host]
>   [server]
>
> and the input: [act_as_web_server] generates:
>
>   [work]
>   [act]
>   [act_]
>   [like]
>   [as]
>   [as_]
>   [internet]
>   [web]
>   [web_]
>   [host]
>   [server]
>
> So with single-word synonyms I'm getting everything I would expect, and a
> user searching for any particular combination of synonyms will find it.
>
> I had hoped for the same with phrases. But maybe my expectations are too
> high, or maybe I'm just doing it wrong.
>
> cheers
> T
>
>
> -----Original Message-----
> From: Uwe Schindler <u...@thetaphi.de>
> Sent: Monday, 10 March 2025 23:38
> To: java-user@lucene.apache.org
> Subject: Re: Synonyms and searching
>
> Hi,
>
> Another way to do this is to use the Word Delimiter Filter with its
> "catenate" options. Be aware that you need special text tokenization
> (not the standard tokenizer; use WhitespaceTokenizer instead). This
> approach is common for product numbers.
>
> To avoid breaking your "normal" analysis, it is often a good idea to
> have a separate field with the "alternative" analysis, which also
> applies other things like stemming, and to search both fields using a
> DisMax query, ranking the "normal" standard-analyzer-based field higher.
>
> Uwe
>
> Am 05.03.2025 um 20:19 schrieb Mikhail Khludnev:
> > Hello Trevor.
> >
> > Maintaining such a synonym map is too much of a burden.
> > One idea: stick words together with a "" separator via
> > https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
> > Another idea, the opposite: break the user's words against a dictionary via
> > https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> > However, it's really a suggester's duty:
> > https://lucene.apache.org/core//8_0_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html
> > though that sits outside the main search flow.
> >
> >
> > On Wed, Mar 5, 2025 at 5:28 PM Trevor Nicholls <tre...@castingthevoid.com>
> > wrote:
> >
> >> I don't know if I have completely the wrong idea or not; hopefully
> >> somebody can point out where I have got this wrong.
> >>
> >> I am indexing technical documentation; the content contains strings
> >> like "http_proxy_server". When building the index my analyzer breaks
> >> this into the tokens "http", "proxy" and "server". It generates the
> >> same tokens for "http.proxy.server"; constructions like this are also
> >> common in the documents.
> >>
> >> At the moment the application is using Lucene 8.6.3.
> >>
> >> If the document contains "http_proxy_server" the user can search for
> >> "http_proxy_server", "http.proxy.server" or "http proxy server" and
> >> all will match.
> >>
> >> However, I am trying to construct the index and the search so that if
> >> the user searches for e.g. "http proxyserver" they also find a match.
> >> I thought it would be sufficient to add an entry to the synonym map
> >> specifying that "http proxy" and "httpproxy" are synonyms, and
> >> likewise "proxy server" and "proxyserver". (When adding multiple-word
> >> phrases the spaces are replaced by SynonymMap.WORD_SEPARATOR.)
> >>
> >> The analyzer incorporates the synonym map when building the index,
> >> but not when searching - the synonyms (both words and phrases) should
> >> already be in the index, so a user's search pattern should not need
> >> to be extended by them.
> >>
> >> Unfortunately this doesn't appear to be working as I expected. If a
> >> user searches for "httpproxy" or "proxyserver" nothing is matched.
> >>
> >> When I print the tokens in the stream emitted by the analyser, I can
> >> see all the word-for-word synonyms in the output (e.g. if the content
> >> contains "license", the emerging tokens include both "licence" and
> >> "license"), but the phrase substitutions are not there. "http",
> >> "proxy" and "server" are there, but none of the conjunctions appear.
> >>
> >> I don't think synonym replacement should be occurring at search time,
> >> if only for performance reasons, but what have I missed in how this
> >> should work? Am I chasing the impossible dream?
> >>
> >> cheers
> >>
> >> T
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
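For reference, a minimal sketch of the multi-word synonym setup Trevor describes, with spaces replaced by SynonymMap.WORD_SEPARATOR on the multi-token side. The wiring (WhitespaceTokenizer, SynonymGraphFilter) and the class name are assumptions, and only one synonym pair is shown:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.FlattenGraphFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class MultiWordSynonyms {

      static SynonymMap buildMap() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries

        // Multi-token entries use SynonymMap.WORD_SEPARATOR between the words.
        CharsRef webServer = new CharsRef("web" + SynonymMap.WORD_SEPARATOR + "server");
        CharsRef webserver = new CharsRef("webserver");
        builder.add(webServer, webserver, true); // keepOrig = true
        builder.add(webserver, webServer, true); // and the reverse direction

        return builder.build();
      }

      static Analyzer indexAnalyzer(SynonymMap map) {
        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new WhitespaceTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            stream = new SynonymGraphFilter(stream, map, true); // ignoreCase
            // SynonymGraphFilter emits a token graph; at index time the
            // graph must be flattened, otherwise multi-token synonyms
            // end up with wrong positions.
            stream = new FlattenGraphFilter(stream);
            return new TokenStreamComponents(source, stream);
          }
        };
      }
    }

Two properties of the synonym matcher are consistent with the behaviour Trevor reports: it emits only the single longest match at each position (hence [actaswebserver] but not [actas] or [webserver]), and it needs to see the multi-token input as an exact consecutive sequence, so stacked variants like [act_] injected by a preceding filter prevent "act as" from matching at all.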
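Uwe's Word Delimiter suggestion might be wired up like this; the exact flag combination is an assumption:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.FlattenGraphFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

    // "Alternative" analysis for a separate field, as Uwe suggests.
    public class AlternativeFieldAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer(); // not StandardTokenizer
        int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS // http, proxy, server
                  | WordDelimiterGraphFilter.CATENATE_WORDS      // httpproxyserver
                  | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;  // http_proxy_server
        TokenStream stream = new WordDelimiterGraphFilter(source, flags, null);
        // Like SynonymGraphFilter, this produces a graph; flatten it when
        // the analyzer is used at index time.
        stream = new FlattenGraphFilter(stream);
        return new TokenStreamComponents(source, stream);
      }
    }

With these flags, http_proxy_server indexes as the parts, the fully catenated form and the original, so a query for "httpproxyserver" matches. Note that CATENATE_WORDS joins only the whole run of parts; partial joins like "proxyserver" would still need one of the other approaches.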
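Mikhail's opposite direction, decompounding against a dictionary, could be applied at query time so that a user's "webserver" also produces "web" and "server". A sketch with a toy dictionary (in practice the dictionary would be derived from the indexed vocabulary):

    import java.util.Arrays;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class QueryDecompoundAnalyzer extends Analyzer {

      private static final CharArraySet DICT =
          new CharArraySet(Arrays.asList("act", "as", "web", "server", "http", "proxy"), true);

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Keeps the original token and adds any dictionary words found
        // inside it, e.g. "webserver" -> [webserver] [web] [server].
        // Tokens shorter than the default minWordSize of 5 pass through.
        stream = new DictionaryCompoundWordTokenFilter(stream, DICT);
        return new TokenStreamComponents(source, stream);
      }
    }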