In my work, I usually use an Automaton to convert "http proxy" or "http-proxy" into "httpproxy", which is more storage-efficient than a synonym. If you also want to search by "http" or "proxy" alone, one way would be to extend CompoundWordTokenFilterBase and break "httpproxy" back into "http proxy" (a Map-based implementation is straightforward and faster than a typical brute-force decomposition algorithm).
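A minimal sketch of that Map-based decompounder, assuming the Lucene 8.x org.apache.lucene.analysis.compound API; the class name and the contents of the splits map are hypothetical:

    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;

    // Map-based decompounder: looks each whole token up in a precomputed
    // map and, on a hit, emits the known parts as extra tokens.
    public final class MapDecompoundTokenFilter extends CompoundWordTokenFilterBase {

      // e.g. "httpproxy" -> {"http", "proxy"}; the parts must be contiguous
      // substrings of the compound so the offsets line up.
      private final Map<String, String[]> splits;

      public MapDecompoundTokenFilter(TokenStream input, Map<String, String[]> splits) {
        super(input, null, false); // no CharArraySet dictionary needed for a map lookup
        this.splits = splits;
      }

      @Override
      protected void decompose() {
        // Only called for tokens of at least minWordSize chars (5 by default).
        final String term = termAtt.toString();
        final String[] parts = splits.get(term);
        if (parts == null) {
          return; // not a known compound; the token passes through unchanged
        }
        int offset = 0;
        for (String part : parts) {
          tokens.add(new CompoundToken(offset, part.length()));
          offset += part.length();
        }
      }
    }

The base class always emits the original compound token first and then the subword tokens, so both "httpproxy" and "http"/"proxy" end up in the index.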
On Tue, Mar 11, 2025 at 3:27 PM Trevor Nicholls <tre...@castingthevoid.com> wrote:

> Hi Uwe/Mikhail
>
> (Note: you'll probably need to restore any line breaks your mailer has
> decided to filter out for the following to make sense.)
>
> At the moment the content is analysed twice, so I have a "text" field and
> an "exactText" field.
>
> The text field uses a custom analyzer. This is because none of the
> standard analysers really matches the content well; imagine our content
> containing things like:
>
>   "string 1" + "string 2"
>   price * 1.025/decimalplaces=3
>   customer:title+" "+first_name/newline
>   http_proxy_server
>   license.server.address
>   act as web server
>
> The custom analyzer splits the input with a PatternTokenizer, and includes
> a LowerCaseFilter and a custom filter which I have called ClipWordFilter.
> The pattern tokenizer basically retains all punctuation but splits words
> at delimiters like full stops, underscores and hyphens. The tokens it
> emits include these trailing delimiters, and the ClipWordFilter then
> duplicates the tokens without them.
> Thus, for example, http_proxy_server produces the tokens [http_] [http]
> [proxy_] [proxy] and [server].
>
> The exactText field is analysed using a TextAnalyzer, which combines a
> WhitespaceTokenizer and a LowerCaseFilter. Thus http_proxy_server produces
> one token, [http_proxy_server].
>
> The text index means that somebody searching for proxy.server,
> proxy_server, or proxy server will find matches for any of those forms.
> But because my search method boosts exactText matches, the exact form
> they search for will be scored higher.
>
> Because the custom analyzer produces word tokens I have incorporated
> synonyms (when indexing). The synonyms are bidirectional.
>
> OK, that's all background.
>
> The problem I am trying to fix at the moment is that searching for these
> compounds works provided there are delimiters, but fails if any of them
> is omitted. So, as a trivial example, searching for "act as web server"
> or for "act_as_web_server" will find all the targets, but searching for
> "act as webserver" will not.
>
> I have a list of synonyms loaded by the analyzer which adds every
> possible breakdown of that phrase. Thus if I log the activity when
> building the synonym map I can see the following:
>
>   act as = actas
>   actas = act as
>   act as web = actasweb
>   actasweb = act as web
>   act as web server = actaswebserver
>   actaswebserver = act as web server
>   as web = asweb
>   asweb = as web
>   as web server = aswebserver
>   aswebserver = as web server
>   web server = webserver
>   webserver = web server
>
> If I analyze a document which contains the single line: act as web server
> these are the tokens generated:
>
>   [actaswebserver]
>   [act]
>   [as]
>   [web]
>   [server]
>
> So I've got one synonym, but not all the possibilities.
>
> If the document contains: act_as_web_server
> then the tokens are:
>
>   [act]
>   [act_]
>   [as]
>   [as_]
>   [web]
>   [web_]
>   [server]
>
> i.e. no synonyms at all.
>
> In either case a user searching for the phrase but specifying 'webserver'
> will not find it.
>
> Just to demonstrate that this is only a problem with multi-word synonyms,
> I repeated the exercise with a different synonym table:
>
>   act = work
>   as = like
>   web = internet
>   server = host
>
> The input: [act as web server] generates:
>
>   [work]
>   [act]
>   [like]
>   [as]
>   [internet]
>   [web]
>   [host]
>   [server]
>
> and the input: [act_as_web_server] generates:
>
>   [work]
>   [act]
>   [act_]
>   [like]
>   [as]
>   [as_]
>   [internet]
>   [web]
>   [web_]
>   [host]
>   [server]
>
> So with single-word synonyms I'm getting everything I would expect, and a
> user searching for any particular combination of synonyms will find it.
>
> I had hoped for the same with phrases. But maybe my expectations are too
> high, or maybe I'm just doing it wrong.
>
> cheers
> T
>
>
> -----Original Message-----
> From: Uwe Schindler <u...@thetaphi.de>
> Sent: Monday, 10 March 2025 23:38
> To: java-user@lucene.apache.org
> Subject: Re: Synonyms and searching
>
> Hi,
>
> Another way to do this is to use the Word Delimiter Filter with its
> "catenate" options. Be aware that you need special text tokenization
> (not the standard tokenizer; use WhitespaceTokenizer instead). This
> approach is common for product numbers.
>
> To avoid breaking your "normal" analysis, it is often a good idea to
> have a separate field with the "alternative" analysis, which also
> applies other things like stemming, and to search both fields using a
> DisMax query, ranking the "normal" standard-analyzer-based field higher.
>
> Uwe
>
> Am 05.03.2025 um 20:19 schrieb Mikhail Khludnev:
> > Hello Trevor.
> >
> > Maintaining such a synonym map is too much of a burden.
> > One idea: stick words together with a "" separator via
> > https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
> > Another idea, the opposite: break the user's words against a dictionary via
> > https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> > However, it's really a suggester's duty:
> > https://lucene.apache.org/core//8_0_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html
> > though that sits outside the main search flow.
> >
> >
> > On Wed, Mar 5, 2025 at 5:28 PM Trevor Nicholls <tre...@castingthevoid.com>
> > wrote:
> >
> >> I don't know if I have completely the wrong idea or not; hopefully
> >> somebody can point out where I have got this wrong.
> >>
> >> I am indexing technical documentation; the content contains strings
> >> like "http_proxy_server". When building the index my analyzer breaks
> >> this into the tokens "http", "proxy" and "server". It generates the
> >> same tokens for "http.proxy.server"; constructions like this are also
> >> common in the documents.
> >>
> >> At the moment the application is using Lucene 8.6.3.
> >>
> >> If the document contains "http_proxy_server" the user can search for
> >> "http_proxy_server", "http.proxy.server" or "http proxy server" and
> >> all will match.
> >>
> >> However, I am trying to construct the index and the search so that if
> >> the user searches for e.g. "http proxyserver" they also find a match.
> >> I thought it would be sufficient to add an entry to the synonym map
> >> specifying that "http proxy" and "httpproxy" are synonyms, and
> >> likewise "proxy server" and "proxyserver". (When adding multiple-word
> >> phrases the spaces are replaced by SynonymMap.WORD_SEPARATOR.)
> >>
> >> The analyzer incorporates the synonym map when building the index,
> >> but not when searching - the synonyms (both words and phrases) should
> >> already be in the index, so a user's search pattern should not need
> >> to be extended by them.
> >>
> >> Unfortunately this doesn't appear to be working as I expected. If a
> >> user searches for "httpproxy" or "proxyserver" nothing is matched.
> >>
> >> When I print the tokens in the stream emitted by the analyser, I can
> >> see all the word-for-word synonyms in the output (e.g. if the content
> >> contains "license", the emerging tokens include both "licence" and
> >> "license"), but the phrase substitutions are not there. "http",
> >> "proxy" and "server" are there, but none of the conjunctions appear.
> >>
> >> I don't think synonym replacement should be occurring at search time,
> >> if only for performance reasons, but what have I missed in how this
> >> should work? Am I chasing the impossible dream?
> >>
> >> cheers
> >>
> >> T
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
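For reference, a minimal sketch of the multi-word synonym setup Trevor describes, with spaces replaced by SynonymMap.WORD_SEPARATOR on the multi-token side. The wiring (WhitespaceTokenizer, SynonymGraphFilter) and the class name are assumptions, and only one synonym pair is shown:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.FlattenGraphFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class MultiWordSynonyms {

      static SynonymMap buildMap() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries

        // Multi-token entries use SynonymMap.WORD_SEPARATOR between the words.
        CharsRef webServer = new CharsRef("web" + SynonymMap.WORD_SEPARATOR + "server");
        CharsRef webserver = new CharsRef("webserver");
        builder.add(webServer, webserver, true); // keepOrig = true
        builder.add(webserver, webServer, true); // and the reverse direction

        return builder.build();
      }

      static Analyzer indexAnalyzer(SynonymMap map) {
        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new WhitespaceTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            stream = new SynonymGraphFilter(stream, map, true); // ignoreCase
            // SynonymGraphFilter emits a token graph; at index time the
            // graph must be flattened, otherwise multi-token synonyms
            // end up with wrong positions.
            stream = new FlattenGraphFilter(stream);
            return new TokenStreamComponents(source, stream);
          }
        };
      }
    }

Two properties of the synonym matcher are consistent with the behaviour Trevor reports: it emits only the single longest match at each position (hence [actaswebserver] but not [actas] or [webserver]), and it needs to see the multi-token input as an exact consecutive sequence, so stacked variants like [act_] injected by a preceding filter prevent "act as" from matching at all.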
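Uwe's Word Delimiter suggestion might be wired up like this; the exact flag combination is an assumption:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.FlattenGraphFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

    // "Alternative" analysis for a separate field, as Uwe suggests.
    public class AlternativeFieldAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer(); // not StandardTokenizer
        int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS // http, proxy, server
                  | WordDelimiterGraphFilter.CATENATE_WORDS      // httpproxyserver
                  | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;  // http_proxy_server
        TokenStream stream = new WordDelimiterGraphFilter(source, flags, null);
        // Like SynonymGraphFilter, this produces a graph; flatten it when
        // the analyzer is used at index time.
        stream = new FlattenGraphFilter(stream);
        return new TokenStreamComponents(source, stream);
      }
    }

With these flags, http_proxy_server indexes as the parts, the fully catenated form and the original, so a query for "httpproxyserver" matches. Note that CATENATE_WORDS joins only the whole run of parts; partial joins like "proxyserver" would still need one of the other approaches.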
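Mikhail's opposite direction, decompounding against a dictionary, could be applied at query time so that a user's "webserver" also produces "web" and "server". A sketch with a toy dictionary (in practice the dictionary would be derived from the indexed vocabulary):

    import java.util.Arrays;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class QueryDecompoundAnalyzer extends Analyzer {

      private static final CharArraySet DICT =
          new CharArraySet(Arrays.asList("act", "as", "web", "server", "http", "proxy"), true);

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Keeps the original token and adds any dictionary words found
        // inside it, e.g. "webserver" -> [webserver] [web] [server].
        // Tokens shorter than the default minWordSize of 5 pass through.
        stream = new DictionaryCompoundWordTokenFilter(stream, DICT);
        return new TokenStreamComponents(source, stream);
      }
    }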