I don't know whether I have completely the wrong idea here; hopefully
somebody can point out where I have gone wrong.

 

I am indexing technical documentation; the content contains strings like
"http_proxy_server". When building the index my analyzer breaks this into
the tokens "http", "proxy" and "server". It generates the same tokens for
"http.proxy.server"; constructions like this are also common in the
documents.
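
To make the tokenization concrete, here is a minimal plain-Java sketch of the
behaviour described above. It is only a stand-in for illustration, not my
actual Lucene analyzer: the class name SeparatorTokenizerSketch and the rule
"split on anything that is not a letter or digit" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for the analyzer's splitting behaviour; not Lucene code. */
final class SeparatorTokenizerSketch {

    /** Splits the text on every run of characters that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            // split() yields an empty first piece when the text starts with a separator
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both spellings produce the same three tokens
        System.out.println(tokenize("http_proxy_server")); // [http, proxy, server]
        System.out.println(tokenize("http.proxy.server")); // [http, proxy, server]
    }
}
```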

 

At the moment the application is using Lucene 8.6.3. 

 

If the document contains "http_proxy_server" the user can search for "http
proxy server" or "http.proxy.server" and both will match.

 

However, I am trying to construct the index and the search so that if the
user searches for e.g. "http proxyserver" they also find a match. I thought
it would be sufficient to add an entry to the synonym map specifying that
"http proxy" and "httpproxy" are synonyms, and likewise "proxy server" and
"proxyserver". (When adding multiple-word phrases the spaces are replaced by
SynonymMap.WORD_SEPARATOR).
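
As a sketch of what I expected the index to end up containing, here is the
idea in plain Java. This is not the Lucene API and not my real code: the class
PhraseSynonymSketch and its expand method are hypothetical, and the sketch
simply places each single-word synonym at the position where its phrase
starts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrates index-time phrase-synonym expansion; a stand-in, not Lucene code. */
final class PhraseSynonymSketch {

    /** Returns, for each token position, the original token plus any phrase synonym starting there. */
    static List<Set<String>> expand(List<String> tokens, Map<List<String>, String> phraseSynonyms) {
        List<Set<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            Set<String> atPosition = new LinkedHashSet<>();
            atPosition.add(token);
            positions.add(atPosition);
        }
        for (Map.Entry<List<String>, String> rule : phraseSynonyms.entrySet()) {
            List<String> phrase = rule.getKey();
            // Slide over the token list and look for the whole phrase
            for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
                if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                    positions.get(start).add(rule.getValue());
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<List<String>, String> rules = new LinkedHashMap<>();
        rules.put(List.of("http", "proxy"), "httpproxy");
        rules.put(List.of("proxy", "server"), "proxyserver");
        // prints [[http, httpproxy], [proxy, proxyserver], [server]]
        System.out.println(expand(List.of("http", "proxy", "server"), rules));
    }
}
```

With "proxyserver" stored at the same position as "proxy", a search for
"http proxyserver" would find "http" immediately followed by "proxyserver",
which is the match I am hoping for.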

 

The analyzer incorporates the synonym map when building the index, but not
when searching - the synonyms (both words and phrases) should already be in
the index so a user's search pattern should not need to be extended by them.

 

Unfortunately this doesn't appear to be working as I expected. If a user
searches for "httpproxy" or "proxyserver" nothing is matched.

When I print the tokens in the stream emitted by the analyzer, I can see all
the word-for-word synonyms (e.g. if the content contains "license", the
emitted tokens include both "licence" and "license"), but the phrase
substitutions are missing. "http", "proxy" and "server" are there, but none
of the concatenated forms ("httpproxy", "proxyserver") appear.

 

I don't think synonym replacement should be occurring at search time, if
only for performance reasons, but what have I missed in how this should
work? Am I chasing the impossible dream?

 

cheers

T

  

 
