Hi, thanks for such a quick response!

No, I hadn't thought of that. For which of the following would I need to do 
this:
- synonym map creation
- analyzing text for indexing
- analyzing text for querying

If either of the latter two then I can see lots of complications ensuing; it 
more or less makes a synonym map redundant if I have to manually parse the text 
and identify all the potential synonyms in advance. I may be missing something 
critical, of course.

cheers
T

-----Original Message-----
From: Bernd Fehling <bernd.fehl...@uni-bielefeld.de> 
Sent: Tuesday, 15 March 2022 04:16
To: java-user@lucene.apache.org
Subject: Re: synonym question

Hello,

just a guess, have you tried escaping the space in your multi-word terms with 
backslash?

isoweek,iso\ week
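
If it helps, here is a minimal sketch (plain Java, no Lucene dependency) of
what honouring that backslash could look like when the synonyms file is
loaded. SynonymLineParser, parseTerms and words are hypothetical names, and I
am assuming the file holds one comma-delimited group per line. The point is
only that an escaped space stays inside one term, which can then be broken
into the separate words the tokenizer would emit. As far as I know,
SynonymMap.Builder.join is the Lucene call that combines such words into a
single multi-token entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: splits one synonyms line into terms. */
public class SynonymLineParser {

    /**
     * Splits a line on unescaped commas. A backslash escapes the next
     * character, so "iso\ week" yields the single term "iso week".
     */
    public static List<String> parseTerms(String line) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // keep the escaped character literally
            } else if (c == ',') {
                addTerm(terms, current);          // unescaped comma ends the term
            } else {
                current.append(c);
            }
        }
        addTerm(terms, current);
        return terms;
    }

    /** Stores the trimmed term if it is non-empty, then resets the buffer. */
    private static void addTerm(List<String> terms, StringBuilder current) {
        String term = current.toString().trim();
        if (!term.isEmpty()) {
            terms.add(term);
        }
        current.setLength(0);
    }

    /** Breaks a term into the individual words a whitespace tokenizer would emit. */
    public static String[] words(String term) {
        return term.trim().split("\\s+");
    }

    public static void main(String[] args) {
        for (String term : parseTerms("isoweek,iso\\ week")) {
            System.out.println(term + " -> " + String.join("|", words(term)));
        }
    }
}
```

This only touches synonym map creation; the text being indexed or queried
would not need any escaping under this assumption.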

Regards
Bernd


On 14.03.22 at 15:54, Trevor Nicholls wrote:
> I have technical data which I am querying with Lucene; one of the 
> features of the content is that a large number of technical terms may 
> be written as multiple words or as a compound word. For example, 
> ISOWEEK or ISO WEEK. Or SynonymFilter or synonym filter.
> 
>   
> 
> I have a synonym table which includes all of these phrases, thus 
> isoweek=iso week, isoyear=iso year, etc.
> 
>   
> 
> My understanding is that including the synonyms (with a SynonymFilter in my 
> analyzer) at index time means that I shouldn't have to include the 
> synonym filter in the query analyzer, because if any of the synonyms 
> appear in a query they will match records containing any of the 
> synonymous terms, as all values are indexed for any one of them.
> 
>   
> 
> Checking with Luke, this appears to be the case; however, the queries 
> are not matching all the records I expect them to, so I am taking a deeper 
> look.
> 
>   
> 
> In the indexing phase, input text is tokenised on whitespace and 
> punctuation, lowercased, and then processed by a synonym filter. The 
> relevant part of the analyzer is this:
> 
>   
> 
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         // Split on whitespace, then lowercase and apply the custom TechTokenFilter.
>         WhitespaceTokenizer src = new WhitespaceTokenizer();
>         TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
>         // Inject synonyms (ignoreCase = true), then flatten the graph for indexing.
>         result = new SynonymGraphFilter(result,
>                 getSynonyms(options.getSynonymsList()), true);
>         result = new FlattenGraphFilter(result);
>         return new TokenStreamComponents(src, result);
>     }
> 
>   
> 
> The getSynonyms method builds a synonym map from a comma-delimited 
> text file and I know this is working because all the one-word synonym 
> replacements index and search perfectly. The problem I have is with synonym 
> phrases.
> 
>   
> 
> So if the synonyms input file contains
> 
>   
> 
>    isoweek,isodate
> 
>   
> 
> then (using Luke) I can see that any document containing either 
> 'isoweek' or 'isodate' has indexed both terms, and a search with 
> either term returns matching results for both. Great.
> 
>   
> 
> However if the input file contains
> 
>   
> 
>    isoweek,iso week
> 
>   
> 
> then (again using Luke) I can see that while any document containing 
> 'isoweek' has indexed the terms 'isoweek', 'iso' and 'week', 
> unfortunately any document containing 'iso week' has only indexed 'iso' and 
> 'week'.
> 
>   
> 
> Am I chasing the impossible here? Is there something I can do in the 
> query analyzer to make it work? (Currently the query analyzer is the 
> same as the indexing analyzer with the SynonymGraphFilter and 
> FlattenGraphFilter omitted.) Or do I have to manually pre-process any 
> query to include OR options for all phrase synonyms?
> 
>   
> 
> I haven't produced a small test case for this because I'm hoping a 
> high level discussion is all I need to put me on the right track.
> 
>   
> 
> cheers
> 
> T
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

