Just to confirm, escaping the spaces in synonym table construction, query construction, or both, does not solve the problem.
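
For reference, here is a minimal sketch of the manual pre-processing fallback raised in the original question quoted below (rewriting a query so each phrase synonym becomes an explicit OR group). It is plain Java and does not call Lucene at all. The class name PhraseSynonymExpander and its methods addLine and expand are hypothetical names made up for this illustration. It also assumes the rewritten string is later handed to something that understands the classic query parser syntax (parentheses, OR, double-quoted phrases). It is one possible workaround, not a confirmed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical helper (not part of Lucene): rewrites a query string so that
 * every known single-word or multi-word variant becomes an explicit OR group.
 */
final class PhraseSynonymExpander {

    // Maps each lowercased variant to the full list of its equivalent variants.
    private final Map<String, List<String>> groups = new LinkedHashMap<>();

    /** Registers one line of a comma-delimited synonyms file, e.g. "isoweek,iso week". */
    void addLine(String line) {
        List<String> variants = new ArrayList<>();
        for (String raw : line.split(",")) {
            String variant = raw.trim().toLowerCase(Locale.ROOT);
            if (!variant.isEmpty()) {
                variants.add(variant);
            }
        }
        for (String variant : variants) {
            groups.put(variant, variants);
        }
    }

    /** Splits the query on whitespace and expands the longest known variant at each position. */
    String expand(String query) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < words.length) {
            int matchedLen = 0;
            List<String> variants = null;
            // Try the longest candidate phrase first so "iso week" wins over "iso".
            for (int len = words.length - i; len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                List<String> group = groups.get(candidate);
                if (group != null) {
                    matchedLen = len;
                    variants = group;
                    break;
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (variants == null) {
                // Not a known variant: pass the word through unchanged.
                out.append(words[i]);
                i++;
            } else {
                out.append('(');
                for (int k = 0; k < variants.size(); k++) {
                    if (k > 0) {
                        out.append(" OR ");
                    }
                    String v = variants.get(k);
                    // Multi-word variants are quoted so they are read as phrases.
                    out.append(v.contains(" ") ? "\"" + v + "\"" : v);
                }
                out.append(')');
                i += matchedLen;
            }
        }
        return out.toString();
    }
}
```

With the line "isoweek,iso week" registered, expand("ISO week format") returns (isoweek OR "iso week") format, and a word that is not in the table passes through unchanged.
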

-----Original Message-----
From: Trevor Nicholls <tre...@castingthevoid.com>
Sent: Tuesday, 15 March 2022 05:02
To: java-user@lucene.apache.org
Subject: RE: synonym question

Hi, thanks for such a quick response!

No, I hadn't thought of that. In how many of the following would I need to
do this:
- synonym map creation
- analyzing text for indexing
- analyzing text for querying

If either of the latter two then I can see lots of complications ensuing;
it more or less makes a synonym map redundant if I have to manually parse
the text and identify all the potential synonyms in advance.

I may be missing something critical, of course.

cheers
T

-----Original Message-----
From: Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Sent: Tuesday, 15 March 2022 04:16
To: java-user@lucene.apache.org
Subject: Re: synonym question

Hello,

just a guess, have you tried escaping the space in your multi-word terms
with backslash?

isoweek,iso\ week

Regards
Bernd

On 14.03.22 at 15:54, Trevor Nicholls wrote:
> I have technical data which I am querying with Lucene; one of the
> features of the content is that a large number of technical terms may
> be written as multiple words or as a compound word. For example,
> ISOWEEK or ISO WEEK. Or SynonymFilter or synonym filter.
>
> I have a synonym table which includes all of these phrases, thus
> isoweek=iso week, isoyear=iso year, etc.
>
> My understanding is that including the synonyms (with a SynonymFilter
> in my analyzer) at index time means that I shouldn't have to include
> the synonym filter in the query analyzer because if any of the synonyms
> appear in a query they will match records containing any of the
> synonymous terms, as all values are indexed for any one of them.
>
> Checking with Luke, this appears to be the case, however the queries
> are not matching all the records I expect them to, so I am taking a
> deeper look.
>
> In the indexing phase, input text is tokenised on whitespace and
> punctuation, lowercased, and then processed by a synonym filter. The
> relevant part of the analyzer is this:
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         WhitespaceTokenizer src = new WhitespaceTokenizer();
>         TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
>         result = new SynonymGraphFilter(result,
>                 getSynonyms(options.getSynonymsList()), Boolean.TRUE);
>         result = new FlattenGraphFilter(result);
>         return new TokenStreamComponents(src, result);
>     }
>
> The getSynonyms method builds a synonym map from a comma-delimited
> text file and I know this is working because all the one-word synonym
> replacements index and search perfectly. The problem I have is with
> synonym phrases.
>
> So if the synonyms input file contains
>
>     isoweek,isodate
>
> then (using Luke) I can see that any document containing either
> 'isoweek' or 'isodate' has indexed both terms, and a search with
> either term returns matching results for both. Great.
>
> However if the input file contains
>
>     isoweek,iso week
>
> then (again using Luke) I can see that while any document containing
> 'isoweek' has indexed the terms 'isoweek', 'iso' and 'week',
> unfortunately any document containing 'iso week' has only indexed
> 'iso' and 'week'.
>
> Am I chasing the impossible here? Is there something I can do in the
> query analyzer to make it work? (Currently the query analyzer is the
> same as the indexing analyzer with the SynonymGraphFilter and
> FlattenGraphFilter omitted.) Or do I have to manually pre-process any
> query to include OR options for all phrase synonyms?
>
> I haven't produced a small test case for this because I'm hoping a
> high level discussion is all I need to put me on the right track.
>
> cheers
> T

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org