[ https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703790#action_12703790 ]
Earwin Burrfoot commented on LUCENE-1622:
-----------------------------------------

I'll briefly recap my experiences mentioned on the list.

* Injecting a "synonym group id" token instead of all tokens for all synonyms in the group is a big win for index size and saves you from spurious matches on a partial synonym such as "big". It also plays better with highlighting (I still had to rewrite it to handle all corner cases).
* Properly handling multiword synonyms on the index side alone is impossible; you have to dabble in query rewriting (even then low-probability corner cases exist, and you might find extra docs).
* Query expansion is the only absolutely clear way to have multiword synonyms with current Lucene, but it is impractical for any adequate synonym dictionary.
* There is a possible change to the way Lucene indexes tokens+positions that would enable fully proper multiword synonyms: adding a notion of 'length' or 'span' to a token. This length should play together with positionIncrement when calculating the distance between tokens in phrase/spannear queries.

> Multi-word synonym filter (synonym expansion at indexing time).
> ---------------------------------------------------------------
>
> Key: LUCENE-1622
> URL: https://issues.apache.org/jira/browse/LUCENE-1622
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Dawid Weiss
> Priority: Minor
> Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time
> synonym expansion, especially for multi-word synonyms (with multi-word
> matching for original tokens).
> The problem is not trivial, as observed on the mailing list.
> The problems I was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream
>   (at overlapping positions), then a query for a partial synonym sequence
>   (e.g., "big" in the synonym "big apple" for "new york city") causes the
>   document to match;
> - there are problems with highlighting the original document when a synonym
>   is matched (see unit tests for an example);
> - if the synonym is of a different length than the original sequence of
>   tokens to be matched, then phrase queries spanning the synonym and the
>   original sequence boundary won't be found. Example: "big apple" synonym
>   for "new york city". A phrase query "big apple restaurants" won't match
>   "new york city restaurants".
> I am posting the patch that implements phrase synonyms as a token filter.
> This is not necessarily intended for immediate inclusion, but may provide a
> basis for many people to experiment and adjust to their own scenarios.
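
To make the "synonym group id" and token 'length' ideas from the comment concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Lucene code and not the attached patch. The names `SYNONYM_GROUPS`, `SYN_1`, `match_group_at`, `index_document`, `rewrite_query` and `phrase_matches` are all hypothetical, and keeping the original tokens in the index next to the group id is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical synonym dictionary: every phrase of a group maps to one
# made-up group id token ("SYN_1" is an illustrative name only).
SYNONYM_GROUPS = {
    ("new", "york", "city"): "SYN_1",
    ("big", "apple"): "SYN_1",
}


def match_group_at(tokens, i):
    """Return (length, group_id) if a dictionary phrase starts at tokens[i]."""
    for phrase, group_id in SYNONYM_GROUPS.items():
        if tuple(tokens[i:i + len(phrase)]) == phrase:
            return len(phrase), group_id
    return None


def index_document(text):
    """Build toy postings: term -> list of (position, length).

    Original tokens get length 1. A recognised synonym phrase additionally
    injects its group id at the phrase start, with length equal to the
    number of original tokens it covers.
    """
    tokens = text.split()
    postings = defaultdict(list)
    for pos, tok in enumerate(tokens):
        postings[tok].append((pos, 1))
        hit = match_group_at(tokens, pos)
        if hit:
            length, group_id = hit
            postings[group_id].append((pos, length))
    return postings


def rewrite_query(text):
    """Query-side rewriting: replace each known phrase with its group id."""
    tokens = text.split()
    terms = []
    i = 0
    while i < len(tokens):
        hit = match_group_at(tokens, i)
        if hit:
            length, group_id = hit
            terms.append(group_id)
            i += length
        else:
            terms.append(tokens[i])
            i += 1
    return terms


def phrase_matches(postings, query_text):
    """Exact phrase match where the next term must start at position + length."""
    terms = rewrite_query(query_text)
    if not terms:
        return False

    def extend(idx, expected_pos):
        if idx == len(terms):
            return True
        return any(
            extend(idx + 1, pos + length)
            for pos, length in postings.get(terms[idx], [])
            if expected_pos is None or pos == expected_pos
        )

    return extend(0, None)


doc = index_document("new york city restaurants")
print(phrase_matches(doc, "big apple restaurants"))  # True
print(phrase_matches(doc, "big"))                    # False
```

In this toy model, "big apple restaurants" is rewritten to the group id followed by "restaurants", and the group id's length of 3 lets the phrase line up with "restaurants" at position 3. A query for "big" alone finds nothing, because no synonym tokens were indexed. The corner cases mentioned in the comment (overlapping phrases, highlighting) are not handled here.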