[jira] Issue Comment Edited: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

Earwin Burrfoot (JIRA) Tue, 28 Apr 2009 11:50:54 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703790#action_12703790
 ]


Earwin Burrfoot edited comment on LUCENE-1622 at 4/28/09 11:50 AM:
-------------------------------------------------------------------

I'll briefly recap the experiences I mentioned on the list.

* Injecting a single "synonym group id" token instead of the tokens of every 
synonym in the group is a big win for index size and saves you from matching 
on "big". It also plays better with highlighting (I still had to rewrite it to 
handle all corner cases).
* Handling multiword synonyms properly on the index side alone is impossible; 
you have to dabble in query rewriting (even then low-probability corner cases 
exist, and you might find extra docs).
* Query expansion is the only completely clean way to get multiword synonyms 
with current Lucene, but it is impractical with any adequately large synonym 
dictionary.
* There is a possible change to the way Lucene indexes tokens+positions that 
would enable fully proper multiword synonyms (with the index+query rewrite 
approach): adding a notion of 'length' or 'span' to a token. This length 
should play together with positionIncrement when calculating the distance 
between tokens in phrase/spannear queries.
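
To illustrate the first point, here is a minimal sketch of the group-id idea in 
plain Java, without any Lucene classes. The dictionary, the SYN_1 id and all 
method names are made up for illustration, and matching is a simple all-terms 
check that ignores positions, so it only shows why a query for "big" alone no 
longer hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SynonymGroupSketch {
    // Hypothetical dictionary: every phrase of a group maps to the same group id.
    static final Map<List<String>, String> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put(Arrays.asList("new", "york", "city"), "SYN_1");
        GROUPS.put(Arrays.asList("big", "apple"), "SYN_1");
    }

    // Naive whitespace tokenizer with lowercasing (an assumption, not a real analyzer).
    static List<String> words(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Longest dictionary phrase starting at words[i], or null if none matches.
    static List<String> longestMatch(List<String> words, int i) {
        List<String> best = null;
        for (List<String> phrase : GROUPS.keySet()) {
            int end = i + phrase.size();
            if (end <= words.size() && words.subList(i, end).equals(phrase)
                    && (best == null || phrase.size() > best.size())) {
                best = phrase;
            }
        }
        return best;
    }

    // Index side: keep the original terms and add the group id where a phrase starts.
    static Set<String> indexTerms(String text) {
        List<String> w = words(text);
        Set<String> terms = new LinkedHashSet<>();
        for (int i = 0; i < w.size(); i++) {
            terms.add(w.get(i));
            List<String> phrase = longestMatch(w, i);
            if (phrase != null) {
                terms.add(GROUPS.get(phrase));
            }
        }
        return terms;
    }

    // Query side: replace each matched phrase with its group id.
    static List<String> rewriteQuery(String query) {
        List<String> w = words(query);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < w.size()) {
            List<String> phrase = longestMatch(w, i);
            if (phrase != null) {
                out.add(GROUPS.get(phrase));
                i += phrase.size();
            } else {
                out.add(w.get(i));
                i++;
            }
        }
        return out;
    }

    // A document matches when it contains every term of the rewritten query.
    static boolean matches(String doc, String query) {
        return indexTerms(doc).containsAll(rewriteQuery(query));
    }

    public static void main(String[] args) {
        String doc = "new york city restaurants";
        System.out.println(matches(doc, "big apple")); // true
        System.out.println(matches(doc, "big"));       // false
    }
}
```

Only the group id is added to the index, so the other members of the group 
("big", "apple") never appear as terms for this document.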

> Multi-word synonym filter (synonym expansion at indexing time).
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1622
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time 
> synonym expansion, especially for multi-word synonyms (with multi-word 
> matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I 
> was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream 
> (at overlapping positions), then a query for a partial synonym sequence 
> (e.g., "big" in the synonym "big apple" for "new york city") causes the 
> document to match;
> - there are problems with highlighting the original document when a synonym 
> is matched (see unit tests for an example);
> - if the synonym is of a different length than the original sequence of 
> tokens to be matched, then phrase queries spanning the boundary between the 
> synonym and the original sequence won't match. Example: "big apple" as a 
> synonym for "new york city". A phrase query "big apple restaurants" won't 
> match "new york city restaurants".
> I am posting the patch that implements phrase synonyms as a token filter. 
> This is not necessarily intended for immediate inclusion, but may provide a 
> basis for many people to experiment and adjust to their own scenarios.
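
The phrase-boundary problem from the last point can be reproduced with plain 
position arithmetic. The sketch below is standalone Java, not Lucene code: the 
Token class, the per-token 'length' field and the matching routine are all 
assumptions made for illustration, following the 'length'/'span' idea from the 
comment above. Positions are the running sum of position increments.

```java
import java.util.ArrayList;
import java.util.List;

class PositionLengthSketch {
    static final class Token {
        final String term;
        final int position; // running sum of position increments
        final int length;   // hypothetical: number of positions the token spans

        Token(String term, int position, int length) {
            this.term = term;
            this.position = position;
            this.length = length;
        }
    }

    // "new york city restaurants" with "big apple" injected at overlapping positions.
    static List<Token> indexedTokens() {
        List<Token> t = new ArrayList<>();
        t.add(new Token("new", 0, 1));
        t.add(new Token("big", 0, 1));         // synonym starts where "new" starts
        t.add(new Token("york", 1, 1));
        t.add(new Token("apple", 1, 2));       // stretched so it ends where "city" ends
        t.add(new Token("city", 2, 1));
        t.add(new Token("restaurants", 3, 1));
        return t;
    }

    // Exact phrase match. With useLength == false every token is treated as one
    // position wide; with useLength == true the next term must start at
    // position + length of the previous one.
    static boolean phraseMatches(List<Token> tokens, boolean useLength, String... phrase) {
        return phrase.length > 0 && matchFrom(tokens, phrase, 0, -1, useLength);
    }

    static boolean matchFrom(List<Token> tokens, String[] phrase, int idx,
                             int expectedPos, boolean useLength) {
        if (idx == phrase.length) {
            return true;
        }
        for (Token t : tokens) {
            boolean positionOk = idx == 0 || t.position == expectedPos;
            if (positionOk && t.term.equals(phrase[idx])) {
                int next = t.position + (useLength ? t.length : 1);
                if (matchFrom(tokens, phrase, idx + 1, next, useLength)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> doc = indexedTokens();
        System.out.println(phraseMatches(doc, false, "big", "apple", "restaurants")); // false
        System.out.println(phraseMatches(doc, true, "big", "apple", "restaurants"));  // true
    }
}
```

Without the length, "apple" sits at position 1 and "restaurants" at position 3, 
so the phrase is not contiguous; with it, "apple" is taken to end at position 
3 and the phrase lines up. The original phrase "new york city restaurants" 
matches in both modes.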
