[
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Sekiguchi updated LUCENE-5252:
-----------------------------------
Attachment: (was: LUCENE-5252_b4.patch)
> add NGramSynonymTokenizer
> -------------------------
>
> Key: LUCENE-5252
> URL: https://issues.apache.org/jira/browse/LUCENE-5252
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Koji Sekiguchi
> Priority: Minor
> Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch,
> LUCENE-5252_4x.patch, LUCENE-5252_4x.patch
>
>
> I'd like to propose another n-gram tokenizer that can process synonyms:
> NGramSynonymTokenizer. Note that in this ticket, the gram size is fixed,
> i.e. minGramSize = maxGramSize.
> Today, I think we have the following problems when using SynonymFilter with
> NGramTokenizer.
> For the purpose of illustration, assume a synonym setting "ABC, DEFG" w/
> expand=true and N = 2 (2-gram).
> # There is no consensus (I think :-) on how we should assign offsets to the
> generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
> # If the query pattern looks like ABCY, it cannot be matched even if there is
> a document "…ABCY…" in the index when autoGeneratePhraseQueries is set to
> true, because there is no "CY" token in the index (but "GY" is there).
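Problem 2 can be illustrated with a small Python sketch. This is a toy model only: it assumes, for illustration, that after synonym processing at index time the text around the match effectively reads "DEFGY", so the index contains the 2-gram "GY" but not "CY".

```python
def fixed_ngrams(s, n=2):
    # plain fixed-size n-grams (minGramSize == maxGramSize)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# Index side (illustrative assumption): the synonym mapping turned the
# indexed region into "DEFGY", so only its grams reach the index.
indexed = fixed_ngrams("DEFGY")   # ['DE', 'EF', 'FG', 'GY']

# Query side: with autoGeneratePhraseQueries, "ABCY" becomes a phrase
# over its own grams, all of which must be present.
query = fixed_ngrams("ABCY")      # ['AB', 'BC', 'CY']

# "CY" never made it into the index, so the phrase query cannot match.
assert "CY" not in indexed and "GY" in indexed
```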
> NGramSynonymTokenizer solves these problems by providing the following
> behaviors.
> * NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does
> not tokenize registered words. e.g.
> ||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
> |ABC|AB/DE/BC/EF/FG|ABC/DEFG|
> * Immediately before and after registered words, NGramSynonymTokenizer
> generates *extra* tokens w/ posInc=0. e.g.
> ||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
> |XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|
> In the above sample, "Z" and "1" are the extra tokens.
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]