[jira] Commented: (LUCENE-1787) Standard Tokenizer doesn't recognise I.B.M as Acronym, it requires it ends with a dot i.e I.B.M.

Michael McCandless (JIRA) Mon, 24 Aug 2009 03:18:26 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746802#action_12746802
 ]


Michael McCandless commented on LUCENE-1787:
--------------------------------------------

The big challenge here is back compat.  I.e., if we make this fix (which is a 
good fix!) and users then upgrade to 2.9, queries may suddenly stop hitting the 
right documents, because those documents were indexed with the old 
StandardAnalyzer that has this bug.  In other words, the bug is "cached" in 
their index.

This is why we added "matchVersion" to StandardAnalyzer, but unfortunately we 
don't yet have a clean way to honor matchVersion when the fix requires changes 
to the JFlex grammar.
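
To make the "cached bug" concrete, here is a minimal sketch of version-gated 
acronym handling. It uses plain java.util.regex rather than the JFlex grammar, 
and the class, enum, and method names are hypothetical, not Lucene APIs. The 
two patterns only approximate the old and the proposed acronym rules.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names are Lucene APIs.
class MatchVersionSketch {
    // Stand-in for a matchVersion argument (hypothetical values).
    enum Version { BEFORE_FIX, WITH_FIX }

    // Behaviour described in the issue: every letter must be followed by a dot.
    private static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");
    // Behaviour after the proposed fix: the trailing dot is optional.
    private static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}(\\.\\p{L})+\\.?");

    // Lowercase the token, then strip the dots if it counts as an acronym
    // under the rules selected by matchVersion.
    static String normalize(String token, Version matchVersion) {
        Pattern acronym = (matchVersion == Version.WITH_FIX) ? NEW_ACRONYM : OLD_ACRONYM;
        String lower = token.toLowerCase(Locale.ROOT);
        return acronym.matcher(lower).matches() ? lower.replace(".", "") : lower;
    }

    public static void main(String[] args) {
        String indexed = normalize("I.B.M", Version.BEFORE_FIX);  // term already in the index
        String upgraded = normalize("I.B.M", Version.WITH_FIX);   // query after the upgrade
        String pinned = normalize("I.B.M", Version.BEFORE_FIX);   // query pinned to old rules
        System.out.println(indexed.equals(upgraded)); // false: the query no longer hits
        System.out.println(indexed.equals(pinned));   // true: old behaviour preserved
    }
}
```

The design point is that the version choice has to reach the code that applies 
the acronym rule. In this sketch that is a one-line pattern switch; with a 
generated JFlex scanner there is no equally simple switch.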

> Standard Tokenizer doesn't recognise I.B.M as Acronym, it requires it ends 
> with a dot i.e I.B.M.
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1787
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1787
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Paul taylor
>         Attachments: LUCENE-1787.patch
>
>
> Standard Tokenizer doesn't recognise I.B.M; it requires it to end with a dot, 
> i.e. I.B.M. This is particularly problematic if I.B.M is added to the index: 
> with the StandardAnalyzer it will get added as IBM, but a search for I.B.M 
> will not match because I.B.M will be left as is. I would expect a match in 
> this scenario.
> I think it could be fixed by modifying the grammar ACRONYM_DEP in 
> StandardTokenizerImpl.jflex so that it also supports
> {ALPHANUM} ("." {ALPHANUM})+
> i.e. a dot is only required between each character (I'm not familiar with 
> JFlex syntax).
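
As a rough check of the rule proposed in the description, the sketch below 
translates it into java.util.regex. This is only an approximation of the 
JFlex grammar, and the class and constant names are hypothetical. It shows two 
readings of {ALPHANUM}: a single letter or digit, as the description suggests, 
and a run of one or more, which is an assumption about how the macro might be 
defined. Under the second reading the rule would also accept dotted names 
such as example.com.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; approximates the grammar rules with java.util.regex.
class AcronymPatternSketch {
    // Dot required after every character, including the last one.
    static final Pattern TRAILING_DOT =
        Pattern.compile("[\\p{L}\\p{N}]\\.([\\p{L}\\p{N}]\\.)+");

    // Proposed rule, reading ALPHANUM as a single character:
    // a dot is required only between characters.
    static final Pattern DOT_BETWEEN =
        Pattern.compile("[\\p{L}\\p{N}](\\.[\\p{L}\\p{N}])+");

    // Same rule, reading ALPHANUM as a run of one or more characters (assumption).
    static final Pattern DOT_BETWEEN_RUNS =
        Pattern.compile("[\\p{L}\\p{N}]+(\\.[\\p{L}\\p{N}]+)+");

    // True only if the whole input matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"I.B.M.", "I.B.M", "example.com"}) {
            System.out.printf("%-12s trailingDot=%b dotBetween=%b dotBetweenRuns=%b%n",
                s, matches(TRAILING_DOT, s), matches(DOT_BETWEEN, s),
                matches(DOT_BETWEEN_RUNS, s));
        }
    }
}
```

Note that the dot-between form on its own does not accept I.B.M. with the 
trailing dot, which is consistent with the description asking for it to be 
supported in addition to the existing rule.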

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
