[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Shai Erera (JIRA) Thu, 29 Nov 2007 05:38:17 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-2.patch

I think I've found a way to do it:
I've added a new type called ACRONYM_DEP that matches what the old ACRONYM 
rule matched, and fixed the current ACRONYM rule to match only proper acronyms.
I also marked ACRONYM_DEP as deprecated.
I added code to StandardTokenizer that sets a token's type to HOST when the 
type returned is ACRONYM_DEP. This can be changed to set the type to ACRONYM 
instead, if you think there are applications that rely on the Token type.
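
To make the remapping concrete, here is a minimal standalone sketch of the idea. 
It is not the code in the patch: the class name, the remap() helper and the 
string labels are hypothetical stand-ins, and it does not use any Lucene 
classes.

```java
public class TokenTypeRemapSketch {
    // Hypothetical labels standing in for the token types discussed above.
    static final String HOST = "<HOST>";
    static final String ACRONYM = "<ACRONYM>";
    static final String ACRONYM_DEP = "<ACRONYM_DEP>";

    // If the scanner reports the deprecated type, report HOST instead;
    // every other type is passed through unchanged.
    static String remap(String scannerType) {
        return ACRONYM_DEP.equals(scannerType) ? HOST : scannerType;
    }

    public static void main(String[] args) {
        System.out.println(remap(ACRONYM_DEP)); // deprecated type becomes HOST
        System.out.println(remap(ACRONYM));     // proper acronyms keep their type
    }
}
```

Switching the fallback from HOST to ACRONYM would be a one-line change in 
remap(), which is the alternative mentioned above.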

I wrote this short test to verify it works:
        public static void main(String[] args) throws Exception {
                parse("www.abc.com.");
                parse("www.abc.com");
                parse("I.B.M.");
        }

        public static void parse(String text) throws Exception {
                Analyzer analyzer = new StandardAnalyzer();
                TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
                Token t;
                while ((t = ts.next()) != null) {
                        System.out.println(t);
                }
        }
And the output is: 
(www.abc.com.,0,12,type=<HOST>)
(www.abc.com,0,11,type=<HOST>)
(ibm,0,6,type=<ACRONYM>)

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>         Attachments: StandardTokenizerImpl-2.patch, 
> standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>         Analyzer analyzer = new StandardAnalyzer();
>         TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>         Token t;
>         while ((t = ts.next()) != null) {
>             System.out.println(t);
>         }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>), 
> which is correct in my opinion.
> However, if you pass "www.abc.com." (note the extra '.' at the end), the 
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It classifies the string incorrectly (no argument there).
> 2. It effectively prevents you from putting a URL at the end of a sentence, 
> which is perfectly legal.
> 3. An ACRONYM, to the best of my understanding, has the form 
> A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from 
> this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> Notice that the comment describes acronyms as U.S.A., I.B.M., and nothing 
> else. I changed the definition to
> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
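
The difference between the two ACRONYM definitions quoted above can be 
approximated with java.util.regex. This is only a sketch: it assumes ALPHA 
means "one or more letters" and LETTER means "a single letter", and it uses 
\p{L} as a stand-in for the grammar's own character ranges. The class and 
method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class AcronymRuleSketch {
    // Approximates the old rule: {ALPHA} "." ({ALPHA} ".")+
    // (assumption: ALPHA is one or more letters)
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Approximates the proposed rule: {LETTER} "." ({LETTER} ".")+
    // (assumption: LETTER is exactly one letter)
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"www.abc.com.", "www.abc.com", "I.B.M."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions, "www.abc.com." matches only the old pattern, 
"I.B.M." matches both, and "www.abc.com" matches neither, which is consistent 
with the behavior described in the issue.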
