[jira] Created: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Shai Erera (JIRA) Tue, 27 Nov 2007 03:55:14 -0800

Invalid behavior of StandardTokenizerImpl
-----------------------------------------


                 Key: LUCENE-1068
                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
            Reporter: Shai Erera


The following code prints the output of StandardAnalyzer:

        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new 
StringReader("<some text>"));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }

If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which 
is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the 
output is (wwwabccom,0,12,type=<ACRONYM>).

I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argue on that).
2. It kind of prevents you from putting URLs at the end of a sentence, which is 
perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. 
and not ABC.DEF.

I looked at StandardTokenizerImpl.jflex and I think the problem comes from this 
definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+

Notice how the comment relates to acronym as U.S.A., I.B.M. and not something 
else. I changed the definition to
ACRONYM    =  {LETTER} "." ({LETTER} ".")+
and it solved the problem.

This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Created: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Reply via email to