Yeah, I think it can be made a separate issue.

-Grant

On Dec 23, 2007, at 2:36 AM, Shai Erera (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554174 ]

Shai Erera commented on LUCENE-1068:
------------------------------------

Maybe this is a separate issue?
Notice that IP addresses are also recognized as HOST, although the comments in StandardTokenizerImpl.jflex say they should be recognized as NUM:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
          | {HAS_DIGIT} {P} {ALPHANUM}
          | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
          | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
          | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
          | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
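
A rough sketch of why an IP address can come out as HOST even though NUM is meant to cover it. It transcribes the NUM alternatives above into java.util.regex and compares them with a HOST pattern. The ASCII-only macro stand-ins, the P character set, the HOST definition ({ALPHANUM} ("." {ALPHANUM})+), and the class name are all assumptions made for illustration, not copied from the .jflex file.

```java
import java.util.regex.Pattern;

public class NumVsHostSketch {
    // ASCII-only stand-ins for the jflex macros (assumption: the real macros cover more of Unicode).
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    // Assumed punctuation set for {P}.
    static final String P = "[_\\-/.,]";

    // NUM, transcribed from the six alternatives quoted above.
    static final Pattern NUM = Pattern.compile(
        ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT + "(" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM + "(" + P + HAS_DIGIT + P + ALPHANUM + ")+");

    // HOST, assumed to be dot-separated alphanumeric segments.
    static final Pattern HOST = Pattern.compile(ALPHANUM + "(\\." + ALPHANUM + ")+");

    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    static boolean isHost(String s) {
        return HOST.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"192.168.0.1", "www.abc.com"}) {
            System.out.println(s + " NUM=" + isNum(s) + " HOST=" + isHost(s));
        }
    }
}
```

Under these assumptions "192.168.0.1" matches both patterns in full. JFlex picks the longest match and, on a tie, the rule that appears first in the specification, so if the HOST rule is listed before the NUM rule (not verified here), the token would be typed as HOST.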


Invalid behavior of StandardTokenizerImpl
-----------------------------------------

               Key: LUCENE-1068
               URL: https://issues.apache.org/jira/browse/LUCENE-1068
           Project: Lucene - Java
        Issue Type: Bug
        Components: Analysis
          Reporter: Shai Erera
          Assignee: Grant Ingersoll
          Priority: Minor
           Fix For: 2.3

Attachments: LUCENE-1068.patch, StandardTokenizer-java-4.patch, StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch, StandardTokenizerImpl-3.patch, StandardTokenizerImpl-5.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch


The following code prints the output of StandardAnalyzer:
       Analyzer analyzer = new StandardAnalyzer();
       TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
       Token t;
       while ((t = ts.next()) != null) {
           System.out.println(t);
       }
If you pass "www.abc.com", the output is (www.abc.com, 0,11,type=<HOST>) (which is correct in my opinion). However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>). I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argument there).
2. It effectively prevents you from putting URLs at the end of a sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.

I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
Notice that the comment describes acronyms as U.S.A., I.B.M., and nothing else. I changed the definition to:
ACRONYM    =  {LETTER} "." ({LETTER} ".")+
and it solved the problem.
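
To illustrate the difference between the two definitions, here is a minimal sketch using java.util.regex rather than JFlex. It assumes {LETTER} is a single letter and {ALPHA} is one or more letters, both approximated with ASCII; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class AcronymPatternSketch {
    // Assumption: LETTER is one letter, ALPHA is ({LETTER})+ (ASCII approximation).
    static final String LETTER = "[A-Za-z]";
    static final String ALPHA = LETTER + "+";

    // Original definition: {ALPHA} "." ({ALPHA} ".")+
    static final Pattern OLD_ACRONYM =
        Pattern.compile(ALPHA + "\\.(" + ALPHA + "\\.)+");

    // Proposed definition: {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM =
        Pattern.compile(LETTER + "\\.(" + LETTER + "\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"U.S.A.", "I.B.M.", "www.abc.com."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

With multi-letter segments allowed, "www.abc.com." fits the old pattern just as "U.S.A." does. Restricting each segment to a single letter keeps "U.S.A." and "I.B.M." but rejects "www.abc.com.".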
This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



