Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Grant Ingersoll Wed, 26 Dec 2007 04:58:51 -0800

Yeah, I think it can be made a separate issue.

-Grant


On Dec 23, 2007, at 2:36 AM, Shai Erera (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554174 ]
Shai Erera commented on LUCENE-1068:
------------------------------------

Maybe this is a separate issue?
Notice that IP addresses are also recognized as HOST; however, the StandardTokenizerImpl.jflex documentation specifies they should be recognized as NUM.
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
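
As a rough check of the NUM definition above, the sketch below transcribes its six alternatives into a java.util.regex pattern. The expansions of {ALPHANUM}, {HAS_DIGIT} and {P} are not quoted in this message, so the ones used here are assumptions (ASCII letters and digits only). The class name NumSketch is hypothetical. It only shows that an IP address satisfies NUM as written; it does not model how the generated scanner chooses between HOST and NUM.

```java
import java.util.regex.Pattern;

final class NumSketch {
    // Assumed macro expansions (not quoted in this thread), ASCII only.
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    static final String P = "[_\\-/.,]";

    // Direct transcription of the six NUM alternatives, in order.
    static final Pattern NUM = Pattern.compile(
              ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+");

    // True when the whole string matches the NUM approximation.
    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNum("192.168.0.1")); // true
        System.out.println(isNum("www.abc.com")); // false: no segment contains a digit
    }
}
```
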
Invalid behavior of StandardTokenizerImpl
-----------------------------------------

               Key: LUCENE-1068
               URL: https://issues.apache.org/jira/browse/LUCENE-1068
           Project: Lucene - Java
        Issue Type: Bug
        Components: Analysis
          Reporter: Shai Erera
          Assignee: Grant Ingersoll
          Priority: Minor
           Fix For: 2.3
       Attachments: LUCENE-1068.patch, StandardTokenizer-java-4.patch, StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch, StandardTokenizerImpl-3.patch, StandardTokenizerImpl-5.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch

The following code prints the output of StandardAnalyzer:
       Analyzer analyzer = new StandardAnalyzer();
       TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
       Token t;
       while ((t = ts.next()) != null) {
           System.out.println(t);
       }
If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion). However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).
I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argument there).
2. It kind of prevents you from putting URLs at the end of a sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
Notice how the comment describes acronyms as U.S.A., I.B.M. and not something else. I changed the definition to
ACRONYM    =  {LETTER} "." ({LETTER} ".")+
and it solved the problem.
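
The difference between the two ACRONYM definitions can be illustrated with plain java.util.regex patterns. This is only an approximation: it assumes {LETTER} is a single letter and {ALPHA} is a run of one or more letters (consistent with the behavior reported above, but the macro definitions are not quoted in this message), and it uses ASCII letters only. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class AcronymSketch {
    // Original rule: {ALPHA} "." ({ALPHA} ".")+  with ALPHA assumed to be one or more letters.
    static final Pattern OLD_ACRONYM = Pattern.compile("[A-Za-z]+\\.(?:[A-Za-z]+\\.)+");

    // Proposed rule: {LETTER} "." ({LETTER} ".")+  with LETTER assumed to be exactly one letter.
    static final Pattern NEW_ACRONYM = Pattern.compile("[A-Za-z]\\.(?:[A-Za-z]\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"U.S.A.", "www.abc.com."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
        // U.S.A. old=true new=true
        // www.abc.com. old=true new=false
    }
}
```

Under these assumptions the narrower rule still accepts U.S.A. but no longer accepts "www.abc.com."; what the tokenizer emits for that input afterwards depends on the remaining rules in the grammar.
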
This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



