[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554174 ]
Shai Erera commented on LUCENE-1068:
------------------------------------

Maybe this is a separate issue? Notice that IP addresses are also recognized as HOST, however StandardTokenizerImpl.jflex documentation specifies they should be recognized as NUM.

// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
     | {HAS_DIGIT} {P} {ALPHANUM}
     | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
     | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
     | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
     | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1068.patch, StandardTokenizer-java-4.patch,
>                      StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch,
>                      StandardTokenizerImpl-3.patch, StandardTokenizerImpl-5.patch,
>                      standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>
>     Analyzer analyzer = new StandardAnalyzer();
>     TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>     Token t;
>     while ((t = ts.next()) != null) {
>         System.out.println(t);
>     }
>
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
> (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argue on that).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which
>    is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form
>    A.B.C. and not ABC.DEF.
>
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from
> this definition:
>
>     // acronyms: U.S.A., I.B.M., etc.
>     // use a post-filter to remove dots
>     ACRONYM = {ALPHA} "." ({ALPHA} ".")+
>
> Notice how the comment relates to acronym as U.S.A., I.B.M. and not something
> else. I changed the definition to
>
>     ACRONYM = {LETTER} "." ({LETTER} ".")+
>
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
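
For illustration only, here is a minimal sketch of why the two ACRONYM definitions quoted above behave differently on "www.abc.com.". It does not use Lucene or JFlex; it approximates the macros with java.util.regex. Two assumptions are made that are not stated in the issue: that ALPHA stands for one or more letters, and that \p{L} is an acceptable stand-in for LETTER. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class AcronymPatternSketch {
    // Approximation of: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    // Assumption: ALPHA is a run of one or more letters.
    static final Pattern ALPHA_ACRONYM =
        Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Approximation of: ACRONYM = {LETTER} "." ({LETTER} ".")+
    // Assumption: \p{L} stands in for the LETTER macro.
    static final Pattern LETTER_ACRONYM =
        Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    // True if the whole string matches the original (ALPHA-based) rule.
    static boolean matchesOriginal(String s) {
        return ALPHA_ACRONYM.matcher(s).matches();
    }

    // True if the whole string matches the proposed (LETTER-based) rule.
    static boolean matchesProposed(String s) {
        return LETTER_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : inputs) {
            System.out.println(s
                + " original=" + matchesOriginal(s)
                + " proposed=" + matchesProposed(s));
        }
    }
}
```

Under these assumptions, both patterns accept "U.S.A." and "I.B.M.", but only the ALPHA-based pattern accepts "www.abc.com.", which is consistent with the ACRONYM classification reported in the issue. A real tokenizer also picks among competing rules, so this sketch only shows which strings each rule can match, not the final token type.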