[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554171 ]

Shai Erera commented on LUCENE-1068:
------------------------------------

Even if you run testNumeric() on the trunk version, it recognizes "21.35" as 
HOST and not NUM. The problem is that HOST is configured to recognize 
letters or digits. I'll check whether there's a way to define precedence in 
JFlex, i.e., first detect NUM, then HOST (since every floating-point number 
also matches HOST).
Another option would be to make HOST detect only series of the form 
xxx.yyy.(zzz .)+, meaning aaa.bbb won't be a HOST, but aaa.bbb.ccc will be. 
Do you see any problem with that? Are you aware of hosts of the form aa.bb?
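
To make the overlap concrete, here is a rough sketch using plain java.util.regex instead of JFlex. The class name and all three patterns are my own simplified stand-ins (assumptions), not the actual macros in StandardTokenizerImpl.jflex; they only show why "21.35" fits both shapes and how requiring a third segment would change that.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: these regexes approximate the idea and
// are NOT the real NUM/HOST definitions from StandardTokenizerImpl.jflex.
final class HostPatternSketch {
    // Assumed NUM shape for this sketch: digits, a dot, digits (e.g. 21.35).
    private static final Pattern NUM = Pattern.compile("[0-9]+\\.[0-9]+");
    // Assumed current HOST shape: two or more dot-separated runs of letters or digits.
    private static final Pattern HOST_LOOSE =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");
    // Proposed HOST shape: at least three dot-separated runs.
    private static final Pattern HOST_STRICT =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+){2,}");

    static boolean isNum(String s) { return NUM.matcher(s).matches(); }
    static boolean isHostLoose(String s) { return HOST_LOOSE.matcher(s).matches(); }
    static boolean isHostStrict(String s) { return HOST_STRICT.matcher(s).matches(); }

    public static void main(String[] args) {
        // Print how each sample is classified by the three sketch patterns.
        for (String s : new String[] {"21.35", "aaa.bbb", "aaa.bbb.ccc", "www.abc.com"}) {
            System.out.println(s + " num=" + isNum(s)
                    + " hostLoose=" + isHostLoose(s)
                    + " hostStrict=" + isHostStrict(s));
        }
    }
}
```

In this sketch "21.35" matches both NUM and the loose HOST pattern, which is the ambiguity described above. The strict pattern rejects it, at the cost of also rejecting two-part names such as aaa.bbb.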

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1068.patch, StandardTokenizer-java-4.patch, 
> StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch, 
> StandardTokenizerImpl-3.patch, StandardTokenizerImpl-5.patch, 
> standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>         Analyzer analyzer = new StandardAnalyzer();
>         TokenStream ts = analyzer.tokenStream("content",
>                 new StringReader("<some text>"));
>         Token t;
>         while ((t = ts.next()) != null) {
>             System.out.println(t);
>         }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) 
> (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the 
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which 
> is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form 
> A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from 
> this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> Notice that the comment describes acronyms as U.S.A., I.B.M., and nothing 
> else. I changed the definition to
> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
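
For illustration, the difference between the two ACRONYM definitions quoted above can be sketched with plain java.util.regex. This assumes that ALPHA stands for one or more letters and LETTER for a single letter, and it uses ASCII letters only; the class name and both patterns are hypothetical stand-ins, not the real JFlex macros.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: ASCII-only approximations of the two
// ACRONYM definitions, assuming ALPHA = one or more letters and
// LETTER = exactly one letter.
final class AcronymPatternSketch {
    // Approximates: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    private static final Pattern ACRONYM_ALPHA =
            Pattern.compile("[A-Za-z]+\\.([A-Za-z]+\\.)+");
    // Approximates: ACRONYM = {LETTER} "." ({LETTER} ".")+
    private static final Pattern ACRONYM_LETTER =
            Pattern.compile("[A-Za-z]\\.([A-Za-z]\\.)+");

    static boolean matchesAlphaForm(String s) {
        return ACRONYM_ALPHA.matcher(s).matches();
    }

    static boolean matchesLetterForm(String s) {
        return ACRONYM_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Compare both definitions on real acronyms and on a host with a trailing dot.
        for (String s : new String[] {"U.S.A.", "I.B.M.", "www.abc.com."}) {
            System.out.println(s + " alphaForm=" + matchesAlphaForm(s)
                    + " letterForm=" + matchesLetterForm(s));
        }
    }
}
```

Under these assumptions, "www.abc.com." fits the ALPHA-based form because each segment may be several letters long, while the LETTER-based form accepts only single-letter segments such as U.S.A. and I.B.M.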
