Hi,

I attached two patch files (one for "java" and one for "test"). Due to a problem with my project checkout in Eclipse, I don't have them under "src". I also added one test and modified two existing tests in TestStandardAnalyzer.
On Dec 10, 2007 11:44 PM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202 ]
>
> Grant Ingersoll commented on LUCENE-1068:
> -----------------------------------------
>
> Hmmm, maybe there is a way in Eclipse to make the path relative to the
> working directory? Otherwise, from the command line in the Lucene
> directory: svn diff > StandardTokenizer-4.patch
>
> -Grant
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> > Invalid behavior of StandardTokenizerImpl
> > -----------------------------------------
> >
> >          Key: LUCENE-1068
> >          URL: https://issues.apache.org/jira/browse/LUCENE-1068
> >      Project: Lucene - Java
> >   Issue Type: Bug
> >   Components: Analysis
> >     Reporter: Shai Erera
> >     Assignee: Grant Ingersoll
> >  Attachments: StandardTokenizerImpl-2.patch, StandardTokenizerImpl-3.patch,
> >               standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
> >
> > The following code prints the output of StandardAnalyzer:
> >
> >     Analyzer analyzer = new StandardAnalyzer();
> >     TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
> >     Token t;
> >     while ((t = ts.next()) != null) {
> >         System.out.println(t);
> >     }
> >
> > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
> > (which is correct in my opinion).
> > However, if you pass "www.abc.com." (notice the extra '.' at the end),
> > the output is (wwwabccom,0,12,type=<ACRONYM>).
> > I think the behavior in the second case is incorrect for several reasons:
> > 1. It recognizes the string incorrectly (no argue on that).
> > 2. It kind of prevents you from putting URLs at the end of a sentence,
> >    which is perfectly legal.
> > 3. An ACRONYM, at least to the best of my understanding, is of the form
> >    A.B.C. and not ABC.DEF.
> >
> > I looked at StandardTokenizerImpl.jflex and I think the problem comes
> > from this definition:
> >
> >     // acronyms: U.S.A., I.B.M., etc.
> >     // use a post-filter to remove dots
> >     ACRONYM = {ALPHA} "." ({ALPHA} ".")+
> >
> > Notice how the comment relates to acronym as U.S.A., I.B.M. and not
> > something else. I changed the definition to
> >
> >     ACRONYM = {LETTER} "." ({LETTER} ".")+
> >
> > and it solved the problem.
> > This was also reported here:
> > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

-- 
Regards,
Shai Erera
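
P.S. For anyone following the thread without the grammar at hand, here is a rough sketch of why the two ACRONYM definitions quoted above behave differently. It uses plain java.util.regex rather than JFlex, so it only checks whether a whole string fits each rule and does not reproduce the scanner's rule priorities or longest-match behavior. The class and method names are made up for illustration, ASCII letters stand in for {LETTER}, and I am assuming {ALPHA} means a run of one or more letters.

```java
import java.util.regex.Pattern;

// Illustration only: approximates the two ACRONYM definitions with regexes.
class AcronymRuleSketch {
    // Simplification: ASCII letters stand in for JFlex's {LETTER} class.
    static final String LETTER = "[A-Za-z]";
    // Assumption: {ALPHA} is a run of one or more letters.
    static final String ALPHA = LETTER + "+";

    // Original rule: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    static final Pattern OLD_ACRONYM =
            Pattern.compile(ALPHA + "\\.(" + ALPHA + "\\.)+");

    // Proposed rule: ACRONYM = {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM =
            Pattern.compile(LETTER + "\\.(" + LETTER + "\\.)+");

    // True if the whole string fits the original definition.
    static boolean oldRuleMatches(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    // True if the whole string fits the proposed definition.
    static boolean newRuleMatches(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : samples) {
            System.out.println(s + "  old=" + oldRuleMatches(s)
                    + "  new=" + newRuleMatches(s));
        }
    }
}
```

Under these assumptions, "www.abc.com." fits the original rule because each dot-terminated segment may be several letters long, while the proposed rule only accepts single-letter segments such as "U.S.A.".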