Hi,

Assuming "+1" means "I agree" (forgive my unfamiliarity with the jargon), I'll make a new patch shortly that keeps the old behavior as the default.
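To make the proposal concrete, here is a minimal sketch of the "old behavior by default" idea discussed below. It is only an illustration, not the patch itself: the class `AcronymCompatSketch` and the method `typeForDeprecatedAcronymMatch` are hypothetical, the flag name is borrowed from Michael's comment, and the assumption that the new behavior reports `<HOST>` is mine.

```java
/** Hypothetical stand-in for the tokenizer; not the actual Lucene class. */
public class AcronymCompatSketch {

    /**
     * false (the default) keeps the old token type, so upgrading
     * applications see no change; true opts in to the new behavior.
     *
     * @deprecated meant to be removed once the new behavior is the default
     */
    @Deprecated
    private boolean replaceDepAcronym = false;

    /** Opt in to (or out of) the new behavior. Name taken from the thread. */
    @Deprecated
    public void setReplaceDepAcronym(boolean replace) {
        this.replaceDepAcronym = replace;
    }

    /**
     * Type reported for text such as "www.abc.com." that matched the old,
     * over-broad acronym rule. Assumption: the new behavior reports HOST.
     */
    public String typeForDeprecatedAcronymMatch() {
        return replaceDepAcronym ? "<HOST>" : "<ACRONYM>";
    }

    public static void main(String[] args) {
        AcronymCompatSketch tokenizer = new AcronymCompatSketch();
        System.out.println("default: " + tokenizer.typeForDeprecatedAcronymMatch());
        tokenizer.setReplaceDepAcronym(true);
        System.out.println("opt-in:  " + tokenizer.typeForDeprecatedAcronymMatch());
    }
}
```

With this shape, applications that do nothing keep the old behavior, and only the default value has to change in 3.0.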
On Dec 12, 2007 3:14 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> On Dec 12, 2007, at 7:24 AM, Michael Busch (JIRA) wrote:
>
> >
> > [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550948 ]
> >
> > Michael Busch commented on LUCENE-1068:
> > ---------------------------------------
> >
> > {quote}
> > The member is marked deprecated so we can remove it in the next
> > release. Applications that would like the new behavior need to do
> > nothing, and therefore will not be impacted once we remove that
> > member. Applications that want the old behavior need to explicitly
> > set it and in the next major release remove it.
> > {quote}
> >
> > Doesn't this mean it is an API change if we make the new behavior
> > the default? Apps that upgrade will see the new behavior unless
> > they call replaceDepAcronym.
> >
> > To be fully backwards compatible I think this patch should use the
> > old behavior as default. Then in 3.0 we can make the new behavior
> > the default.
>
> +1
>
> >
> >> Invalid behavior of StandardTokenizerImpl
> >> -----------------------------------------
> >>
> >>          Key: LUCENE-1068
> >>          URL: https://issues.apache.org/jira/browse/LUCENE-1068
> >>      Project: Lucene - Java
> >>   Issue Type: Bug
> >>   Components: Analysis
> >>     Reporter: Shai Erera
> >>     Assignee: Grant Ingersoll
> >>  Attachments: StandardTokenizer-java-4.patch,
> >>               StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch,
> >>               StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch,
> >>               standardTokenizerImpl.patch
> >>
> >>
> >> The following code prints the output of StandardAnalyzer:
> >>
> >>     Analyzer analyzer = new StandardAnalyzer();
> >>     TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
> >>     Token t;
> >>     while ((t = ts.next()) != null) {
> >>         System.out.println(t);
> >>     }
> >>
> >> If you pass "www.abc.com", the output is
> >> (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
> >> However, if you pass "www.abc.com." (notice the extra '.' at the
> >> end), the output is (wwwabccom,0,12,type=<ACRONYM>).
> >> I think the behavior in the second case is incorrect for several
> >> reasons:
> >> 1. It recognizes the string incorrectly (no argument there).
> >> 2. It effectively prevents you from putting URLs at the end of a
> >>    sentence, which is perfectly legal.
> >> 3. An ACRONYM, at least to the best of my understanding, is of the
> >>    form A.B.C. and not ABC.DEF.
> >> I looked at StandardTokenizerImpl.jflex and I think the problem
> >> comes from this definition:
> >>
> >>     // acronyms: U.S.A., I.B.M., etc.
> >>     // use a post-filter to remove dots
> >>     ACRONYM = {ALPHA} "." ({ALPHA} ".")+
> >>
> >> Notice that the comment describes acronyms as U.S.A., I.B.M. and
> >> not something else. I changed the definition to
> >>
> >>     ACRONYM = {LETTER} "." ({LETTER} ".")+
> >>
> >> and it solved the problem.
> >> This was also reported here:
> >> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> >> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]

--
Regards,
Shai Erera
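P.S. For anyone who wants to check the difference between the two ACRONYM definitions quoted above without regenerating the tokenizer, here is a rough approximation using plain `java.util.regex`. It assumes `{ALPHA}` means one or more letters and `{LETTER}` means exactly one letter (the issue text does not spell these macros out), and it matches whole strings, whereas the generated scanner picks the longest match among all its rules. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

/** Rough regex approximation of the two ACRONYM rules; not the JFlex grammar. */
public class AcronymPatternDemo {

    // Assumption: {ALPHA} is one or more letters (old definition).
    static final Pattern ALPHA_ACRONYM =
            Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Assumption: {LETTER} is exactly one letter (changed definition).
    static final Pattern LETTER_ACRONYM =
            Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    /** True if the whole string matches the old ACRONYM shape. */
    static boolean oldRuleMatches(String text) {
        return ALPHA_ACRONYM.matcher(text).matches();
    }

    /** True if the whole string matches the changed ACRONYM shape. */
    static boolean newRuleMatches(String text) {
        return LETTER_ACRONYM.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String sample : samples) {
            System.out.println(sample
                    + "  old=" + oldRuleMatches(sample)
                    + "  new=" + newRuleMatches(sample));
        }
    }
}
```

Under these assumptions the old pattern accepts "www.abc.com." as an acronym while the changed one does not, and both still accept "U.S.A." and "I.B.M.".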