Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Shai Erera Wed, 12 Dec 2007 05:49:57 -0800

Hi

Assuming "+1" means I agree (forgive me for the lack of familiarity with the
jargon), I'll make a new patch shortly.


On Dec 12, 2007 3:14 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

>
> On Dec 12, 2007, at 7:24 AM, Michael Busch (JIRA) wrote:
>
> >
> >    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550948 ]
> >
> > Michael Busch commented on LUCENE-1068:
> > ---------------------------------------
> >
> > {quote}
> > The member is marked deprecated so we can remove it in the next
> > release. Applications that would like the new behavior need to do
> > nothing, and therefore will not be impacted once we remove that
> > member. Applications that want the old behavior need to explicitly
> > set it and in the next major release remove it.
> > {quote}
> >
> > Doesn't this mean it is an API change if we make the new behavior
> > the default? Apps that upgrade will see the new behavior unless they
> > call replaceDepAcronym.
> >
> > To be fully backwards compatible I think this patch should use the
> > old behavior as default. Then in 3.0 we can make the new behavior
> > the default.
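A minimal sketch of the backward-compatible default suggested above. Only the name replaceDepAcronym comes from the comment; the class, the setter, and the type-selection method are hypothetical and are not the code in the attached patches. The deprecated flag defaults to the old behavior, so upgrading applications see no change unless they opt in.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Lucene class or the attached patch.
class BackCompatAcronymSketch {
    // Single letters separated by dots, e.g. U.S.A. (approximates the proposed rule).
    private static final Pattern TRUE_ACRONYM =
            Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    // Old behavior is the default, so existing applications are unaffected.
    private boolean replaceDepAcronym = false;

    /**
     * Opt in to the new behavior.
     *
     * @deprecated the flag would be removed once the new behavior becomes the default.
     */
    @Deprecated
    void setReplaceDepAcronym(boolean replace) {
        this.replaceDepAcronym = replace;
    }

    /** Returns the type to report for a token that the old grammar matched as ACRONYM. */
    String typeForOldAcronymMatch(String token) {
        if (replaceDepAcronym && !TRUE_ACRONYM.matcher(token).matches()) {
            return "<HOST>";    // new behavior: not a real acronym
        }
        return "<ACRONYM>";     // old behavior, or a genuine acronym
    }

    public static void main(String[] args) {
        BackCompatAcronymSketch sketch = new BackCompatAcronymSketch();
        System.out.println(sketch.typeForOldAcronymMatch("www.abc.com.")); // <ACRONYM>
        sketch.setReplaceDepAcronym(true);
        System.out.println(sketch.typeForOldAcronymMatch("www.abc.com.")); // <HOST>
    }
}
```

With this shape, flipping the default in a later major release is a one-line change, and callers that never touched the flag pick up the new behavior at that point.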
>
> +1
>
> >
> >
> >> Invalid behavior of StandardTokenizerImpl
> >> -----------------------------------------
> >>
> >>                Key: LUCENE-1068
> >>                URL: https://issues.apache.org/jira/browse/LUCENE-1068
> >>            Project: Lucene - Java
> >>         Issue Type: Bug
> >>         Components: Analysis
> >>           Reporter: Shai Erera
> >>           Assignee: Grant Ingersoll
> >>        Attachments: StandardTokenizer-java-4.patch,
> >> StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch,
> >> StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch,
> >> standardTokenizerImpl.patch
> >>
> >>
> >> The following code prints the output of StandardAnalyzer:
> >>        Analyzer analyzer = new StandardAnalyzer();
> >>        TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
> >>        Token t;
> >>        while ((t = ts.next()) != null) {
> >>            System.out.println(t);
> >>        }
> >> If you pass "www.abc.com", the output is
> >> (www.abc.com,0,11,type=<HOST>), which is correct in my opinion.
> >> However, if you pass "www.abc.com." (notice the extra '.' at the
> >> end), the output is (wwwabccom,0,12,type=<ACRONYM>).
> >> I think the behavior in the second case is incorrect for several
> >> reasons:
> >> 1. It recognizes the string incorrectly (no argument there).
> >> 2. It kind of prevents you from putting URLs at the end of a
> >> sentence, which is perfectly legal.
> >> 3. An ACRONYM, at least to the best of my understanding, is of the
> >> form A.B.C. and not ABC.DEF.
> >> I looked at StandardTokenizerImpl.jflex and I think the problem
> >> comes from this definition:
> >> // acronyms: U.S.A., I.B.M., etc.
> >> // use a post-filter to remove dots
> >> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> >> Notice that the comment describes acronyms as U.S.A., I.B.M. and
> >> nothing else. I changed the definition to
> >> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> >> and it solved the problem.
> >> This was also reported here:
> >> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> >> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
> >
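The difference between the two ACRONYM definitions in the description can be approximated outside JFlex with java.util.regex. This is only a sketch: it assumes {ALPHA} means one or more letters and {LETTER} means a single letter, and it stands in \p{L} for the grammar's own character classes. The class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

// Rough java.util.regex approximation of the two JFlex rules; not the generated tokenizer.
class AcronymPatternSketch {
    // Assumed equivalent of: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Assumed equivalent of: ACRONYM = {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : samples) {
            // Whole-string matches only; a real tokenizer also applies longest-match rules.
            System.out.println(s + "  old=" + matchesOld(s) + "  new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions the old rule accepts "www.abc.com." as an acronym while the new rule rejects it, and both still accept "U.S.A." and "I.B.M.".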
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>


-- 
Regards,

Shai Erera
