Hi

I have attached two patch files, one for "java" and one for "test". Because
of a problem with my checkout of the project in Eclipse, I don't have them
under "src".
I also added one new test and modified two existing tests in
TestStandardAnalyzer.

On Dec 10, 2007 11:44 PM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202]
>
> Grant Ingersoll commented on LUCENE-1068:
> -----------------------------------------
>
> Hmmm, maybe there is a way in Eclipse to make the path relative to the
> working directory?  Otherwise, from the command line in the Lucene
> directory:  svn diff > StandardTokenizer-4.patch
>
> -Grant
>
>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
> > Invalid behavior of StandardTokenizerImpl
> > -----------------------------------------
> >
> >                 Key: LUCENE-1068
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
> >             Project: Lucene - Java
> >          Issue Type: Bug
> >          Components: Analysis
> >            Reporter: Shai Erera
> >            Assignee: Grant Ingersoll
> >         Attachments: StandardTokenizerImpl-2.patch,
> >                      StandardTokenizerImpl-3.patch,
> >                      standardTokenizerImpl.jflex.patch,
> >                      standardTokenizerImpl.patch
> >
> >
> > The following code prints the output of StandardAnalyzer:
> >         Analyzer analyzer = new StandardAnalyzer();
> >         TokenStream ts = analyzer.tokenStream("content",
> >                 new StringReader("<some text>"));
> >         Token t;
> >         while ((t = ts.next()) != null) {
> >             System.out.println(t);
> >         }
> > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>),
> > which is correct in my opinion.
> > However, if you pass "www.abc.com." (notice the extra '.' at the end),
> > the output is (wwwabccom,0,12,type=<ACRONYM>).
> > I think the behavior in the second case is incorrect for several reasons:
> > 1. It recognizes the string incorrectly (no argument about that).
> > 2. It effectively prevents you from putting a URL at the end of a
> > sentence, which is perfectly legal.
> > 3. An ACRONYM, at least to the best of my understanding, is of the form
> > A.B.C. and not ABC.DEF.
> > I looked at StandardTokenizerImpl.jflex and I think the problem comes
> > from this definition:
> > // acronyms: U.S.A., I.B.M., etc.
> > // use a post-filter to remove dots
> > ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> > Notice how the comment describes an acronym as U.S.A. or I.B.M. and not
> > anything else. I changed the definition to
> > ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> > and it solved the problem.
> > This was also reported here:
> > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
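
To illustrate why the two ACRONYM definitions quoted above behave
differently, here is a rough sketch using java.util.regex rather than JFlex.
It is not the tokenizer grammar itself. It assumes that ALPHA stands for a
run of one or more letters and LETTER for a single letter, and it
approximates both with \p{L}. The class and method names are made up for
this example.

```java
import java.util.regex.Pattern;

public class AcronymPatternSketch {
    // Assumption: ALPHA = one or more letters (approximated by \p{L}+).
    // Mirrors: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    private static final Pattern ALPHA_RULE =
            Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Assumption: LETTER = exactly one letter (approximated by \p{L}).
    // Mirrors: ACRONYM = {LETTER} "." ({LETTER} ".")+
    private static final Pattern LETTER_RULE =
            Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    /** True if the whole string matches the ALPHA-based approximation. */
    public static boolean matchesAlphaRule(String s) {
        return ALPHA_RULE.matcher(s).matches();
    }

    /** True if the whole string matches the LETTER-based approximation. */
    public static boolean matchesLetterRule(String s) {
        return LETTER_RULE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : samples) {
            System.out.println(s
                    + "  alphaRule=" + matchesAlphaRule(s)
                    + "  letterRule=" + matchesLetterRule(s));
        }
    }
}
```

Under these assumptions, the ALPHA-based pattern accepts "www.abc.com." as
well as "U.S.A.", while the LETTER-based pattern accepts only the
single-letter forms. The real tokenizer also picks among competing rules
(HOST, ACRONYM, and so on), which this sketch does not model.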


-- 
Regards,

Shai Erera
