On Dec 12, 2007, at 7:24 AM, Michael Busch (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550948 ]
Michael Busch commented on LUCENE-1068:
---------------------------------------
{quote}
The member is marked deprecated so we can remove it in the next
release. Applications that would like the new behavior need to do
nothing, and therefore will not be impacted once we remove that
member. Applications that want the old behavior need to explicitly
set it and in the next major release remove it.
{quote}
Doesn't this mean it is an API change if we make the new behavior
the default? Apps that upgrade will see the new behavior unless they
call replaceDepAcronym.
To be fully backwards compatible I think this patch should use the
old behavior as default. Then in 3.0 we can make the new behavior
the default.
+1
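To make the compatibility point concrete, here is a minimal sketch of
what "old behavior as default" could look like. The class name, the
setter/getter names and the returned type strings are hypothetical and
only illustrate the idea; this is not the code from the attached
patches.

```java
public class AcronymCompatSketch {
    // Hypothetical flag modeled on the deprecated member discussed above.
    // false = old behavior, so applications that upgrade see no change.
    private boolean replaceDepAcronym = false;

    /** @deprecated would be removed once the new behavior becomes the default. */
    @Deprecated
    public void setReplaceDepAcronym(boolean replace) {
        this.replaceDepAcronym = replace;
    }

    /** @deprecated see {@link #setReplaceDepAcronym(boolean)}. */
    @Deprecated
    public boolean isReplaceDepAcronym() {
        return replaceDepAcronym;
    }

    // Type reported for input such as "www.abc.com." that the old
    // grammar rule labels as an acronym (illustrative strings only).
    public String typeForOldAcronymMatch() {
        return replaceDepAcronym ? "<HOST>" : "<ACRONYM>";
    }
}
```

With this shape, only applications that opt in see the corrected type
before the default is flipped in a major release.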
Invalid behavior of StandardTokenizerImpl
-----------------------------------------
Key: LUCENE-1068
URL: https://issues.apache.org/jira/browse/LUCENE-1068
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Shai Erera
Assignee: Grant Ingersoll
Attachments: StandardTokenizer-java-4.patch,
StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch,
StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch,
standardTokenizerImpl.patch
The following code prints the output of StandardAnalyzer:
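import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
// Print every token the analyzer produces for the given text.
Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
Token t;
while ((t = ts.next()) != null) {
    System.out.println(t);
}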
If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
(which is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the
end), the output is (wwwabccom,0,12,type=<ACRONYM>).
I think the behavior in the second case is incorrect for several
reasons:
1. It recognizes the string incorrectly (no argument there).
2. It kind of prevents you from putting URLs at the end of a
sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the
form A.B.C. and not ABC.DEF.
I looked at StandardTokenizerImpl.jflex and I think the problem
comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM = {ALPHA} "." ({ALPHA} ".")+
Notice that the comment describes acronyms such as U.S.A. and I.B.M.,
not arbitrary dotted words. I changed the definition to
ACRONYM = {LETTER} "." ({LETTER} ".")+
and it solved the problem.
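The difference between the two definitions can be sketched with
java.util.regex. This assumes that ALPHA in the grammar means "one or
more letters" and LETTER means "exactly one letter"; the patterns below
only approximate the JFlex rules, and the class and method names are
made up for illustration.

```java
import java.util.regex.Pattern;

public class AcronymRuleSketch {
    // Approximation of: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    // (assumes ALPHA is a run of one or more letters)
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Approximation of: ACRONYM = {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"U.S.A.", "I.B.M.", "www.abc.com."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions both patterns accept "U.S.A." and "I.B.M.",
but only the old one accepts "www.abc.com.", which matches the
behavior described above.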
This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]