Invalid behavior of StandardTokenizerImpl
-----------------------------------------
Key: LUCENE-1068
URL: https://issues.apache.org/jira/browse/LUCENE-1068
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Shai Erera
The following code prints the output of StandardAnalyzer:
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
    Token t;
    while ((t = ts.next()) != null) {
        System.out.println(t);
    }
If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which
is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the
output is (wwwabccom,0,12,type=<ACRONYM>).
I think the behavior in the second case is incorrect for several reasons:
1. It classifies the string incorrectly (no argument there).
2. It effectively prevents you from ending a sentence with a URL, which is
perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C.
and not ABC.DEF.
I looked at StandardTokenizerImpl.jflex and I think the problem comes from this
definition:
    // acronyms: U.S.A., I.B.M., etc.
    // use a post-filter to remove dots
    ACRONYM = {ALPHA} "." ({ALPHA} ".")+
Note that the comment describes acronyms as U.S.A., I.B.M., i.e. single letters
separated by dots, and nothing else. I changed the definition to
    ACRONYM = {LETTER} "." ({LETTER} ".")+
and it solved the problem.
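As a rough illustration of why the narrower definition helps, the two
definitions can be approximated with java.util.regex. This is only a sketch
under assumptions: \p{L} stands in for the grammar's LETTER macro, ALPHA is
taken to mean one or more LETTERs, and whole-string matching replaces JFlex's
longest-match rule selection. The class and method names are made up for the
example and are not part of Lucene.

```java
import java.util.regex.Pattern;

public class AcronymPatternDemo {
    // Assumption: \p{L} approximates LETTER; ALPHA is treated as LETTER+.
    // Old definition: {ALPHA} "." ({ALPHA} ".")+
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Proposed definition: {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    // True if the whole string fits the old (multi-letter) acronym shape.
    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    // True if the whole string fits the proposed (single-letter) acronym shape.
    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"U.S.A.", "www.abc.com.", "www.abc.com"};
        for (String s : inputs) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

In this approximation "U.S.A." fits both patterns, "www.abc.com." fits only the
old one, and "www.abc.com" fits neither, which is consistent with the trailing
'.' being what turns the host into an ACRONYM under the current grammar.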
This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926