[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-2.patch

I've found a way to do it (I think): I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs, and fixed the current ACRONYM to identify proper ones. I also marked ACRONYM_DEP as deprecated.

I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM instead, in case there are applications that rely on the Token type.

I wrote this short test to verify it works:

    public static void main(String[] args) throws Exception {
        parse("www.abc.com.");
        parse("www.abc.com");
        parse("I.B.M.");
    }

    public static void parse(String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }
    }

And the output is:

    (www.abc.com.,0,12,type=<HOST>)
    (www.abc.com,0,11,type=<HOST>)
    (ibm,0,6,type=<ACRONYM>)

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>          Key: LUCENE-1068
>          URL: https://issues.apache.org/jira/browse/LUCENE-1068
>      Project: Lucene - Java
>   Issue Type: Bug
>   Components: Analysis
>     Reporter: Shai Erera
>  Attachments: StandardTokenizerImpl-2.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>
>     Analyzer analyzer = new StandardAnalyzer();
>     TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>     Token t;
>     while ((t = ts.next()) != null) {
>         System.out.println(t);
>     }
>
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
> (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It effectively prevents you from putting URLs at the end of a sentence,
>    which is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form
>    A.B.C. and not ABC.DEF.
>
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from
> this definition:
>
>     // acronyms: U.S.A., I.B.M., etc.
>     // use a post-filter to remove dots
>     ACRONYM = {ALPHA} "." ({ALPHA} ".")+
>
> Notice how the comment describes acronyms as U.S.A. and I.B.M., not as
> anything else. I changed the definition to
>
>     ACRONYM = {LETTER} "." ({LETTER} ".")+
>
> and it solved the problem.
>
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
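To make the difference between the two ACRONYM definitions concrete, here is a rough sketch that approximates both rules with java.util.regex. It is not the JFlex grammar itself. It assumes that LETTER can be approximated by \p{L} and that ALPHA is one or more LETTERs, and the class and method names are made up for illustration. It also tests whole strings only, whereas the real tokenizer picks the longest match among all of its rules.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
final class AcronymPatternDemo {
    // Assumption: approximate the grammar's LETTER macro with any Unicode letter.
    private static final String LETTER = "\\p{L}";
    // Assumption: ALPHA is a run of one or more LETTERs.
    private static final String ALPHA = LETTER + "+";

    // Old rule: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    static final Pattern OLD_ACRONYM =
            Pattern.compile(ALPHA + "\\.(" + ALPHA + "\\.)+");

    // Proposed rule: ACRONYM = {LETTER} "." ({LETTER} ".")+
    static final Pattern NEW_ACRONYM =
            Pattern.compile(LETTER + "\\.(" + LETTER + "\\.)+");

    // True if the whole string matches the old (multi-letter segment) rule.
    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    // True if the whole string matches the proposed (single-letter segment) rule.
    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"www.abc.com.", "www.abc.com", "I.B.M."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions, "www.abc.com." satisfies only the old rule, "I.B.M." satisfies both, and "www.abc.com" satisfies neither, which is consistent with the behavior described in the issue.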