Improve StandardTokenizer's understanding of non ASCII punctuation and quotes
-----------------------------------------------------------------------------
Key: LUCENE-2244
URL: https://issues.apache.org/jira/browse/LUCENE-2244
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 3.0
Reporter: Andi Vajda

In the vein of LUCENE-1126 and LUCENE-1390, StandardTokenizerImpl.jflex should do a better job of understanding non-ASCII punctuation characters. For example, its understanding of the single-quote character "'" is currently limited to that character only: it sets a token's type to APOSTROPHE only if the ASCII "'" was used.

In the attached patch, I added all the characters that ASCIIFoldingFilter would change into "'". I'm not sure this is the right approach, so I didn't write a complete patch covering the other characters hardcoded in the jflex rules, such as "." and "-", which also have variants in ASCIIFoldingFilter that could be used.

Maybe a better approach would be to make it possible to have an ASCIIFoldingFilter-like reader as a character filter that could be inserted in front of StandardTokenizer?
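
To make the reader-based idea above concrete, here is a minimal sketch that uses only java.io. It is not the attached patch and it is not a Lucene API: the class name ApostropheFoldingReader, the helper foldAll, and the set of folded characters are all assumptions chosen for illustration, and the character set is only a small subset of what ASCIIFoldingFilter handles. Wiring such a reader in front of StandardTokenizer is not shown.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical reader that maps a few Unicode apostrophe variants to ASCII "'". */
class ApostropheFoldingReader extends FilterReader {

    ApostropheFoldingReader(Reader in) {
        super(in);
    }

    /** Illustrative subset only; not the full ASCIIFoldingFilter mapping. */
    static char fold(char c) {
        switch (c) {
            case '\u2018': // U+2018 LEFT SINGLE QUOTATION MARK
            case '\u2019': // U+2019 RIGHT SINGLE QUOTATION MARK
            case '\u201B': // U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
            case '\uFF07': // U+FF07 FULLWIDTH APOSTROPHE
                return '\'';
            default:
                return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : fold((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = fold(buf[i]);
        }
        return n;
    }

    /** Convenience helper for trying the reader on an in-memory string. */
    static String foldAll(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new ApostropheFoldingReader(new StringReader(text))) {
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because each variant is replaced by exactly one character, the character positions seen by whatever consumes the reader stay the same as in the original text. Folding "." or "-" variants would follow the same pattern, but any mapping that changes the text length would also need offset correction, which this sketch does not attempt.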