[ https://issues.apache.org/jira/browse/LUCENE-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916044#action_12916044 ]
Robert Muir commented on LUCENE-2244:
-------------------------------------

Andi, now that LUCENE-2167 is resolved, I think we should revisit this one.

There are two issues:
* Does the tokenization work the way it should for these quotes? (I think LUCENE-2167 will do the right thing.)
* Does the 's-stripping (analysis.en.EnglishPossessiveFilter) work the way you want?

For the latter: the filter itself could be extended to include more apostrophes, maybe from this list: http://unicode.org/cldr/utility/confusables.jsp?a='&r=None

Or someone could put ASCIIFoldingFilter before the EnglishPossessiveFilter.

> Improve StandardTokenizer's understanding of non ASCII punctuation and quotes
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-2244
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2244
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 3.0
>            Reporter: Andi Vajda
>         Attachments: StandardTokenizerImpl.jflex.diff
>
>
> In the vein of LUCENE-1126 and LUCENE-1390, StandardTokenizerImpl.jflex
> should do a better job at understanding non-ASCII punctuation characters.
> For example, its understanding of the single-quote character "'" is currently
> limited to that character only. It will set a token's type to APOSTROPHE only
> if the "'" was used.
> In the attached patch, I added all the characters that ASCIIFoldingFilter
> would change into "'".
> I'm not sure that this is the right approach, so I didn't write a complete
> patch for all the other hardcoded characters used in jflex rules, such as "."
> and "-", which have some variants in ASCIIFoldingFilter that could be used as well.
> Maybe a better approach would be to make it possible to have an
> ASCIIFoldingFilter-like reader as a character filter that could be
> inserted in front of StandardTokenizer?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
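To illustrate why filter order matters for the second option in the comment (folding before possessive stripping), here is a minimal, library-free sketch. It does not use Lucene's classes: PossessiveSketch, foldApostrophes, stripPossessive and the small APOSTROPHE_VARIANTS set are hypothetical stand-ins for ASCIIFoldingFilter and EnglishPossessiveFilter, and the variant set is deliberately incomplete.

```java
// Hypothetical stand-in for the proposed filter order; not Lucene's implementation.
final class PossessiveSketch {

    // Assumed, incomplete sample of apostrophe look-alikes:
    // U+2018, U+2019, U+201B (curly/reversed quotes) and U+FF07 (fullwidth apostrophe).
    private static final String APOSTROPHE_VARIANTS = "\u2018\u2019\u201B\uFF07";

    // Stand-in for the folding step: map each look-alike to ASCII "'".
    public static String foldApostrophes(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            sb.append(APOSTROPHE_VARIANTS.indexOf(c) >= 0 ? '\'' : c);
        }
        return sb.toString();
    }

    // Stand-in for the possessive step: drop a trailing ASCII 's or 'S only.
    public static String stripPossessive(String token) {
        int n = token.length();
        if (n >= 2 && token.charAt(n - 2) == '\''
                && (token.charAt(n - 1) == 's' || token.charAt(n - 1) == 'S')) {
            return token.substring(0, n - 2);
        }
        return token;
    }

    public static void main(String[] args) {
        String token = "Lucene\u2019s"; // possessive written with U+2019
        // Without folding, the ASCII-only check leaves the possessive in place.
        System.out.println(stripPossessive(token));
        // Folding first lets the unchanged possessive step strip it.
        System.out.println(stripPossessive(foldApostrophes(token)));
    }
}
```

The alternative from the comment, extending the possessive step itself, would amount to checking membership in a larger apostrophe set inside stripPossessive instead of folding beforehand.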