Hi Paul,

StandardTokenizer implements the Word Boundaries rules from Unicode Standard Annex #29 (UAX#29), Unicode Text Segmentation. Here's the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens. In your example, "!!!" contains no letters or digits, so no token is returned.

Steve

On Sep 30, 2014, at 3:54 PM, Paul Taylor <paul_t...@fastmail.fm> wrote:

> Does StandardTokenizer remove punctuation (in Lucene 4.1)?
>
> I'm just trying to move back to StandardTokenizer from my own old custom
> implementation because the newer version seems to have much better support
> for Asian languages.
>
> However, this code excerpt fails on incrementToken(), implying that the !!!
> are removed from the output, yet looking at the JFlex classes I can't see
> anything to indicate punctuation is removed. Is it removed, and if so can I
> remove it?
>
> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
>         new StringReader("!!!"));
> assertNotNull(tokenizer);
> tokenizer.reset();
> assertTrue(tokenizer.incrementToken());
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
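
A minimal sketch of the rule described in the reply above: split the text at word boundaries, then keep only the segments that contain at least one letter or digit. This is an illustration only. It uses the JDK's java.text.BreakIterator as a stand-in for Lucene's JFlex-generated scanner, so its boundaries may differ from StandardTokenizer's in edge cases, and the class name WordBoundarySketch and the method tokens() are hypothetical names made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundarySketch {

    // Returns the segments between word boundaries that contain at least one
    // letter or digit. Every other segment (punctuation, whitespace) is skipped.
    // Assumption: BreakIterator approximates the UAX#29 word boundaries; it is
    // not the scanner that Lucene's StandardTokenizer actually uses.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Keep the segment only if it has a letter or a digit somewhere in it.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("!!!"));             // prints []
        System.out.println(tokens("Hello, world!!!")); // prints [Hello, world]
    }
}
```

With this filter, an input made up purely of punctuation yields an empty list, which mirrors why the first incrementToken() call in the quoted test returns false.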