Yes, most special characters are treated as term delimiters, except that
underscores, dots, and commas have some special rules.
See the details under Standard Tokenizer in my Solr e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
That doesn't give you Java details for Lucene, but the tokenizer rules are
the same.
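As a rough illustration of why "!!!" produces no tokens, here is a minimal sketch that uses only the JDK, not Lucene. It assumes that java.text.BreakIterator is a reasonable stand-in for the Unicode word-break rules that StandardTokenizer follows. The two are not identical, and the class name PunctuationSketch and the method wordTokens are hypothetical names made up for this example. The idea is that text is split at word boundaries and only segments containing a letter or digit are kept, so punctuation-only segments are dropped.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT Lucene code: approximates "split on word
// boundaries, then discard segments with no letters or digits".
class PunctuationSketch {

    static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Keep a segment only if it contains at least one letter or digit;
            // punctuation-only and whitespace-only segments are skipped.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokens("!!!"));           // prints []
        System.out.println(wordTokens("hello, world!")); // prints [hello, world]
    }
}
```

With input like "!!!" no segment survives the filter, which matches the behaviour you are seeing: the first call to incrementToken() has nothing to return.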
-- Jack Krupansky
-----Original Message-----
From: Paul Taylor
Sent: Tuesday, September 30, 2014 3:54 PM
To: java-user@lucene.apache.org
Subject: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Does StandardTokenizer remove punctuation (in Lucene 4.1)?
I'm just trying to move back to StandardTokenizer from my own old custom
implementation because the newer version seems to have much better
support for Asian languages.
However, the assertion in this code fails on incrementToken(), implying
that the !!! are removed from the output. Yet looking at the JFlex classes
I can't see anything to indicate that punctuation is removed. Is it removed,
and if so, can I stop it from being removed?
Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
assertNotNull(tokenizer);
tokenizer.reset();
// This assertion fails: no token is produced for "!!!"
assertTrue(tokenizer.incrementToken());
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org