Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Jack Krupansky Tue, 30 Sep 2014 14:37:19 -0700

Yes, most special characters are treated as term delimiters, except that underscores, dots, and commas have some special rules.


See the details under Standard Tokenizer in my Solr e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

That doesn't give you Java details for Lucene, but the tokenizer rules are the same.
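
To make the "delimiter" idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: it swaps in the JDK's java.text.BreakIterator to find word boundaries and then drops any segment that has no letter or digit. The class name WordSegments and the method tokens are hypothetical names chosen for illustration. The sketch only shows why an input such as "!!!" produces no tokens at all.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: approximates "punctuation acts as a
// delimiter" using the JDK's word-boundary iterator instead of StandardTokenizer.
class WordSegments {

    // Returns the word-boundary segments that contain at least one letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Segments made only of punctuation or whitespace are dropped.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("!!!"));            // prints []
        System.out.println(tokens("hello!!! world")); // prints [hello, world]
    }
}
```

This stand-in does not try to reproduce StandardTokenizer's special handling of underscores, dots, and commas, so treat it as an illustration of the general behavior rather than a substitute for testing against Lucene itself.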


-- Jack Krupansky

-----Original Message-----
From: Paul Taylor
Sent: Tuesday, September 30, 2014 3:54 PM
To: java-user@lucene.apache.org
Subject: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Does StandardTokenizer remove punctuation (in Lucene 4.1)

I'm just trying to move back to StandardTokenizer from my own old custom
implementation because the newer version seems to have much better
support for Asian languages.

However, this code fails on incrementToken(), implying that the !!!
are removed from the output, yet looking at the JFlex classes I can't see
anything to indicate that punctuation is removed. Is it removed, and if so, can
I remove it?

Tokenizer tokenizer =
    new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
assertNotNull(tokenizer);
tokenizer.reset();
assertTrue(tokenizer.incrementToken());

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

For additional commands, e-mail: java-user-h...@lucene.apache.org

