Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Steve Rowe Tue, 30 Sep 2014 22:02:48 -0700

Hi Paul,

StandardTokenizer implements the Word Boundaries rules from Unicode Standard 
Annex #29 (UAX#29), Unicode Text Segmentation. Here’s the relevant section for 
Unicode 6.1.0, which is the version supported by Lucene 4.1.0: 
<http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.


Only those sequences between boundaries that contain letters and/or digits are 
returned as tokens; all other sequences between boundaries are skipped over and 
not returned as tokens.
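
To make the filtering rule concrete, here is a minimal sketch using only the 
JDK. It is not Lucene's implementation: it assumes java.text.BreakIterator as a 
stand-in for the UAX#29 word-boundary rules (its rules are similar but not 
guaranteed identical to StandardTokenizer's JFlex grammar), and the class and 
method names are made up for illustration. It keeps only the segments that 
contain at least one letter or digit, so "!!!" produces no tokens at all.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
public class WordBoundarySketch {

    // Splits text at word boundaries and returns only the segments that
    // contain at least one letter or digit; all other segments are skipped.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        // Assumption: the JDK word iterator approximates UAX#29 word boundaries.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("!!!"));             // prints []
        System.out.println(tokens("Hello, world!!!"));
    }
}
```

With the real StandardTokenizer the same rule means that incrementToken() 
returns false on the first call when the input is only punctuation.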

Steve

On Sep 30, 2014, at 3:54 PM, Paul Taylor <paul_t...@fastmail.fm> wrote:

> Does StandardTokenizer remove punctuation (in Lucene 4.1)
> 
> I'm just trying to move back to StandardTokenizer from my own old custom 
> implementation because the newer version seems to have much better support for 
> Asian languages.
> 
> However, this code excerpt fails on incrementToken(), implying that the !!! are 
> removed from the output, yet looking at the JFlex classes I can't see anything to 
> indicate punctuation is removed. Is it removed, and if so can I remove it?
> 
> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
>         new StringReader("!!!"));
> assertNotNull(tokenizer);
> tokenizer.reset();
> // This is the assertion that fails for the input "!!!"
> assertTrue(tokenizer.incrementToken());
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


