[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002248#comment-15002248 ]
Uwe Schindler commented on LUCENE-6874:
---------------------------------------

Here is the output of the reuters test:

{noformat}
------------> Report Sum By (any) Name and Round (28 about 33 out of 34)
Operation                                                                         round  runCnt  recsPerRun       rec/s  elapsedSec  avgUsedMem  avgTotalMem
AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))              0       1           0        0.00        0.00   9,569,344  124,256,256
AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))    0       1           0        0.00        0.00   9,569,344  124,256,256
Rounds_5                                                                              0       1    24493540  360,841.19       67.88  16,566,472  124,256,256
NewAnalyzer(WhitespaceTokenizer)                                                      0       1           0        0.00        0.00   9,569,344  124,256,256
[Character.isWhitespace()] WhitespaceTokenizer                                        0       1     2449354  331,038.53        7.40  22,121,256  124,256,256
Seq_20000                                                                             0       2     2449354  344,131.22       14.23  22,121,256  118,489,088
NewAnalyzer(UnicodeWhitespaceTokenizer)                                               0       1           0        0.00        0.00  22,121,256  112,721,920
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                            0       1     2449354  358,302.22        6.84  22,121,256  112,721,920
NewAnalyzer(WhitespaceTokenizer)                                                      1       1           0        0.00        0.00  12,138,024  112,721,920
[Character.isWhitespace()] WhitespaceTokenizer                                        1       1     2449354  366,724.66        6.68  22,374,536  112,721,920
Seq_20000                                                                             1       2     2449354  365,139.25       13.42  27,477,352  117,702,656
NewAnalyzer(UnicodeWhitespaceTokenizer)                                               1       1           0        0.00        0.00  22,374,536  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                            1       1     2449354  363,567.47        6.74  32,580,168  122,683,392
NewAnalyzer(WhitespaceTokenizer)                                                      2       1           0        0.00        0.00  32,580,168  122,683,392
[Character.isWhitespace()] WhitespaceTokenizer                                        2       1     2449354  365,793.59        6.70  33,461,280  122,683,392
Seq_20000                                                                             2       2     2449354  365,112.03       13.42  33,461,280  117,178,368
NewAnalyzer(UnicodeWhitespaceTokenizer)                                               2       1           0        0.00        0.00  33,461,280  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                            2       1     2449354  364,432.97        6.72  33,461,280  111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                      3       1           0        0.00        0.00  10,836,464  111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                        3       1     2449354  367,660.47        6.66  12,451,400  111,673,344
Seq_20000                                                                             3       2     2449354  365,820.94       13.39  13,235,672  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                               3       1           0        0.00        0.00  12,451,400  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                            3       1     2449354  363,999.69        6.73  14,019,944  111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                      4       1           0        0.00        0.00  14,019,944  111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                        4       1     2449354  367,329.62        6.67  15,061,368  111,673,344
Seq_20000                                                                             4       2     2449354  365,057.59       13.42  15,813,920  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                               4       1           0        0.00        0.00  15,061,368  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                            4       1     2449354  362,813.50        6.75  16,566,472  111,673,344
{noformat}

As you can see, both tokenizers run at almost the same speed.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch,
>                      LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch,
>                      icu-datasucker.patch, unicode-ws-tokenizer.patch,
>                      unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?
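
The behavior described in the issue can be checked with plain JDK calls. The sketch below is only an illustration, not Lucene code: the class `NbspWhitespaceDemo`, the `split` helper, and the hand-written `isUnicodeWhitespace` predicate (a hard-coded copy of the Unicode White_Space code points) are all hypothetical names introduced here, and the predicate is an assumption standing in for whatever lookup the patches actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NbspWhitespaceDemo {

    // Hypothetical stand-in for a Unicode-aware check: hard-codes the code
    // points that carry the Unicode White_Space property, including the
    // non-breaking spaces U+00A0, U+2007 and U+202F.
    public static boolean isUnicodeWhitespace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D)
            || cp == 0x0020 || cp == 0x0085 || cp == 0x00A0 || cp == 0x1680
            || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028 || cp == 0x2029 || cp == 0x202F
            || cp == 0x205F || cp == 0x3000;
    }

    // Minimal tokenizer sketch: splits the text wherever the given predicate
    // classifies a code point as whitespace, dropping empty tokens.
    public static List<String> split(String text, IntPredicate isSpace) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSpace.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz";
        // Character.isWhitespace does not treat NBSP as whitespace,
        // so "foo" and "bar" stay glued together in the first result.
        System.out.println(split(text, Character::isWhitespace).size());
        System.out.println(split(text, NbspWhitespaceDemo::isUnicodeWhitespace).size());
    }
}
```

Passing the predicate in as an `IntPredicate` keeps the splitting loop identical for both rules, which mirrors the shape of the benchmark above: same input, same loop, only the whitespace test differs.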