[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Rowe updated LUCENE-6874:
-------------------------------
Attachment: LUCENE-6874-jflex.patch
Patch adding a JFlex-based UnicodeWhitespaceTokenizer(Factory), along with some
performance testing bits, some of which aren't committable (hard-coded paths).
Also includes the SPI entry missing from Uwe's patch for
{{ICUWhitespaceTokenizerFactory}} in
{{lucene/icu/src/resources/META-INF/services/o.a.l.analysis.util.TokenizerFactory}},
as well as a couple of bugfixes for {{lucene/benchmark}} (which I'll commit under
a separate JIRA issue). The patch also incorporates Uwe's patch in full.
I did three performance comparisons on my MacBook Pro with Oracle Java 1.8.0_20
of the {{Character.isWhitespace()}}-based {{WhitespaceTokenizer}}, Uwe's
{{ICUWhitespaceTokenizer}}, and the JFlex {{UnicodeWhitespaceTokenizer}}:
1. Using the {{wstok.alg}} file in the patch, I ran {{lucene/benchmark}} over 20k
(English news) Reuters docs for 5 rounds, dropping the lowest-throughput round
and averaging the other 4:
||Tokenizer||Avg tok/sec||Throughput compared to {{WhitespaceTokenizer}}||
|{{WhitespaceTokenizer}}|1.515M|N/A|
|{{ICUWhitespaceTokenizer}}|1.447M|{color:red}-5.5%{color}|
|{{UnicodeWhitespaceTokenizer}}|1.514M|{color:red}-0.1%{color}|
2. I concatenated all ~20k Reuters docs into one file, loaded it into memory,
and then ran each tokenizer over it 11 times, discarding the first round and
averaging the other 10 (this is {{testReuters()}} in the {{Test*}} files in
the patch):
||Tokenizer||Avg tok/sec||Throughput compared to {{WhitespaceTokenizer}}||
|{{WhitespaceTokenizer}}|14.47M|N/A|
|{{ICUWhitespaceTokenizer}}|9.26M|{color:red}-36%{color}|
|{{UnicodeWhitespaceTokenizer}}|11.60M|{color:red}-20%{color}|
3. I used a fixed random seed and generated 10k random Unicode strings of at
most 10k chars each using {{TestUtil.randomUnicodeString()}}. Note that this is
unrealistic data for tokenization, not least because the average whitespace
density is very low compared to natural language. While running this test I
noticed that {{WhitespaceTokenizer}} was returning many more tokens than the
other two, and I tracked this down to differences in their definitions of whitespace:
* {{Character.isWhitespace()}} returns true for the following while Unicode
6.3.0 (Lucene's current Unicode version) does not: U+001C, U+001D, U+001E,
U+001F, U+180E. (U+180E was removed from Unicode's whitespace definition in
Unicode 6.3.0; Java 8 uses Unicode 6.2.0.)
* Unicode 6.3.0 says the following are whitespace while
{{Character.isWhitespace()}} does not: U+0085, U+00A0, U+2007, U+202F. The
last 3 are documented exclusions in the {{Character.isWhitespace()}} javadocs,
but U+0085 {{NEXT LINE (NEL)}} isn't documented anywhere I can see; it was
added to Unicode's whitespace definition in Unicode 3.0.
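The disagreement above can be spot-checked directly against {{Character.isWhitespace()}} with a small standalone sketch (the expected Unicode 6.3.0 values are hard-coded from the lists above, not computed; U+180E is omitted from the assertions because its Java result depends on the JDK's Unicode version):

{code:java}
public class WhitespaceDiff {
    public static void main(String[] args) {
        // Java considers these whitespace, but Unicode 6.3.0 does not.
        int[] javaOnly = {0x001C, 0x001D, 0x001E, 0x001F};
        for (int cp : javaOnly) {
            if (!Character.isWhitespace(cp)) {
                throw new AssertionError("expected Java whitespace: U+" + Integer.toHexString(cp));
            }
        }
        // Unicode 6.3.0 considers these whitespace, but Java does not:
        // the three documented non-breaking spaces plus U+0085 NEXT LINE (NEL).
        int[] unicodeOnly = {0x0085, 0x00A0, 0x2007, 0x202F};
        for (int cp : unicodeOnly) {
            if (Character.isWhitespace(cp)) {
                throw new AssertionError("expected non-whitespace in Java: U+" + Integer.toHexString(cp));
            }
        }
        System.out.println("all whitespace-definition checks passed");
    }
}
{code}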
So in order to directly compare the performance of the three tokenizers
over this data, I replaced all non-consensus whitespace characters with a space
before running the test.
||Tokenizer||Avg tok/sec||Throughput compared to {{WhitespaceTokenizer}}||
|{{WhitespaceTokenizer}}|897k|N/A|
|{{ICUWhitespaceTokenizer}}|880k|{color:red}-2%{color}|
|{{UnicodeWhitespaceTokenizer}}|1,605k|{color:green}+79%{color}|
One other thing I noticed in this test when I compared
{{ICUWhitespaceTokenizer}}'s output with {{UnicodeWhitespaceTokenizer}}'s:
they don't always find the same break points.
This is because although both forcibly break at the max token length (255
chars, fixed for {{CharTokenizer}} and the default for Lucene's JFlex
scanners), {{CharTokenizer}} allows tokens to exceed its max token char length
of 255 by one char when a surrogate pair would otherwise be broken, while
Lucene's JFlex scanners break at 254 chars in this case.
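The two truncation behaviors can be illustrated with a small standalone sketch (this is not Lucene's actual code, just the two policies described above): with a max token length of 255 and a surrogate pair straddling the boundary, a {{CharTokenizer}}-style cut lets the token grow to 256 chars rather than split the pair, while a scanner that refuses to emit a dangling high surrogate stops at 254.

{code:java}
public class SurrogateTruncation {
    /** Truncate to at most maxLen chars, but allow one extra char so a
     *  trailing surrogate pair stays whole (CharTokenizer-style behavior). */
    static String truncateKeepPair(String s, int maxLen) {
        if (s.length() <= maxLen) return s;
        if (Character.isHighSurrogate(s.charAt(maxLen - 1))
                && Character.isLowSurrogate(s.charAt(maxLen))) {
            return s.substring(0, maxLen + 1);
        }
        return s.substring(0, maxLen);
    }

    /** Truncate to at most maxLen chars, backing up one char so a lone
     *  high surrogate is never emitted (JFlex-scanner-style behavior). */
    static String truncateBackUp(String s, int maxLen) {
        if (s.length() <= maxLen) return s;
        if (Character.isHighSurrogate(s.charAt(maxLen - 1))) {
            return s.substring(0, maxLen - 1);
        }
        return s.substring(0, maxLen);
    }

    public static void main(String[] args) {
        // 254 ASCII chars followed by one surrogate pair (two chars), so the
        // pair straddles the 255-char boundary.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 254; i++) sb.append('a');
        sb.appendCodePoint(0x1F600); // any supplementary code point
        String token = sb.toString();
        System.out.println(truncateKeepPair(token, 255).length()); // 256
        System.out.println(truncateBackUp(token, 255).length());   // 254
    }
}
{code}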
-----
Conclusion: for throughput over realistic ASCII data, the original
{{WhitespaceTokenizer}} performs best, followed by the JFlex-based tokenizer in
this patch ({{UnicodeWhitespaceTokenizer}}), followed by the ICU-based
{{ICUWhitespaceTokenizer}} in Uwe's patch.
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)