[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Rowe updated LUCENE-6874:
-------------------------------
Attachment: LUCENE-6874-jflex.patch
Patch adding a JFlex-based UnicodeWhitespaceTokenizer(Factory), along with some
performance testing bits, some of which aren't committable (hard-coded paths).
Also includes the SPI entry missing from Uwe's patch for
{{ICUWhitespaceTokenizerFactory}} in
{{lucene/icu/src/resources/META-INF/services/o.a.l.analysis.util.TokenizerFactory}},
as well as a couple of bugfixes for {{lucene/benchmark}} (which I'll commit under
a separate JIRA issue). The patch also incorporates Uwe's patch in full.
I did three performance comparisons on my MacBook Pro with Oracle Java 1.8.0_20
of the {{Character.isWhitespace()}}-based {{WhitespaceTokenizer}}, Uwe's
{{ICUWhitespaceTokenizer}}, and the JFlex {{UnicodeWhitespaceTokenizer}}:
1. Using the {{wstok.alg}} file in the patch, I ran {{lucene/benchmark}} over 20k
(English news) Reuters docs for 5 rounds, dropping the lowest-throughput round
and averaging the other 4:
||Tokenizer||Avg tok/sec||Throughput compared to {{WhitespaceTokenizer}}||
|{{WhitespaceTokenizer}}|1.515M|N/A|
|{{ICUWhitespaceTokenizer}}|1.447M|{color:red}-5.5%{color}|
|{{UnicodeWhitespaceTokenizer}}|1.514M|{color:red}-0.1%{color}|
2. I concatenated all ~20k Reuters docs into one file, loaded it into memory,
and then ran each tokenizer over it 11 times, discarding the first round and
averaging the other 10 (this is {{testReuters()}} in the {{Test*}} files in
the patch):
||Tokenizer||Avg tok/sec||Throughput compared to {{WhitespaceTokenizer}}||
|{{WhitespaceTokenizer}}|14.47M|N/A|
|{{ICUWhitespaceTokenizer}}|9.26M|{color:red}-36%{color}|
|{{UnicodeWhitespaceTokenizer}}|11.60M|{color:red}-20%{color}|
3. I used a fixed random seed and generated 10k random Unicode strings of at
most 10k chars each using {{TestUtil.randomUnicodeString()}}. Note that this is
unrealistic data for tokenization, not least because the average whitespace
density is very low compared to natural language. While running this test I
noticed that {{WhitespaceTokenizer}} was returning many more tokens than the
other two, and I tracked this down to differences in their definitions of whitespace:
* {{Character.isWhitespace()}} returns true for the following while Unicode
6.3.0 (Lucene's current Unicode version) does not: U+001C, U+001D, U+001E,
U+001F, U+180E. (U+180E was removed from Unicode's whitespace definition in
Unicode 6.3.0; Java 8 uses Unicode 6.2.0.)
* Unicode 6.3.0 says the following are whitespace while
{{Character.isWhitespace()}} does not: U+0085, U+00A0, U+2007, U+202F. The
last 3 are documented exclusions in the {{Character.isWhitespace()}} javadocs,
but U+0085 {{NEXT LINE (NEL)}} isn't documented anywhere I can see; it was
added to Unicode's whitespace definition in Unicode 3.0.
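The disagreement above can be spot-checked directly against {{Character.isWhitespace()}} with a small standalone sketch (the expected Unicode 6.3.0 values are hard-coded from the lists above, not computed; U+180E is omitted from the assertions because its Java result depends on the JDK's Unicode version):

{code:java}
public class WhitespaceDiff {
    public static void main(String[] args) {
        // Java considers these whitespace, but Unicode 6.3.0 does not.
        int[] javaOnly = {0x001C, 0x001D, 0x001E, 0x001F};
        for (int cp : javaOnly) {
            if (!Character.isWhitespace(cp)) {
                throw new AssertionError("expected Java whitespace: U+" + Integer.toHexString(cp));
            }
        }
        // Unicode 6.3.0 considers these whitespace, but Java does not:
        // the three documented non-breaking spaces plus U+0085 NEXT LINE (NEL).
        int[] unicodeOnly = {0x0085, 0x00A0, 0x2007, 0x202F};
        for (int cp : unicodeOnly) {
            if (Character.isWhitespace(cp)) {
                throw new AssertionError("expected non-whitespace in Java: U+" + Integer.toHexString(cp));
            }
        }
        System.out.println("all whitespace-definition checks passed");
    }
}
{code}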
So in order to directly compare the performance of the three tokenizers
over this data, I replaced all non-consensus whitespace characters with a space
before running the test.
||Tokenizer||Avg tok/sec||Throughput compared to {{WhitespaceTokenizer}}||
|{{WhitespaceTokenizer}}|897k|N/A|
|{{ICUWhitespaceTokenizer}}|880k|{color:red}-2%{color}|
|{{UnicodeWhitespaceTokenizer}}|1,605k|{color:green}+79%{color}|
One other thing I noticed in this test when I compared
{{ICUWhitespaceTokenizer}}'s output with {{UnicodeWhitespaceTokenizer}}'s:
they don't always find the same break points.
This is because although both forcibly break at the max token length (255
chars, fixed for {{CharTokenizer}} and the default for Lucene's JFlex
scanners), {{CharTokenizer}} allows tokens to exceed its max token char length
of 255 by one char when a surrogate pair would otherwise be broken, while
Lucene's JFlex scanners break at 254 chars in this case.
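The two truncation behaviors can be illustrated with a small standalone sketch (this is not Lucene's actual code, just the two policies described above): with a max token length of 255 and a surrogate pair straddling the boundary, a {{CharTokenizer}}-style cut lets the token grow to 256 chars rather than split the pair, while a scanner that refuses to emit a dangling high surrogate stops at 254.

{code:java}
public class SurrogateTruncation {
    /** Truncate to at most maxLen chars, but allow one extra char so a
     *  trailing surrogate pair stays whole (CharTokenizer-style behavior). */
    static String truncateKeepPair(String s, int maxLen) {
        if (s.length() <= maxLen) return s;
        if (Character.isHighSurrogate(s.charAt(maxLen - 1))
                && Character.isLowSurrogate(s.charAt(maxLen))) {
            return s.substring(0, maxLen + 1);
        }
        return s.substring(0, maxLen);
    }

    /** Truncate to at most maxLen chars, backing up one char so a lone
     *  high surrogate is never emitted (JFlex-scanner-style behavior). */
    static String truncateBackUp(String s, int maxLen) {
        if (s.length() <= maxLen) return s;
        if (Character.isHighSurrogate(s.charAt(maxLen - 1))) {
            return s.substring(0, maxLen - 1);
        }
        return s.substring(0, maxLen);
    }

    public static void main(String[] args) {
        // 254 ASCII chars followed by one surrogate pair (two chars), so the
        // pair straddles the 255-char boundary.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 254; i++) sb.append('a');
        sb.appendCodePoint(0x1F600); // any supplementary code point
        String token = sb.toString();
        System.out.println(truncateKeepPair(token, 255).length()); // 256
        System.out.println(truncateBackUp(token, 255).length());   // 254
    }
}
{code}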
-----
Conclusion: for throughput over realistic ASCII data, the original
{{WhitespaceTokenizer}} performs best, followed by the JFlex-based tokenizer in
this patch ({{UnicodeWhitespaceTokenizer}}), followed by the ICU-based
{{ICUWhitespaceTokenizer}} in Uwe's patch.
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)