[
https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033837#comment-14033837
]
Steve Rowe commented on LUCENE-5770:
------------------------------------
To assess the relative performance of the modified StandardTokenizerImpl, I ran
luceneutil's {{TestAnalyzerPerf}} (the history of its 4.x results is shown
here:
http://people.apache.org/~mikemccand/lucenebench/analyzers.html).
Here are the raw results of ten runs (after a run to populate the OS filesystem
cache) on Linux with Oracle 1.7.0_60, against unmodified trunk, using
{{enwiki-20130102-lines.txt}}:
{noformat}
Standard time=48581.34 msec hash=-16468587987622665 tokens=203498795
Standard time=48103.02 msec hash=-16468587987622665 tokens=203498795
Standard time=44514.19 msec hash=-16468587987622665 tokens=203498795
Standard time=48997.35 msec hash=-16468587987622665 tokens=203498795
Standard time=47794.26 msec hash=-16468587987622665 tokens=203498795
Standard time=48973.45 msec hash=-16468587987622665 tokens=203498795
Standard time=52409.88 msec hash=-16468587987622665 tokens=203498795
Standard time=49674.48 msec hash=-16468587987622665 tokens=203498795
Standard time=48257.42 msec hash=-16468587987622665 tokens=203498795
Standard time=48075.62 msec hash=-16468587987622665 tokens=203498795
Mean time=48538.10 msec
Mean toks/sec=4,192,557
{noformat}
and the patched results:
{noformat}
Standard time=49561.77 msec hash=-16468594357435165 tokens=203498791
Standard time=49465.50 msec hash=-16468594357435165 tokens=203498791
Standard time=50194.16 msec hash=-16468594357435165 tokens=203498791
Standard time=48548.19 msec hash=-16468594357435165 tokens=203498791
Standard time=49449.01 msec hash=-16468594357435165 tokens=203498791
Standard time=52377.06 msec hash=-16468594357435165 tokens=203498791
Standard time=52433.60 msec hash=-16468594357435165 tokens=203498791
Standard time=50495.17 msec hash=-16468594357435165 tokens=203498791
Standard time=46098.29 msec hash=-16468594357435165 tokens=203498791
Standard time=48078.95 msec hash=-16468594357435165 tokens=203498791
Mean time=49670.17 msec
Mean toks/sec=4,097,002
{noformat}
Comparing the mean throughput numbers, the patched version is ~2.3% slower.
Comparing the highest throughput numbers, the patched version is ~3.5% slower.
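For anyone who wants to re-derive the summary figures, here is a minimal sketch of the arithmetic, assuming the per-run times and token counts are exactly those listed above. The class and method names are made up for illustration and are not part of luceneutil.

```java
import java.util.Arrays;

// Hypothetical helper, not part of luceneutil: recomputes the summary
// numbers from the per-run timings quoted in this comment.
final class ThroughputComparison {
    // Per-run wall-clock times in msec, copied from the two listings.
    static final double[] TRUNK_MSEC = {
        48581.34, 48103.02, 44514.19, 48997.35, 47794.26,
        48973.45, 52409.88, 49674.48, 48257.42, 48075.62
    };
    static final double[] PATCHED_MSEC = {
        49561.77, 49465.50, 50194.16, 48548.19, 49449.01,
        52377.06, 52433.60, 50495.17, 46098.29, 48078.95
    };
    static final long TRUNK_TOKENS = 203498795L;
    static final long PATCHED_TOKENS = 203498791L;

    static double meanMsec(double[] runs) {
        return Arrays.stream(runs).average().getAsDouble();
    }

    static double minMsec(double[] runs) {
        return Arrays.stream(runs).min().getAsDouble();
    }

    // Tokens per second for a run that took the given number of msec.
    static double toksPerSec(long tokens, double msec) {
        return tokens / (msec / 1000.0);
    }

    // How much lower the candidate throughput is, as a percentage of baseline.
    static double slowdownPercent(double baseline, double candidate) {
        return 100.0 * (baseline - candidate) / baseline;
    }

    public static void main(String[] args) {
        double trunkMean = toksPerSec(TRUNK_TOKENS, meanMsec(TRUNK_MSEC));
        double patchedMean = toksPerSec(PATCHED_TOKENS, meanMsec(PATCHED_MSEC));
        double trunkBest = toksPerSec(TRUNK_TOKENS, minMsec(TRUNK_MSEC));
        double patchedBest = toksPerSec(PATCHED_TOKENS, minMsec(PATCHED_MSEC));
        System.out.printf("mean slowdown: %.2f%%%n", slowdownPercent(trunkMean, patchedMean));
        System.out.printf("best-run slowdown: %.2f%%%n", slowdownPercent(trunkBest, patchedBest));
    }
}
```

The "highest throughput" comparison uses the fastest (minimum-time) run from each listing.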
I believe the reason for the relative slowdown is the use of Java's codepoint
APIs ({{Character.codePointAt()}}, {{.charCount()}}, etc.) over the input
{{char[]}} buffer. I think this is an acceptable reduction in performance in
exchange for the more easily maintainable single-source specifications.
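To illustrate where the extra per-character work could come from, here is a rough sketch contrasting a plain {{char}}-indexed loop with a code point loop over a {{char[]}} buffer. It is not the JFlex-generated scanner code; the class and method names are hypothetical, and counting letters merely stands in for the real per-character state transitions.

```java
// Hypothetical illustration only; this is not the generated scanner.
final class CodePointScan {
    // Old style: one array read per UTF-16 code unit. Surrogate halves are
    // seen individually, so supplementary letters are not recognized.
    static int countLettersByChar(char[] buffer, int limit) {
        int letters = 0;
        for (int i = 0; i < limit; i++) {
            if (Character.isLetter(buffer[i])) {
                letters++;
            }
        }
        return letters;
    }

    // New style: decode a full code point at each step, then advance by one
    // or two chars. The decode and the charCount() call are the added cost.
    static int countLettersByCodePoint(char[] buffer, int limit) {
        int letters = 0;
        for (int i = 0; i < limit; ) {
            int cp = Character.codePointAt(buffer, i, limit);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);
        }
        return letters;
    }

    public static void main(String[] args) {
        // 'a', U+1D400 MATHEMATICAL BOLD CAPITAL A (a surrogate pair), 'b'
        char[] buffer = "a\uD835\uDC00b".toCharArray();
        System.out.println(countLettersByChar(buffer, buffer.length));
        System.out.println(countLettersByCodePoint(buffer, buffer.length));
    }
}
```

The code point loop does more work per step, but it sees U+1D400 as a single letter, which is what lets the specification avoid hand-built surrogate pair ranges.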
The number of tokens, and the hash (calculated over the token text and their
positions and offsets) differ slightly - I tracked this down to an unrelated
change I made to the specification: I changed the {{ComplexContext}} rule, a
specialization for Southeast Asian scripts, to include following {{WB:Format}}
and/or {{WB:Extend}} characters, as is done with most other rules in the
specification, following the [UAX#29 WB4
rule|http://www.unicode.org/reports/tr29/#WB4]. All tokenization differences
are caused by the original specification triggering breaks at U+200C ZERO WIDTH
NON-JOINER, which is a {{WB:Extend}} character, after and between Myanmar
characters. When I reverted changes to that rule in the patched version, the
same hash and number of tokens were produced as in the original unpatched version.
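The effect of that rule change can be mimicked with a toy run-splitter. This is only a sketch under loud assumptions: it is not the JFlex specification, it treats "Myanmar script" as a stand-in for {{ComplexContext}}, and it approximates {{WB:Extend}}/{{WB:Format}} with Java general categories (marks and Cf), which is not the same set as the UAX#29 properties. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; the real rule lives in the JFlex grammar.
final class ComplexContextSketch {
    static boolean isMyanmar(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.MYANMAR;
    }

    // ASSUMPTION: general categories Mn/Me/Mc/Cf approximate WB:Extend and
    // WB:Format. U+200C ZERO WIDTH NON-JOINER has general category Cf.
    static boolean isExtendOrFormatApprox(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.FORMAT;
    }

    // Splits text into runs of Myanmar characters. When absorbExtendFormat is
    // true, Extend/Format characters following a run stay inside the token.
    static List<String> tokenize(String text, boolean absorbExtendFormat) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean inRun = isMyanmar(cp)
                || (absorbExtendFormat && current.length() > 0 && isExtendOrFormatApprox(cp));
            if (inRun) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String between = "\u1000\u200C\u1001"; // KA, ZWNJ, KHA
        System.out.println(tokenize(between, false).size()); // old behavior
        System.out.println(tokenize(between, true).size());  // new behavior
    }
}
```

In this model a ZWNJ between two Myanmar letters changes the token count, while a trailing ZWNJ changes only the token text and end offset. Both would alter a hash computed over token text and offsets.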
> Upgrade JFlex to 1.6.0
> ----------------------
>
> Key: LUCENE-5770
> URL: https://issues.apache.org/jira/browse/LUCENE-5770
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Priority: Minor
> Fix For: 5.0, 4.10
>
> Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex
> scanner specifications to handle supplementary code points.