[ https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033837#comment-14033837 ]

Steve Rowe commented on LUCENE-5770:
------------------------------------

To assess the relative performance of the modified StandardTokenizerImpl, I ran 
luceneutil's {{TestAnalyzerPerf}} (the result history for the 4.x version is 
charted here: 
http://people.apache.org/~mikemccand/lucenebench/analyzers.html).

Here are the raw results of ten runs (after a warm-up run to populate the OS 
filesystem cache) on Linux with Oracle JDK 1.7.0_60, against unmodified trunk, 
using {{enwiki-20130102-lines.txt}}:

{noformat}
Standard time=48581.34 msec hash=-16468587987622665 tokens=203498795
Standard time=48103.02 msec hash=-16468587987622665 tokens=203498795
Standard time=44514.19 msec hash=-16468587987622665 tokens=203498795
Standard time=48997.35 msec hash=-16468587987622665 tokens=203498795
Standard time=47794.26 msec hash=-16468587987622665 tokens=203498795
Standard time=48973.45 msec hash=-16468587987622665 tokens=203498795
Standard time=52409.88 msec hash=-16468587987622665 tokens=203498795
Standard time=49674.48 msec hash=-16468587987622665 tokens=203498795
Standard time=48257.42 msec hash=-16468587987622665 tokens=203498795
Standard time=48075.62 msec hash=-16468587987622665 tokens=203498795
    Mean time=48538.10 msec
Mean toks/sec=4,192,557
{noformat}

and the patched results:

{noformat}
Standard time=49561.77 msec hash=-16468594357435165 tokens=203498791
Standard time=49465.50 msec hash=-16468594357435165 tokens=203498791
Standard time=50194.16 msec hash=-16468594357435165 tokens=203498791
Standard time=48548.19 msec hash=-16468594357435165 tokens=203498791
Standard time=49449.01 msec hash=-16468594357435165 tokens=203498791
Standard time=52377.06 msec hash=-16468594357435165 tokens=203498791
Standard time=52433.60 msec hash=-16468594357435165 tokens=203498791
Standard time=50495.17 msec hash=-16468594357435165 tokens=203498791
Standard time=46098.29 msec hash=-16468594357435165 tokens=203498791
Standard time=48078.95 msec hash=-16468594357435165 tokens=203498791
    Mean time=49670.17 msec
Mean toks/sec=4,097,002
{noformat}

Comparing the mean throughput numbers, the patched version is ~2.3% slower.

Comparing the highest throughput numbers (i.e. the fastest run in each set), the 
patched version is ~3.5% slower.
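
As a quick arithmetic check (a standalone sketch, not part of luceneutil - it just recomputes throughput as tokens divided by mean time from the numbers above):

```java
// Recomputes the mean throughput figures and the relative slowdown
// from the raw numbers reported above.
public class SlowdownCheck {
    static double toksPerSec(long tokens, double msec) {
        return tokens / (msec / 1000.0);
    }

    public static void main(String[] args) {
        double base = toksPerSec(203498795L, 48538.10);    // unmodified mean
        double patched = toksPerSec(203498791L, 49670.17); // patched mean
        System.out.printf("base=%d patched=%d%n", (long) base, (long) patched);
        // prints base=4192557 patched=4097002, matching the means above
        System.out.printf("slowdown=%.1f%%%n", 100.0 * (1 - patched / base));
        // prints slowdown=2.3%
    }
}
```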

I believe the relative slowdown is caused by the use of Java's code point 
APIs ({{Character.codePointAt()}}, {{Character.charCount()}}, etc.) over the 
input {{char[]}} buffer.  I think this is an acceptable performance cost in 
exchange for the more easily maintainable single-source specifications.
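
The extra work is visible in a minimal sketch of code-point-at-a-time scanning (this is illustrative only, not the actual generated scanner code - the real scanner's buffer and bounds handling are more involved):

```java
// Sketch: scanning a char[] buffer one code point at a time, the way the
// patched StandardTokenizerImpl must, so that supplementary characters
// (surrogate pairs) are read as single code points.
public class CodePointScan {
    // Counts code points between start (inclusive) and end (exclusive).
    // A BMP character advances the index by 1 char, a supplementary
    // character by 2 (its surrogate pair).
    static int countCodePoints(char[] buf, int start, int end) {
        int n = 0;
        for (int i = start; i < end; ) {
            int cp = Character.codePointAt(buf, i, end);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600 (a supplementary code point): 3 chars
        char[] buf = "a\uD83D\uDE00".toCharArray();
        System.out.println(buf.length);                          // prints 3
        System.out.println(countCodePoints(buf, 0, buf.length)); // prints 2
    }
}
```

The per-character method calls and the variable-width index advance are overhead that a BMP-only scanner, which can index {{char}} by {{char}}, avoids.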

The number of tokens, and the hash (calculated over the token text and each 
token's position and offsets), differ slightly between the two versions - I 
tracked this down to an unrelated change I made to the specification: I changed 
the {{ComplexContext}} rule, a specialization for Southeast Asian scripts, to 
include following {{WB:Format}} and/or {{WB:Extend}} characters, as most other 
rules in the specification already do, following the [UAX#29 WB4 
rule|http://www.unicode.org/reports/tr29/#WB4].  All tokenization differences 
are caused by the original specification triggering breaks at U+200C ZERO WIDTH 
NON-JOINER, which is a {{WB:Extend}} character, after and between Myanmar 
characters.  When I reverted the change to that rule in the patched version, it 
produced the same hash and number of tokens as the original unpatched version.
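
For context on WB4, here is a rough stand-in sketch (not the JFlex specification itself; note that Java's {{Character.getType()}} exposes the general category, not the UCD {{Word_Break}} property, so combining marks plus Cf only approximate {{WB:Extend}}/{{WB:Format}}):

```java
// Sketch of the class of characters that UAX#29 WB4 absorbs into the
// preceding character instead of breaking on. Approximation: the real
// Word_Break property comes from the Unicode Character Database, not
// from Character.getType().
public class Wb4Sketch {
    static boolean extendOrFormatLike(int cp) {
        int t = Character.getType(cp);
        return t == Character.NON_SPACING_MARK
            || t == Character.COMBINING_SPACING_MARK
            || t == Character.ENCLOSING_MARK
            || t == Character.FORMAT;
    }

    public static void main(String[] args) {
        // U+200C ZERO WIDTH NON-JOINER has general category Cf (Format);
        // its Word_Break property is Extend, so WB4 absorbs it rather
        // than breaking before or after it.
        System.out.println(extendOrFormatLike(0x200C)); // prints true
        // U+1000 MYANMAR LETTER KA is a base letter, not absorbed.
        System.out.println(extendOrFormatLike(0x1000)); // prints false
    }
}
```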

> Upgrade JFlex to 1.6.0
> ----------------------
>
>                 Key: LUCENE-5770
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5770
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary 
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex 
> scanner specifications to handle supplementary code points.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)
