[ 
https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033854#comment-14033854
 ] 

Steve Rowe commented on LUCENE-5770:
------------------------------------

I was curious whether Java 8 would reduce or eliminate the penalty for using the 
codepoint APIs, so I ran luceneutil's {{TestAnalyzerPerf}} under Oracle Java 
1.8.0_05, again on Linux.  The raw results against unpatched trunk:

{noformat}
Standard time=27605.51 msec hash=-16468587987622665 tokens=203498795
Standard time=27767.41 msec hash=-16468587987622665 tokens=203498795
Standard time=29705.14 msec hash=-16468587987622665 tokens=203498795
Standard time=30312.27 msec hash=-16468587987622665 tokens=203498795
Standard time=28091.85 msec hash=-16468587987622665 tokens=203498795
Standard time=29408.59 msec hash=-16468587987622665 tokens=203498795
Standard time=28107.20 msec hash=-16468587987622665 tokens=203498795
Standard time=28228.80 msec hash=-16468587987622665 tokens=203498795
Standard time=28487.87 msec hash=-16468587987622665 tokens=203498795
Standard time=31785.43 msec hash=-16468587987622665 tokens=203498795
    Mean time=28950.01 msec
Mean toks/sec=7,029,316
{noformat}

And against the patched version (I left the {{ComplexContext}} rule reverted, 
so the hashes and token counts match):

{noformat}
Standard time=31967.65 msec hash=-16468587987622665 tokens=203498795
Standard time=29123.18 msec hash=-16468587987622665 tokens=203498795
Standard time=28408.14 msec hash=-16468587987622665 tokens=203498795
Standard time=29412.19 msec hash=-16468587987622665 tokens=203498795
Standard time=30255.32 msec hash=-16468587987622665 tokens=203498795
Standard time=31915.55 msec hash=-16468587987622665 tokens=203498795
Standard time=30301.20 msec hash=-16468587987622665 tokens=203498795
Standard time=32921.60 msec hash=-16468587987622665 tokens=203498795
Standard time=28528.48 msec hash=-16468587987622665 tokens=203498795
Standard time=30649.49 msec hash=-16468587987622665 tokens=203498795
    Mean time=30348.28 msec
Mean toks/sec=6,705,447
{noformat}

Comparing the mean throughput numbers, the patched version is ~4.6% slower.

Comparing the best runs (27605.51 msec unpatched vs. 28408.14 msec patched), 
the patched version is ~2.8% slower.
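
The mean-throughput comparison above can be re-derived from the per-run timings. The following is only a sketch: {{ThroughputComparison}} and its methods are hypothetical names, not part of luceneutil, and it assumes that mean toks/sec is the token count divided by the mean elapsed time in seconds (which is consistent with the summary lines reported above).

```java
import java.util.Arrays;

// Hypothetical helper, not part of luceneutil: recomputes the summary
// figures from the per-run timings listed in the two result blocks.
class ThroughputComparison {
    static final long TOKENS = 203498795L;

    // Per-run times in msec, copied from the unpatched-trunk results.
    static final double[] UNPATCHED_MS = {
        27605.51, 27767.41, 29705.14, 30312.27, 28091.85,
        29408.59, 28107.20, 28228.80, 28487.87, 31785.43
    };

    // Per-run times in msec, copied from the patched results.
    static final double[] PATCHED_MS = {
        31967.65, 29123.18, 28408.14, 29412.19, 30255.32,
        31915.55, 30301.20, 32921.60, 28528.48, 30649.49
    };

    static double meanMillis(double[] runs) {
        return Arrays.stream(runs).average().getAsDouble();
    }

    // Assumption: throughput is tokens divided by elapsed seconds.
    static double tokensPerSecond(long tokens, double millis) {
        return tokens / (millis / 1000.0);
    }

    // How much lower the candidate throughput is, as a percentage of the baseline.
    static double percentSlower(double baseline, double candidate) {
        return (1.0 - candidate / baseline) * 100.0;
    }

    public static void main(String[] args) {
        double base = tokensPerSecond(TOKENS, meanMillis(UNPATCHED_MS));
        double patched = tokensPerSecond(TOKENS, meanMillis(PATCHED_MS));
        System.out.printf("unpatched mean toks/sec=%,.0f%n", base);
        System.out.printf("patched   mean toks/sec=%,.0f%n", patched);
        System.out.printf("patched is %.1f%% slower on average%n",
            percentSlower(base, patched));
    }
}
```

With ten runs per configuration and several seconds of spread between runs, a difference of this size is best read as approximate.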

But the huge result here is that StandardAnalyzer is *way* faster under Java 8 
on Linux than under Java 7: 68% better throughput on average for the unpatched 
version.  I haven't run the benchmark on other platforms, but I did run a 
throughput test over 20k Reuters docs with lucene/benchmark on Windows 7 using 
Oracle Java 1.8.0_05, and it was actually somewhat slower, so it's clearly not 
the case that there are speedups to be had everywhere by upgrading to Java 8.

I did one run of {{TestAnalyzerPerf}} on Linux over all of the analyzers it 
benchmarks, instead of just over {{StandardAnalyzer}}, and each analyzer shows 
serious improvements on Java 8 over Java 7:

Oracle Java 1.7.0_60:

{noformat}
Standard time=48173.65 msec hash=-16468587987622665 tokens=203498795
LowerCase time=42118.84 msec hash=-4828213998132430 tokens=184607939
EdgeNGrams time=51357.61 msec hash=1432428577478099 tokens=504918366
Shingles time=67035.01 msec hash=-21741319767311116 tokens=369115878
WordDelimiterFilter time=50846.88 msec hash=-18262747001660775 tokens=219918096
{noformat}

Oracle Java 1.8.0_05:

{noformat}
(63% faster) Standard time=29627.88 msec hash=-16468587987622665 tokens=203498795
(86% faster) LowerCase time=22692.98 msec hash=-4828213998132430 tokens=184607939
(30% faster) EdgeNGrams time=39463.84 msec hash=1432428577478099 tokens=504918366
(31% faster) Shingles time=51205.34 msec hash=-21741319767311116 tokens=369115878
(44% faster) WordDelimiterFilter time=35398.86 msec hash=-18262749927564663 tokens=219918098
{noformat}
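
For anyone checking the "N% faster" figures: they correspond to the throughput ratio between the two runs, that is (Java 7 time / Java 8 time - 1) * 100, rounded to the nearest integer. A small sketch, where {{SpeedupTable}} and {{percentFaster}} are names made up for this illustration and the times are copied from the two blocks above:

```java
// Hypothetical helper for illustration: derives the "N% faster" labels
// from the Java 7 and Java 8 timings of the single all-analyzers run.
class SpeedupTable {
    static final String[] ANALYZERS = {
        "Standard", "LowerCase", "EdgeNGrams", "Shingles", "WordDelimiterFilter"
    };
    // Times in msec under Oracle Java 1.7.0_60.
    static final double[] JAVA7_MS = {
        48173.65, 42118.84, 51357.61, 67035.01, 50846.88
    };
    // Times in msec under Oracle Java 1.8.0_05.
    static final double[] JAVA8_MS = {
        29627.88, 22692.98, 39463.84, 51205.34, 35398.86
    };

    // Throughput gain in percent: the same work done in less time.
    static long percentFaster(double oldMillis, double newMillis) {
        return Math.round((oldMillis / newMillis - 1.0) * 100.0);
    }

    public static void main(String[] args) {
        for (int i = 0; i < ANALYZERS.length; i++) {
            System.out.printf("(%d%% faster) %s%n",
                percentFaster(JAVA7_MS[i], JAVA8_MS[i]), ANALYZERS[i]);
        }
    }
}
```

Note that this measures the gain in throughput, not the reduction in elapsed time; the same Standard result expressed as a time reduction would be a smaller percentage.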


> Upgrade JFlex to 1.6.0
> ----------------------
>
>                 Key: LUCENE-5770
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5770
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary 
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex 
> scanner specifications to handle supplementary code points.  
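
As background for the BMP vs. supplementary distinction in the issue description: Java strings are UTF-16, so a supplementary code point (above U+FFFF) occupies two {{char}} values, a surrogate pair. The sketch below uses only standard {{java.lang.Character}} and {{java.lang.String}} methods; the class name and the sample string are made up for illustration, and nothing here is taken from the JFlex-generated scanner itself.

```java
// Illustration only: shows how the JDK code point APIs see a supplementary
// character, compared with indexing the string char by char.
class SupplementaryCodePointDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so it is stored
    // as the surrogate pair D834 DD1E between the two ASCII letters.
    static final String SAMPLE = "a\uD834\uDD1Eb";

    // Walks a UTF-16 string one code point at a time rather than one char at a time.
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int i = 0;
        int n = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + SAMPLE.length());
        System.out.println("code points: " + toCodePoints(SAMPLE).length);
        for (int cp : toCodePoints(SAMPLE)) {
            System.out.printf("U+%04X supplementary=%b%n",
                cp, Character.isSupplementaryCodePoint(cp));
        }
    }
}
```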



--
This message was sent by Atlassian JIRA
(v6.2#6252)
