[
https://issues.apache.org/jira/browse/LUCENE-4889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616137#comment-13616137
]
Dawid Weiss commented on LUCENE-4889:
-------------------------------------
http://i.stack.imgur.com/jiFfM.jpg [Double facepalm]
It couldn't be right, it was surreal. I checked the code again and indeed,
there was a subtle bug in the Java code -- it was running the decode loop
repeatedly, all right, but it never rewound the input buffer after the first
pass, damn it. Corrected, it shows sensible output:
{code}
implementation  dataType     ms  linear runtime
[current lucene]
LUCENE          UNICODE   167.3  ==========
LUCENE          ASCII     333.9  =====================
[patch]
LUCENE_MOD1     UNICODE   103.0  ======
LUCENE_MOD1     ASCII      77.2  ====
[if-based version but without assertions]
NOLOOKUP_IF     UNICODE    90.2  =====
NOLOOKUP_IF     ASCII      29.1  =
[java version]
JAVA            UNICODE   465.6  ==============================
JAVA            ASCII     103.1  ======
[no branching/counting, just pass over data]
NO_COUNT        UNICODE    52.1  ===
NO_COUNT        ASCII      26.0  =
{code}
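For anyone wondering what the rewind bug looks like, here is a minimal sketch. It is not the benchmark code itself: the class name, method name, and the `rewind` flag are made up for illustration, and it assumes well-formed UTF-8 input. Without `rewind()`, every pass after the first sees an exhausted buffer and returns almost immediately, which is what made the earlier Java numbers look absurdly fast.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of Lucene or the benchmark.
final class RewindSketch {
    /**
     * Decodes the whole input `rounds` times with the JDK's UTF-8 decoder and
     * returns the total number of code points seen. Assumes well-formed UTF-8.
     */
    static long countCodePoints(ByteBuffer input, int rounds, boolean rewind) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer chars = CharBuffer.allocate(1024); // temporary output buffer
        long total = 0;
        for (int round = 0; round < rounds; round++) {
            if (rewind) {
                // The missing step: without it, position == limit after the
                // first round and later rounds decode nothing.
                input.rewind();
            }
            decoder.reset();
            CoderResult result;
            do {
                chars.clear();
                result = decoder.decode(input, chars, true);
                chars.flip();
                // Each code point contributes exactly one char that is not a
                // low surrogate, so this count is safe across buffer refills.
                while (chars.hasRemaining()) {
                    if (!Character.isLowSurrogate(chars.get())) {
                        total++;
                    }
                }
            } while (result.isOverflow());
            chars.clear();
            decoder.flush(chars); // no output expected for UTF-8
        }
        return total;
    }
}
```

With a 7-codepoint input and 3 rounds, the rewinding variant counts 21 code points while the buggy one counts only 7.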
I also did a non-benchmark loop in which everything is just counted 10 times.
{code}
 time     codepoints   data set  version
1.676 <= [ 500000010]  UNICODE   LUCENE
0.905 <= [ 500000010]  UNICODE   LUCENE_MOD1
0.905 <= [ 500000010]  UNICODE   NOLOOKUP_IF
4.686 <= [ 500000010]  UNICODE   JAVA
3.339 <= [1000000000]  ASCII     LUCENE
1.028 <= [1000000000]  ASCII     LUCENE_MOD1
1.027 <= [1000000000]  ASCII     NOLOOKUP_IF
1.591 <= [1000000000]  ASCII     JAVA
{code}
I'll commit the patch since it improves both the speed and the internal
validation logic.
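To make "a sequence of if's" concrete, here is a rough sketch of counting code points by branching on the lead byte instead of indexing a lookup table. This is an illustration only and not necessarily what the attached patch or the NOLOOKUP_IF variant in the benchmark does; the class and method names are hypothetical. It rejects bad lead bytes and truncated input but does not validate continuation bytes.

```java
// Hypothetical illustration class, not the code from LUCENE-4889.patch.
final class Utf8CountSketch {
    /**
     * Counts code points in utf8[offset, offset + length) by looking only at
     * the lead byte of each sequence. Continuation bytes are skipped without
     * being checked.
     */
    static int codePointCount(byte[] utf8, int offset, int length) {
        final int end = offset + length;
        int count = 0;
        int i = offset;
        while (i < end) {
            int b = utf8[i] & 0xFF;
            if (b < 0x80) {
                i += 1; // 0xxxxxxx: ASCII
            } else if (b < 0xC0) {
                throw new IllegalArgumentException("unexpected continuation byte at " + i);
            } else if (b < 0xE0) {
                i += 2; // 110xxxxx: two-byte sequence
            } else if (b < 0xF0) {
                i += 3; // 1110xxxx: three-byte sequence
            } else if (b < 0xF8) {
                i += 4; // 11110xxx: four-byte sequence
            } else {
                throw new IllegalArgumentException("invalid lead byte at " + i);
            }
            count++;
        }
        if (i != end) {
            // The last sequence claimed more bytes than the input contains.
            throw new IllegalArgumentException("truncated sequence at end of input");
        }
        return count;
    }
}
```

On ASCII input only the first branch is ever taken, which is well predicted and involves no array access beyond the input itself; that is a plausible reason for the large ASCII gap in the numbers above.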
> UnicodeUtil.codePointCount microbenchmarks (wtf)
> ------------------------------------------------
>
> Key: LUCENE-4889
> URL: https://issues.apache.org/jira/browse/LUCENE-4889
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Trivial
> Fix For: 5.0
>
> Attachments: LUCENE-4889.patch
>
>
> This is interesting. I posted a link to a state-machine-based UTF8
> parser/recognizer:
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
> I spent some time thinking about whether the lookup table could be converted
> into a stateless computational function, which would avoid a table lookup
> (which in Java will cause an additional bounds check that will be hard to
> eliminate, I think). This didn't turn out to be easy (it boils down to finding a simple
> function that would map a set of integers to its concrete permutation; a
> generalization of minimal perfect hashing).
> But out of curiosity I thought it'd be fun to see how Lucene's codepoint
> counting compares to Java's built-in one (Decoder) and a sequence of if's.
> I've put together a Caliper benchmark that processes 50 million Unicode
> codepoints on two data sets: one ASCII-only, one Unicode. The results are
> interesting. On my win/I7:
> {code}
> implementation dataType ns linear runtime
> LUCENE UNICODE 167359502.6 ===============
> LUCENE ASCII 334015746.5 ==============================
> NOLOOKUP_SWITCH UNICODE 154294141.8 =============
> NOLOOKUP_SWITCH ASCII 119500892.8 ==========
> NOLOOKUP_IF UNICODE 90149072.6 ========
> NOLOOKUP_IF ASCII 29151411.4 ==
> {code}
> Disregard the switch lookup -- it's for fun only. But a sequence of if's is
> significantly faster than Lucene's current table lookup, especially on
> ASCII input. And now compare this to Java's built-in decoder...
> {code}
> JAVA UNICODE 5753930.1 =
> JAVA ASCII 23.8 =
> {code}
> Yes, it's the same benchmark. Wtf? I realize buffers are partially native and
> probably so is the UTF-8 decoder, but by so much?! Again, to put it in context:
> {code}
> implementation dataType ns linear runtime
> LUCENE UNICODE 167359502.6 ===============
> LUCENE ASCII 334015746.5 ==============================
> JAVA UNICODE 5753930.1 =
> JAVA ASCII 23.8 =
> NOLOOKUP_IF UNICODE 90149072.6 ========
> NOLOOKUP_IF ASCII 29151411.4 ==
> NOLOOKUP_SWITCH UNICODE 154294141.8 =============
> NOLOOKUP_SWITCH ASCII 119500892.8 ==========
> {code}
> Wtf? The code is here if you want to experiment.
> https://github.com/dweiss/utf8dfa
> I realize the Java version needs to allocate a temporary buffer, but if
> these numbers hold for different VMs it may actually be worth it...