toCharArray()

Ulf Zibis Thu, 28 Apr 2011 06:13:43 -0700

Following the comments in 6795537, I additionally assume that
    else if (b1 < (byte)0xc2)
should be a little faster than
    else if ((b1 >> 5) == -2)
and that
    if (isMalformed2(b1, b2))
could be replaced by
    if (isNotContinuation(b2))
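
To make the two suggestions easier to check, here is a minimal sketch that enumerates the byte values involved. The helper bodies are an assumption modelled on sun.nio.cs.UTF_8 and are not quoted from this thread; the class and method names are hypothetical. Note that the two lead-byte tests select different ranges (0x80..0xC1 versus 0xC0..0xDF), so the comparison presumably implies rejecting the invalid range first; that reading is also an assumption.

```java
final class LeadByteCheck {
    // Assumed helper definitions, modelled on sun.nio.cs.UTF_8 (not taken from this thread).
    static boolean isNotContinuation(int b) {
        return (b & 0xc0) != 0x80;
    }

    static boolean isMalformed2(int b1, int b2) {
        return (b1 & 0x1e) == 0x0 || (b2 & 0xc0) != 0x80;
    }

    // Counts the negative byte values accepted by (b1 >> 5) == -2, i.e. 0xC0..0xDF.
    static int countShiftTest() {
        int n = 0;
        for (int i = 0x80; i <= 0xff; i++) {
            byte b1 = (byte) i;
            if ((b1 >> 5) == -2) n++;
        }
        return n;
    }

    // Counts the negative byte values accepted by b1 < (byte)0xc2, i.e. 0x80..0xC1.
    static int countCompareTest() {
        int n = 0;
        for (int i = 0x80; i <= 0xff; i++) {
            byte b1 = (byte) i;
            if (b1 < (byte) 0xc2) n++;
        }
        return n;
    }

    // True if, for every lead byte 0xC2..0xDF, isMalformed2 and
    // isNotContinuation give the same answer for every possible second byte.
    static boolean helpersAgreeForValidLead() {
        for (int i = 0xc2; i <= 0xdf; i++) {
            byte b1 = (byte) i;
            for (int j = 0; j <= 0xff; j++) {
                byte b2 = (byte) j;
                if (isMalformed2(b1, b2) != isNotContinuation(b2)) return false;
            }
        }
        return true;
    }
}
```

Under these assumed definitions, the overlong check inside isMalformed2 only fires for lead bytes 0xC0 and 0xC1, so once those are excluded by an earlier branch the continuation check alone should suffice.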



-Ulf


On 28.04.2011 14:44, Ulf Zibis wrote:

Interesting results!

Some days ago we had the discussion about constants for standard Charsets.
Looking at your results, I see that with *charset name constants* the conversion mostly performs a little better (up to 25%) than with *charset constants*.
So again my question: Why do we need those charset constants?
IMO, what we need more are de/encoder constants and an array-based API for the Charset class.
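
For reference, the two call forms being compared can be sketched as follows. This is only an illustration of the public String API; the class and method names are hypothetical, and it says nothing about the relative timings reported in the benchmark.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

final class CharsetForms {
    // Encodes the same string via a charset name and via a Charset object
    // and reports whether both forms produce identical bytes.
    static boolean sameBytes(String s) {
        try {
            byte[] byName = s.getBytes("UTF-8");                     // charset name
            byte[] byCharset = s.getBytes(Charset.forName("UTF-8")); // Charset object
            return Arrays.equals(byName, byCharset);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }
}
```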
In malformed(byte[] src, int sp, int nb), I think you could cache the ByteBuffer bb instead of instantiating a new one every time. For this, the method should not be static, to ensure thread-safety.
As you are there, did you refer to:
6795537 - UTF_8$Decoder returns wrong results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
6798514 - Charset UTF-8 accepts CESU-8 codings <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6798514>
-Ulf



On 28.04.2011 08:34, Xueming Shen wrote:
Hi,

This is motivated by Neil's request to optimize the common-case UTF8 path for native ZipFile.getEntry calls [1]. As I said in my reply [2], I believe a better approach might be to "patch" the UTF8 charset directly to implement the sun.nio.cs.ArrayDecoder/Encoder interface, to speed up array-based encoding/decoding under certain circumstances, as we did for all single-byte charsets in #6636323 [3]. I have an old blog [4] that has some data for this optimization.

The original plan was to do the same thing for our new UTF8 [5] as well in JDK7, but then (excuse, excuse) I was just too busy to come back to this topic until 2 days ago. After two days of small tweaking here and there and testing the corner cases I can think of, I'm happy with the result and think it might be worth sending it out for a codereview for JDK7, knowing we only have a couple of days left.

The webrev is at

http://cr.openjdk.java.net/~sherman/7040220/webrev

Those tests are supposed to make sure the coding result from the new paths for String.getBytes()/toCharArray() matches the result from the existing implementation.
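
The kind of consistency check described here could look roughly like the sketch below. It is not the test from the webrev; the class and method names are hypothetical, and it only compares the String-based paths against a plain CharsetEncoder/CharsetDecoder for well-formed input (the coder's default action is to report malformed input, whereas the String methods replace it, so ill-formed data would need separate handling).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

final class Utf8PathCheck {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Compares String.getBytes(Charset) with an explicit CharsetEncoder.
    static boolean encodeMatches(String s) {
        byte[] viaString = s.getBytes(UTF8);
        try {
            ByteBuffer bb = UTF8.newEncoder().encode(CharBuffer.wrap(s));
            byte[] viaEncoder = new byte[bb.remaining()];
            bb.get(viaEncoder);
            return Arrays.equals(viaString, viaEncoder);
        } catch (CharacterCodingException e) {
            return false; // the encoder rejected input that the String path accepted
        }
    }

    // Compares new String(bytes, Charset) with an explicit CharsetDecoder.
    static boolean decodeMatches(byte[] bytes) {
        char[] viaString = new String(bytes, UTF8).toCharArray();
        try {
            CharBuffer cb = UTF8.newDecoder().decode(ByteBuffer.wrap(bytes));
            char[] viaDecoder = new char[cb.remaining()];
            cb.get(viaDecoder);
            return Arrays.equals(viaString, viaDecoder);
        } catch (CharacterCodingException e) {
            return false; // the decoder rejected input that the String path accepted
        }
    }
}
```

A fuller test would presumably run such checks over 1-, 2-, 3- and 4-byte sequences and over malformed input as well.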

The performance results of running StrCodingBenchmarkUTF8 (included in the webrev) on my Linux box in -client and -server mode respectively are included at

http://cr.openjdk.java.net/~sherman/7040220/client
http://cr.openjdk.java.net/~sherman/7040220/server

The microbenchmark measures 1-byte, 2-byte, 3-byte and 4-byte UTF-8 sequences separately, with different lengths of data (from 12 bytes to thousands).

Thanks!
-Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-April/006710.html
[2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-April/006726.html
[3] http://cr.openjdk.java.net/~sherman/6636323_6636319/webrev
[4] http://blogs.sun.com/xuemingshen/entry/faster_new_string_bytes_cs
[5] http://blogs.sun.com/xuemingshen/entry/the_big_overhaul_of_java

Re: Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()
