Re: [protobuf] Java UTF-8 encoding/decoding: possible performance improvements

David Yu Tue, 22 Dec 2009 20:41:46 -0800

On Wed, Dec 23, 2009 at 12:21 PM, Kenton Varda <ken...@google.com> wrote:


>
>
> On Tue, Dec 22, 2009 at 8:18 PM, David Yu <david.yu....@gmail.com> wrote:
>
>>
>>
>> On Wed, Dec 23, 2009 at 11:14 AM, Kenton Varda <ken...@google.com> wrote:
>>
>>> On Tue, Dec 22, 2009 at 7:06 PM, David Yu <david.yu....@gmail.com> wrote:
>>>
>>>> There should be a writeByteArray(int fieldNumber, byte[] value) in
>>>> CodedOutputStream so that the cached bytes of strings could
>>>> be written directly.  A ByteString would not help: it uses more memory
>>>> because it creates a copy of the byte array.
>>>>
>>>
>>> We could cache the bytes as a ByteString.  Converting a String to a
>>> ByteString does not require a redundant copy, as ByteString has methods for
>>> this.
>>>
>>> I think it would be better to do it this way because, in the long run, we
>>> actually want to extend ByteString to allow avoiding copies in some cases.
>>> For example, if you are serializing a message to a ByteString (you called
>>> toByteString()) or parsing from a ByteString, then handling "bytes" fields
>>> should not require any copy.  Instead, it should be possible to construct a
>>> ByteString which is a substring of some other ByteString in O(1) time, as
>>> well as concatenate ByteStrings in O(1) time.
>>>
>>> So this way, if the size-computation step converted the String to a
>>> ByteString and cached that, no further copy of the bytes would ever be
>>> needed in many cases.
>>>
>>
>> Cool.
>> Btw, the ByteString snippet is:
>>  return new ByteString(text.getBytes("UTF-8"));
>>
>> Another improvement would be to avoid the charset lookup by caching the
>> Charset.forName("UTF-8") object and reusing it.
>> I believe you Google guys have also been evangelizing this :-) (PDF from
>> http://code.google.com/p/guava-libraries/)
>>
>
> I tried doing that at one point and found that it was *much slower* --
> apparently String.getBytes("UTF-8") is highly-optimized, whereas creating a
> Charset object (even statically) and using that is not.  :/
>

Just checked the code ... and you're absolutely right.
java.lang.StringCoding (line 277) creates an unnecessary copy of the char
array, which makes it slow.  I'm not sure, but it might be a Sun JDK 6 bug.

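For reference, a minimal sketch of the two encoding paths being compared.
The class and method names are made up for illustration; only
String.getBytes(String), String.getBytes(Charset) and Charset.forName are
real JDK calls.  It only shows that both paths produce the same bytes; it
does not measure their relative speed.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class, not part of protobuf.
final class Utf8EncodeSketch {
    // Looked up once and reused, as suggested earlier in the thread.
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Path 1: encode by charset name (the path reported as faster on JDK 6).
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // Path 2: encode with the cached Charset object.
    static byte[] encodeByCharset(String text) {
        return text.getBytes(UTF_8);
    }

    // True if both paths yield identical bytes for the given text.
    static boolean sameBytes(String text) {
        return Arrays.equals(encodeByName(text), encodeByCharset(text));
    }
}
```

Which path is faster depends on the JDK version, so it is worth timing both
on the target runtime before picking one.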

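The caching idea from earlier in the thread could look roughly like this.
CachedUtf8Field is a hypothetical holder, not a protobuf class, and it
writes only the raw payload (no tag or length prefix) to a plain
ByteArrayOutputStream instead of a CodedOutputStream.  The point is just
that the String is encoded once and the same bytes serve both the
size-computation step and the write step.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical holder for a string field; not part of protobuf.
final class CachedUtf8Field {
    private final String value;
    private byte[] cachedBytes; // encoded lazily, then reused

    CachedUtf8Field(String value) {
        this.value = value;
    }

    // Encodes the string on first use and returns the cached array afterwards.
    private byte[] bytes() {
        if (cachedBytes == null) {
            try {
                cachedBytes = value.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                // Every Java platform is required to support UTF-8.
                throw new IllegalStateException("UTF-8 not supported", e);
            }
        }
        return cachedBytes;
    }

    // Size-computation step: payload length only, without tag or length prefix.
    int payloadSize() {
        return bytes().length;
    }

    // Write step: reuses the bytes produced during size computation.
    void writePayloadTo(ByteArrayOutputStream out) {
        byte[] b = bytes();
        out.write(b, 0, b.length);
    }
}
```

This sketch is not thread-safe; a real implementation would need to decide
whether a racy double encode is acceptable or whether the cache needs
synchronization.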

-- 
When the cat is away, the mouse is alone.
- David Yu
