Re: [protobuf] Java UTF-8 encoding/decoding: possible performance improvements

Evan Jones Wed, 23 Dec 2009 07:45:00 -0800

On Dec 22, 2009, at 19:59 , Kenton Varda wrote:
> I wonder if we can safely discard the cached byte array during  
> serialization on the assumption that most messages are serialized  
> only once?

This is a good idea, and it seems to me that this should definitely be  
possible. It would need to be done somewhat carefully, since Message  
objects are supposed to be thread safe, but I don't think this is  
particularly hard. With this change, each getSerializedSize()/writeTo()
pair would encode every string exactly once, whether it is the first
serialization of the message or a later one. The advantage would be
nearly no "permanent" extra memory overhead, beyond perhaps one extra
field for messages that contain strings. The disadvantage is that
strings are re-encoded every time the message is serialized.
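
A rough sketch of what I mean, not the actual generated code: the
class StringHolder, its single string field, and the ByteBuffer output
are stand-ins I made up for illustration, and the length prefix that
the real wire format writes is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a generated message with one string field.
final class StringHolder {
    private final String value;

    // Encoded form, kept only between getSerializedSize() and writeTo().
    // volatile so another thread never sees a half-published reference.
    private volatile byte[] cachedBytes;

    StringHolder(String value) {
        this.value = value;
    }

    int getSerializedSize() {
        byte[] bytes = cachedBytes;
        if (bytes == null) {
            bytes = value.getBytes(StandardCharsets.UTF_8);
            cachedBytes = bytes;
        }
        return bytes.length;
    }

    void writeTo(ByteBuffer out) {
        // Read the field once into a local; a racing thread may clear it.
        byte[] bytes = cachedBytes;
        if (bytes == null) {
            bytes = value.getBytes(StandardCharsets.UTF_8);
        }
        cachedBytes = null; // discard: assume most messages are written once
        out.put(bytes);
    }

    boolean hasCachedBytes() {
        return cachedBytes != null;
    }
}
```

Two threads racing here can at worst encode the string twice; both
produce identical bytes, so the race costs time but not correctness.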

The only additional tricky part: on subsequent serializations, it  
would be useful to know the serialized size of the string, in order to  
serialize the string directly into the output buffer, rather than  
needing to create a temporary byte[] just to get the length. This is  
also a solvable problem, and my lame benchmarks suggest this is only a  
small improvement anyway.
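
One way to get the length without the temporary array is to count the
UTF-8 bytes directly from the chars. This is only a sketch; the class
and method names are mine, not anything in the library.

```java
final class Utf8Length {
    // Number of bytes String.getBytes(UTF_8) would produce for s,
    // computed without allocating the byte array.
    static int encodedLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // surrogate pair, one code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // unpaired: getBytes() substitutes '?'
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }
}
```

The cost is a second pass over the string, so whether this beats the
temporary byte[] would need to be measured.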

On Dec 22, 2009, at 22:06 , David Yu wrote:
> I noticed this as well before ... the solution could be applied to  
> *all* generated messages for efficiency.

I don't think other types have this "double encoding" overhead, but I  
could be wrong. This is not a problem for nested messages, since the  
call to Message.getSerializedSize() ends up  
calling .getSerializedSize() on the sub-message, so the size will be  
calculated without actually serializing the sub message.
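
To illustrate why nested messages are cheap to size, here is a toy
model. Node, payloadBytes and the one-byte tag are simplifications I
invented for the example; the real generated code differs in detail.

```java
import java.util.List;

// Toy model of a message that contains sub-messages.
final class Node {
    private final int payloadBytes;   // hypothetical: size of this node's own scalar fields
    private final List<Node> children;
    private int memoizedSize = -1;    // -1 means "not computed yet"

    Node(int payloadBytes, List<Node> children) {
        this.payloadBytes = payloadBytes;
        this.children = children;
    }

    int getSerializedSize() {
        int size = memoizedSize;
        if (size != -1) {
            return size;
        }
        size = payloadBytes;
        for (Node child : children) {
            // Ask the child for its size; nothing is encoded here.
            int childSize = child.getSerializedSize();
            // Assumes a one-byte tag, then a varint length, then the body.
            size += 1 + varintSize(childSize) + childSize;
        }
        memoizedSize = size;
        return size;
    }

    // Bytes needed to encode a non-negative int as a base-128 varint.
    static int varintSize(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            n++;
        }
        return n;
    }
}
```

Strings are different because their encoded length is not known until
the characters have actually been converted to UTF-8.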

While caching the serialized representation of the entire message  
would be faster for applications that call message.writeTo() multiple  
times on a single message, there is a significant memory cost to doing  
that automatically. Personally, I think this is the sort of thing that  
applications should do themselves, if they want it, and not something  
that should be part of the core library.
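
An application that wants this could do something along these lines.
SerializedCache is a name I made up, and the Supplier is whatever
produces the bytes for the message in question.

```java
import java.util.function.Supplier;

// Application-side cache of a message's serialized form.
final class SerializedCache {
    private final Supplier<byte[]> serializer;
    private byte[] bytes;   // null until the first call to get()

    SerializedCache(Supplier<byte[]> serializer) {
        this.serializer = serializer;
    }

    // Serializes on first use, then returns the same array every time.
    // Callers must treat the returned array as read-only.
    synchronized byte[] get() {
        if (bytes == null) {
            bytes = serializer.get();
        }
        return bytes;
    }
}
```

With a real message this would be constructed as
new SerializedCache(message::toByteArray), and the memory cost is then
paid only by the applications that ask for it.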

On Dec 22, 2009, at 19:59 , Kenton Varda wrote:
> Fetching a threadlocal should just be a pointer dereference on any  
> decent threading implementation.  Is it really that expensive in Java?

It used to be very expensive. I think it does the "right thing" now,  
but I'll measure to make sure. It's still more expensive than accessing  
a local field.
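
For reference, the ThreadLocal variant would look roughly like this.
It is a sketch only, and the class name ThreadLocalUtf8 is made up.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class ThreadLocalUtf8 {
    // One encoder per thread, created lazily on first use.
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newEncoder());

    static ByteBuffer encode(String s) {
        CharsetEncoder encoder = ENCODER.get();   // the lookup being measured
        try {
            // encode(CharBuffer) resets the encoder before it starts.
            return encoder.encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            // A fresh encoder reports malformed input such as unpaired surrogates.
            throw new IllegalArgumentException(e);
        }
    }
}
```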

> Solution 3:  Maintain a private freelist of encoder objects within  
> CodedOutputStream.  Allocate one the first time a string is encoded  
> on a particular stream object, and return it to the freelist on  
> flush() (which is always called before discarding the stream unless  
> an exception interrupts serialization).  It may make sense for the  
> freelist to additionally be thread-local to avoid locking, but if  
> it's only one lock per serialization maybe it's not a big deal?

I would guess that this might be more expensive than the ThreadLocal,  
but I don't know that for sure. It would avoid the "one encoder/ 
decoder per thread" overhead. Do you think it is worth it?
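
As I understand the freelist idea, it would be something like the
following. EncoderFreeList and its method names are invented for this
sketch; in the real thing the stream would call acquire() on the first
string it encodes and release() from flush().

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

// Shared pool of UTF-8 encoders, guarded by a single lock.
final class EncoderFreeList {
    private final ArrayDeque<CharsetEncoder> free = new ArrayDeque<>();

    // Takes an idle encoder, or creates one if the pool is empty.
    synchronized CharsetEncoder acquire() {
        CharsetEncoder encoder = free.pollFirst();
        return encoder != null ? encoder : StandardCharsets.UTF_8.newEncoder();
    }

    // Returns an encoder to the pool for the next serialization.
    synchronized void release(CharsetEncoder encoder) {
        encoder.reset();          // clear any state left by a failed encode
        free.addFirst(encoder);
    }

    synchronized int size() {
        return free.size();
    }
}
```

The pool only grows to the peak number of concurrent serializations,
at the price of two lock acquisitions per serialization.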

Evan

--
Evan Jones
http://evanjones.ca/
