[jira] [Updated] (CASSANDRA-5664) Improve serialization in the native protocol

Sylvain Lebresne (JIRA) Wed, 21 Aug 2013 02:37:50 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sylvain Lebresne updated CASSANDRA-5664:
----------------------------------------

    Attachment: 0002-Avoid-copy-when-compressing-native-protocol-frames.txt
                0001-Rewrite-encoding-methods.txt

Thanks for the remarks. And while I don't disagree with those, it seems to me 
that they all require a bit more investigation work to see whether they are 
worth the trouble, and while we should certainly do this additional work at 
some point, I think the patches here do improve the cleanness of the current 
serialization code enough that I'd rather leave those to a follow up and commit 
this now (provided we agree that the patches are already an improvement on the 
status quo). So attaching rebased patches but so far they are just rebased, 
nothing more.

But to address the observations more precisely:

bq. There's quite a bit of string encoding and object serialization going on in 
some of the encodedSize() methods. This means that strings/objects will be 
encoded/serialized twice.

True. For strings, while I'm far from being an expert in unicode encodings and 
charsets, I'm not sure how one could compute the size of the result of UTF8 
encoding without doing the actual encoding. Of course, the current 
CBUtil.sizeOfString does an array allocation which in theory could be avoided, 
but I'm not aware of a method that return the size of an encoded string without 
encoding it (and I'm certainly not writing one manually). So I'm open to 
suggestions, but I doubt this is much of a bottleneck in practice (see below 
for more on that) so... As for "object serialization" (that are not strings), 
I'm not totally sure which ones you are referring to exactly but I don't see 
anything that looks like a great waste of effort.

bq. byte[] allocation and copying in encode() should be possible to avoid when 
serializing strings by careful use of ChannelBuffer.toByteBuffer(), 
CharBuffer.wrap() and CharsetEncoder.encode().

You're probably right, but I believe that means assuming (correct me if I'm 
wrong) that the result of ChannelBuffer.toByteBuffer() *will* share content 
with the ChannelBuffer it is called on, which is not guaranteed by that API. 
And while it is true that as of this patch we only call CBUtil.writeString() on 
ChannelBuffer for which this will be true, it bothers me to rely on it as this 
could easily break in a very hard to notice way in the future. Furthermore, I'm 
reasonably confident that this doesn't really matter in practice (performance 
wise) since if you use prepared statement (which you should if you care about 
performance) and with the optimization of the v2 protocol of not returning 
metadata in the result set for said prepared statement, I don't think there's 
any string serialization going on in the hot path.

bq. It might be worth investigating if the code duplication in encode() and 
encodedSize() can be eliminated

It does is worth the investigation. But we already do similar code duplication 
for other serialization code (namely IVersionedSerializer stuffs).  So given 
the patch is already written and since, as said above, I'm pretty convinced 
this does already is better than the status quo, I'd rather left such 
refactoring to a latter ticket that could consider cleaning up all existing 
serialization code (though on a personal level, that particular code 
duplication from IVersionedSerializer has never bothered me much in practice so 
I don't see fixing it a priority).

                
> Improve serialization in the native protocol
> --------------------------------------------
>
>                 Key: CASSANDRA-5664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5664
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: 0001-Rewrite-encoding-methods.txt, 
> 0002-Avoid-copy-when-compressing-native-protocol-frames.txt
>
>
> Message serialization in the native protocol currently make a Netty's 
> ChannelBuffers.wrappedBuffer(). The rational was to avoid copying of the 
> values bytes when such value are biggish. This has a cost however, especially 
> with lots of small values, and as suggested in CASSANDRA-5422, this might 
> well be a more common scenario for Cassandra, so let's consider directly 
> serializing in a newly allocated buffer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-5664) Improve serialization in the native protocol

Reply via email to