[
https://issues.apache.org/jira/browse/CASSANDRA-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sylvain Lebresne updated CASSANDRA-5664:
----------------------------------------
Attachment: 0002-Avoid-copy-when-compressing-native-protocol-frames.txt
0001-Rewrite-encoding-methods.txt
Thanks for the remarks. And while I don't disagree with those, it seems to me
that they all require a bit more investigation work to see whether they are
worth the trouble, and while we should certainly do this additional work at
some point, I think the patches here do improve the cleanness of the current
serialization code enough that I'd rather leave those to a follow up and commit
this now (provided we agree that the patches are already an improvement on the
status quo). So attaching rebased patches but so far they are just rebased,
nothing more.
But to address the observations more precisely:
bq. There's quite a bit of string encoding and object serialization going on in
some of the encodedSize() methods. This means that strings/objects will be
encoded/serialized twice.
True. For strings, while I'm far from being an expert in unicode encodings and
charsets, I'm not sure how one could compute the size of the result of UTF8
encoding without doing the actual encoding. Of course, the current
CBUtil.sizeOfString does an array allocation which in theory could be avoided,
but I'm not aware of a method that return the size of an encoded string without
encoding it (and I'm certainly not writing one manually). So I'm open to
suggestions, but I doubt this is much of a bottleneck in practice (see below
for more on that) so... As for "object serialization" (that are not strings),
I'm not totally sure which ones you are referring to exactly but I don't see
anything that looks like a great waste of effort.
bq. byte[] allocation and copying in encode() should be possible to avoid when
serializing strings by careful use of ChannelBuffer.toByteBuffer(),
CharBuffer.wrap() and CharsetEncoder.encode().
You're probably right, but I believe that means assuming (correct me if I'm
wrong) that the result of ChannelBuffer.toByteBuffer() *will* share content
with the ChannelBuffer it is called on, which is not guaranteed by that API.
And while it is true that as of this patch we only call CBUtil.writeString() on
ChannelBuffer for which this will be true, it bothers me to rely on it as this
could easily break in a very hard to notice way in the future. Furthermore, I'm
reasonably confident that this doesn't really matter in practice (performance
wise) since if you use prepared statement (which you should if you care about
performance) and with the optimization of the v2 protocol of not returning
metadata in the result set for said prepared statement, I don't think there's
any string serialization going on in the hot path.
bq. It might be worth investigating if the code duplication in encode() and
encodedSize() can be eliminated
It does is worth the investigation. But we already do similar code duplication
for other serialization code (namely IVersionedSerializer stuffs). So given
the patch is already written and since, as said above, I'm pretty convinced
this does already is better than the status quo, I'd rather left such
refactoring to a latter ticket that could consider cleaning up all existing
serialization code (though on a personal level, that particular code
duplication from IVersionedSerializer has never bothered me much in practice so
I don't see fixing it a priority).
> Improve serialization in the native protocol
> --------------------------------------------
>
> Key: CASSANDRA-5664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5664
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Priority: Minor
> Fix For: 2.0
>
> Attachments: 0001-Rewrite-encoding-methods.txt,
> 0002-Avoid-copy-when-compressing-native-protocol-frames.txt
>
>
> Message serialization in the native protocol currently make a Netty's
> ChannelBuffers.wrappedBuffer(). The rational was to avoid copying of the
> values bytes when such value are biggish. This has a cost however, especially
> with lots of small values, and as suggested in CASSANDRA-5422, this might
> well be a more common scenario for Cassandra, so let's consider directly
> serializing in a newly allocated buffer.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira