[ 
https://issues.apache.org/jira/browse/CASSANDRA-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621131#comment-14621131
 ] 

Benedict commented on CASSANDRA-9708:
-------------------------------------

This ticket uses the vint encoding, which ironically shrinks some 
serializations to fewer than 9 bytes, triggering assertion errors stating 
that we haven't enough room to read a vint. 

I've opted to fix this problem in this ticket, since this ticket is the first 
to trigger it, but the approach warrants discussion. [~aweisberg]: your input 
would be welcome. The fix simply extends the buffer to at least 9 bytes. On 
reading, we expect to always know when we're done consuming an input, so 
having some extra bytes is not a problem, and reading too much cannot 
negatively impact anyone using the portion of the buffer we extend into. This 
seems safe to me, but obviously if we ever depend on EOF for safe consumption 
this will break things. I doubt this is something we would ever depend upon, 
though, since it would be prone to corruption.
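
To make the idea concrete, here is a minimal sketch of what "extending the 
buffer" could look like. It is not taken from the patch: the class VIntRoom, 
the helper viewWithRoom, and the simplified decoder (leading 1-bits of the 
first byte give the number of extra bytes) are all assumptions made for 
illustration. The reader insists on 9 readable bytes, so a 1-byte value only 
decodes once the view's limit is pushed into the neighbouring bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; names and layout are assumptions, not the patch.
final class VIntRoom {
    static final int MAX_VINT_SIZE = 9;

    // Returns a view starting at offset whose limit covers at least 9 bytes,
    // even if the logical value is shorter. The bytes past offset + length
    // belong to someone else; we only ever read them, never write them.
    // If the backing buffer itself ends too early, the view stays short and
    // the reader below still fails; this sketch does not address that case.
    static ByteBuffer viewWithRoom(ByteBuffer backing, int offset, int length) {
        ByteBuffer view = backing.duplicate();
        view.clear(); // position = 0, limit = capacity
        int end = Math.max(offset + length, offset + MAX_VINT_SIZE);
        view.limit(Math.min(end, backing.capacity()));
        view.position(offset);
        return view;
    }

    // Simplified unsigned vint reader that, like the assertion described
    // above, refuses to run unless 9 bytes are readable.
    static long readUnsignedVInt(ByteBuffer in) {
        if (in.remaining() < MAX_VINT_SIZE)
            throw new IllegalStateException("not enough room to read a vint");
        int first = in.get() & 0xFF;
        // Count the leading 1-bits of the first byte (0..8 extra bytes).
        int extra = Integer.numberOfLeadingZeros(~(first << 24));
        long value = first & (0xFF >> extra);
        for (int i = 0; i < extra; i++)
            value = (value << 8) | (in.get() & 0xFF);
        return value;
    }
}
```

The reader consumes only the bytes the value actually occupies, which is why 
the extra room is harmless as long as the caller knows where its input ends 
and does not rely on hitting EOF.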

> Serialize ClusteringPrefixes in batches
> ---------------------------------------
>
>                 Key: CASSANDRA-9708
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9708
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 3.0.0 rc1
>
>
> Typically we will have very few clustering prefixes to serialize; however, in 
> theory they are not constrained (or are they, just to a very large number?). 
> Currently we encode a fat header for all values up front (two bits per 
> value), however those bits will typically be zero, and typically we will have 
> only a handful (perhaps 1 or 2) of values.
> This patch modifies the encoding to batch the prefixes in groups of up to 32, 
> along with a header that is vint encoded. Typically this will result in a 
> single byte per batch, but will consume up to 9 bytes if some of the values 
> have their flags set. If we have more than 32 columns, we just read another 
> header. This means we incur no garbage, and compress the data on disk in many 
> cases where we have more than 4 clustering components.
> I do wonder if we shouldn't impose a limit on clustering columns, though: if 
> you have more than a handful, merge performance is going to disintegrate. 32 
> is probably well in excess of what we should be seeing in the wild anyway.
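
The batching scheme in the quoted description can be sketched roughly as 
follows. This is an illustration under stated assumptions, not the patch 
itself: the class BatchedHeaders, the flag values (NORMAL, EMPTY, NULL), the 
bit order within the header, and the vint size rule (7 payload bits per byte 
up to 8 bytes, 9 bytes otherwise) are all hypothetical choices. It packs two 
bits per value into one 64-bit header per batch of up to 32 values and 
reports how many bytes that header would take as an unsigned vint.

```java
// Hypothetical illustration; names, flag values and bit order are assumptions.
final class BatchedHeaders {
    static final int BATCH_SIZE = 32;                 // 32 values * 2 bits = 64 bits
    static final int NORMAL = 0, EMPTY = 1, NULL = 2; // assumed flag meanings

    // Packs the flags for values [start, end) into one header, two bits each,
    // with the first value of the batch in the lowest bits.
    static long makeHeader(int[] flags, int start, int end) {
        long header = 0;
        for (int i = start; i < end; i++)
            header |= ((long) flags[i]) << ((i - start) * 2);
        return header;
    }

    // Extracts the two flag bits of the i-th value within a batch.
    static int flagAt(long header, int i) {
        return (int) ((header >>> (i * 2)) & 3);
    }

    // One header per batch of up to 32 values; more values means more headers.
    static long[] headers(int[] flags) {
        int batches = (flags.length + BATCH_SIZE - 1) / BATCH_SIZE;
        long[] out = new long[batches];
        for (int b = 0; b < batches; b++) {
            int start = b * BATCH_SIZE;
            out[b] = makeHeader(flags, start, Math.min(flags.length, start + BATCH_SIZE));
        }
        return out;
    }

    // Bytes needed for an unsigned vint under the assumed size rule.
    static int unsignedVIntSize(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(value | 1); // 1..64
        return bits > 56 ? 9 : (bits + 6) / 7;
    }
}
```

With all flags zero the header is 0 and costs a single byte, which matches the 
"single byte per batch" case; a batch whose last value has a flag set needs 
the full 9 bytes.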



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
