C. Scott Andreas updated CASSANDRA-14611:
    Component/s: Coordination

> Quorum reads of wide partitions are expensive due to inefficient hashing in 
> AbstractCell
> ----------------------------------------------------------------------------------------
>                 Key: CASSANDRA-14611
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14611
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: AWS, i3.4xlarge (very fast nvme drives), Linux 4.13.
> Cassandra 3.0.16, Java 8u162
>            Reporter: Joseph Lynch
>            Priority: Minor
>              Labels: performance
>         Attachments: quorum_reads_lots_of_small_columns.svg
> 
> While I was analyzing flame graphs as part of investigating CASSANDRA-14605, 
> I noticed that the digest overhead for quorum reads of partitions with many 
> small columns is much higher than I would expect.
> In this particular use case of {{LOCAL_QUORUM}} reads of partitions with 
> (p50, p95, p99) = (60, 5k, 20k) small columns (a {{set<int>}}), digest 
> requests make up 45%+ of the CPU time that Cassandra is using. About 20% is 
> spent reading the columns into the iterator abstraction (bummer), but 
> surprisingly about 20% is spent _calculating the digests themselves_. I would 
> expect digest calculation itself to be close to 2% of the total ... I've 
> attached a flame graph showing this behavior.
> The part that is particularly concerning to me is that ~17% of CPU time (~1/3 
> of the overall digest time) is spent in 
> {{[AbstractCell::digest|https://github.com/apache/cassandra/blob/d52c7b8c595cc0d06fc3607bf16e3f595f016bb6/src/java/org/apache/cassandra/db/rows/AbstractCell.java#L54-L56]}}, 
> but only ~2% is spent digesting the actual cell values. Looking deeper, the 
> implementations of 
> {{[FBUtilities::updateWithInt|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/FBUtilities.java#L834]}} 
> and the related functions ({{updateWithBoolean}}, {{updateWithLong}}, etc.) 
> are very inefficient, as they call {{update}} on the underlying 
> {{MessageDigest}} one byte at a time. When you have lots of small cells, the 
> overhead of this approach is significant. In particular, the underlying MD5 
> implementation is not as efficient as it should be for lots of single-byte 
> updates: it does buffer them into a 64-byte internal buffer, but that is too 
> small to reach maximum throughput. In my local benchmarking, just grouping 
> the metadata updates into a 13-byte array yields a 3x performance 
> improvement, at the cost of allocating 13 bytes per cell during the hash that 
> we previously did not allocate.
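> For illustration, the grouping change is roughly the following shape (the 
> class and method names here are just a sketch, not the actual patch): pack 
> the 13 bytes of per-cell metadata (timestamp, ttl, counter flag) into one 
> array and call {{update}} once instead of thirteen times.
> {code:java}
> import java.nio.ByteBuffer;
> import java.security.MessageDigest;
> 
> // Illustrative sketch only: one MessageDigest.update call for all cell
> // metadata, versus one call per byte as updateWithLong/updateWithInt/
> // updateWithBoolean do today.
> public final class CellMetadataDigest
> {
>     public static void updateWithCellMetadata(MessageDigest digest,
>                                               long timestamp,
>                                               int ttl,
>                                               boolean isCounterCell)
>     {
>         byte[] buf = new byte[13]; // 8 (long) + 4 (int) + 1 (boolean)
>         ByteBuffer.wrap(buf)
>                   .putLong(timestamp)
>                   .putInt(ttl)
>                   .put((byte) (isCounterCell ? 1 : 0));
>         digest.update(buf);
>     }
> }
> {code}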
> I've been playing around with a patch that swaps the normal MD5 digest for a 
> buffered one (e.g. a fixed 1 KB byte array per read thread that rolls up 
> digest updates and flushes to the underlying digest whenever the buffer 
> fills). So far such a buffered MessageDigest shows 2x better performance 
> hashing the metadata of 10k columns in a jmh microbenchmark I wrote. If I 
> combine that with grouping the metadata into a 13-byte array, I get 10x more 
> throughput. I need to do more profiling, but I think that buffering our 
> digest calls could really help for wide partitions of smaller columns (as 
> opposed to fat single columns, where the metadata-hashing overhead is 
> probably negligible).
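> The buffering idea looks roughly like this (again just a sketch with 
> illustrative names, not the actual patch): wrap the {{MessageDigest}}, roll 
> small updates into a fixed 1 KB array, and only push to the underlying digest 
> when the buffer fills or the final digest is requested.
> {code:java}
> import java.security.MessageDigest;
> 
> // Illustrative sketch only: batches small update() calls so the underlying
> // MessageDigest sees a few large updates instead of many one-byte ones.
> public final class BufferedMessageDigest
> {
>     private final MessageDigest delegate;
>     private final byte[] buffer = new byte[1024]; // fixed 1 KB roll-up buffer
>     private int position = 0;
> 
>     public BufferedMessageDigest(MessageDigest delegate)
>     {
>         this.delegate = delegate;
>     }
> 
>     public void update(byte b)
>     {
>         if (position == buffer.length)
>             flush();
>         buffer[position++] = b;
>     }
> 
>     public void update(byte[] bytes, int offset, int length)
>     {
>         if (length > buffer.length - position)
>         {
>             // Doesn't fit: flush what we have and pass the input straight through.
>             flush();
>             delegate.update(bytes, offset, length);
>         }
>         else
>         {
>             System.arraycopy(bytes, offset, buffer, position, length);
>             position += length;
>         }
>     }
> 
>     public byte[] digest()
>     {
>         flush();
>         return delegate.digest();
>     }
> 
>     private void flush()
>     {
>         if (position > 0)
>         {
>             delegate.update(buffer, 0, position);
>             position = 0;
>         }
>     }
> }
> {code}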
