[
https://issues.apache.org/jira/browse/CASSANDRA-14611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566156#comment-16566156
]
Joseph Lynch commented on CASSANDRA-14611:
------------------------------------------
With CASSANDRA-13291 we could use a better hash function (like xxHash!), so
this might not apply to trunk; I will try to test it out.
> Quorum reads of wide partitions are expensive due to inefficient hashing in
> AbstractCell
> ----------------------------------------------------------------------------------------
>
> Key: CASSANDRA-14611
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14611
> Project: Cassandra
> Issue Type: Improvement
> Environment: AWS, i3.4xlarge (very fast nvme drives), Linux 4.13.
> Cassandra 3.0.16, Java 8u162
> Reporter: Joseph Lynch
> Priority: Minor
> Labels: performance
> Attachments: quorum_reads_lots_of_small_columns.svg
>
>
> While I was analyzing flame graphs as part of investigating CASSANDRA-14605 I
> noticed that digest overhead for quorum reads of partitions with many small
> columns is much higher than I would expect.
> In this particular use case of ({{LOCAL_QUORUM}}) reads of partitions with
> (p50, p95, p99) = (60, 5k, 20k) small columns (a {{set<int>}}), digest
> requests make up 45%+ of the CPU time that Cassandra is using. About 20% is
> reading the columns into the iterator abstraction (bummer) but surprisingly
> about 20% is _calculating the digests themselves_. I would expect digest
> calculation itself to be close to 2% total ... I've attached a flame graph
> showing this behavior.
> The part that is particularly concerning to me is that ~17% of CPU time (~1/3
> of the overall digest time) is spent in calculating {{[AbstractCell::digest
> |https://github.com/apache/cassandra/blob/d52c7b8c595cc0d06fc3607bf16e3f595f016bb6/src/java/org/apache/cassandra/db/rows/AbstractCell.java#L54-L56]}}
> but only ~2% is digesting the actual cell values. Looking deeper, the
> implementations of
> {{[FBUtilities::updateWithInt|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/FBUtilities.java#L834]}}
> and other related functions ({{updateWithBoolean}}, {{updateWithLong}},
> etc...) are very inefficient, as they call update on the underlying
> {{MessageDigest}} one byte at a time. When you have lots of small cells the
> overhead of this approach is apparently significant. In particular, the
> underlying MD5 implementation is not as efficient as it should be for lots
> of single-byte updates: it does buffer them into a 64-byte buffer, but that
> is too small to reach maximum throughput. In my local benchmarking, just
> grouping the metadata updates into a 13-byte array yields a 3x performance
> improvement, though it costs 13 bytes of allocated memory per cell during
> the hash that we previously did not allocate.
> I've been playing around with a patch which swaps the normal MD5 digest for
> a buffered one (e.g. a fixed 1kb byte array per read thread that rolls up
> digest updates; whenever the buffer fills, it performs a single underlying
> update). So far such a buffered MessageDigest shows 2x better performance
> hashing the metadata of 10k columns in a jmh microbenchmark I wrote. If I
> combine that with grouping the metadata into a 13-byte array, I get 10x more
> throughput. I need to do more profiling, but I think buffering our digest
> calls could really help for wide partitions of smaller columns (as opposed
> to fat single columns, where the metadata hashing overhead is probably
> negligible).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)