[
https://issues.apache.org/jira/browse/CASSANDRA-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267041#comment-17267041
]
Venkata Harikrishna Nukala commented on CASSANDRA-14898:
--------------------------------------------------------
Changes are here:
[https://github.com/nvharikrishna/cassandra/commit/c090ed640440b284686131ac73b3a6295cea27c0]
Circle CI:
[https://app.circleci.com/pipelines/github/nvharikrishna/cassandra?branch=14898-keycache-performance-fix]
Tested locally in my laptop with 5000+ SSTables and these are the results
observed:
Before:
{code:java}
Test 1:
INFO [pool-3-thread-1] 2021-01-07 15:40:24,767 AutoSavingCache.java:177 -
Completed loading (148997 ms; 589886 keys) KeyCache cache
Test 2:
INFO [pool-3-thread-1] 2021-01-07 16:55:23,327 AutoSavingCache.java:177 -
Completed loading (99441 ms; 396405 keys) KeyCache cache
{code}
After:
{code:java}
INFO [pool-3-thread-1] 2021-01-08 17:54:11,402 AutoSavingCache.java:177 -
Completed loading (8502 ms; 401358 keys) KeyCache cache{code}
> Key cache loading is very slow when there are many SSTables
> -----------------------------------------------------------
>
> Key: CASSANDRA-14898
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14898
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Environment: AWS i3.2xlarge, 4 physical cores (8 threads), 60GB of
> RAM, loading about 8MB of KeyCache with 10k keys in it.
> Reporter: Joey Lynch
> Assignee: Joey Lynch
> Priority: Low
> Labels: Performance, low-hanging-fruit
> Attachments: key_cache_load_slow.svg
>
>
> While dealing with a production issue today where some 3.0.17 nodes had close
> to ~8k sstables on disk due to excessive write pressure, we had a few nodes
> crash due to OOM and then they took close to 17 minutes to load the key cache
> and recover. This excessive key cache load significantly increased the
> duration of the outage (to mitigate we just removed the saved key cache
> files). For example here is one example taking 17 minutes to load 10k keys,
> or about 10 keys per second (which is ... very slow):
> {noformat}
> INFO [pool-3-thread-1] 2018-11-15 21:50:21,885 AutoSavingCache.java:190 -
> reading saved cache /mnt/data/cassandra/saved_caches/KeyCache-d.db
> INFO [pool-3-thread-1] 2018-11-15 22:07:16,490 AutoSavingCache.java:166 -
> Completed loading (1014606 ms; 10103 keys) KeyCache cache
> {noformat}
> I've witnessed similar behavior in the past with large LCS clusters, and
> indeed it appears that any time the number of sstables is large, KeyCache
> loading takes a _really_ long time. Today I got a flame graph and I believe
> that I found the issue and I think it's reasonably easy to fix. From what I
> can tell the {{KeyCacheSerializer::deserialize}} [method
> |https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/service/CacheService.java#L445]
> which is called for every key is linear in the number of sstables due to the
> [call|https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/service/CacheService.java#L459]
> to {{ColumnFamilyStore::getSSTables}} which ends up calling {{View::select}}
> [here|https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/db/lifecycle/View.java#L139].
> The {{View::select}} call is linear in the number of sstables and causes a
> _lot_ of {{HashSet}}
> [resizing|https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/db/lifecycle/View.java#L139]
> when the number of sstables is much greater than 16 (the default size of the
> backing {{HashMap}}).
> As we see in the attached flamegraph we spend 50% of our CPU time in these
> {{getSSTable}} calls, of which 36% is spent adding sstables to the HashSet in
> {{View::select}} and 17% is spent just iterating the sstables in the first
> place. A full 16% of CPU time is spent _just resizing the HashMap_. Then
> another 4% is spend calling {{CacheService::findDesc}} which does [a linear
> search|https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/service/CacheService.java#L475]
> for the sstable generation.
> I believe that this affects at least Cassandra 3.0.17 and trunk, and could be
> pretty easily fixed by either caching the getSSTables call or at the very
> least pre-sizing the {{HashSet}} in {{View::select}} to be the size of the
> sstables map.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]