[
https://issues.apache.org/jira/browse/CASSANDRA-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Lerer updated CASSANDRA-14415:
---------------------------------------
Status: Ready to Commit (was: Review In Progress)
> Performance regression in queries for distinct keys
> ---------------------------------------------------
>
> Key: CASSANDRA-14415
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14415
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Samuel Klock
> Assignee: Samuel Klock
> Priority: Normal
> Labels: performance
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Running Cassandra 3.0.16, we observed a major performance regression
> affecting {{SELECT DISTINCT keys}}-style queries against certain tables.
> Based on some investigation (guided by helpful feedback from Benjamin on the
> dev list), we tracked the regression down to two problems.
> * One is that Cassandra was reading more data from disk than was necessary
> to satisfy the query. This was fixed under CASSANDRA-10657 in a later 3.x
> release.
> * The other, which remains even once the fix for CASSANDRA-10657 is
> incorporated, is this code snippet in {{RebufferingInputStream}}:
> {code:java}
> @Override
> public int skipBytes(int n) throws IOException
> {
>     if (n < 0)
>         return 0;
>     int requested = n;
>     int position = buffer.position(), limit = buffer.limit(), remaining;
>     while ((remaining = limit - position) < n)
>     {
>         n -= remaining;
>         buffer.position(limit);
>         reBuffer();
>         position = buffer.position();
>         limit = buffer.limit();
>         if (position == limit)
>             return requested - n;
>     }
>     buffer.position(position + n);
>     return requested;
> }
> {code}
> The gist of it is that to skip bytes, the stream has to read those bytes
> into memory and then throw them away. In our tests, we were spending a lot
> of time in this method, so it looked like the chief drag on performance.
> We noticed that the subclass of {{RebufferingInputStream}} in use for our
> queries, {{RandomAccessReader}} (over compressed sstables), implements a
> {{seek()}} method. Overriding {{skipBytes()}} in it to use {{seek()}}
> instead was sufficient to fix the performance regression.
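> The override is small. Here's a hedged sketch of the approach (not
> necessarily the exact committed patch); it assumes {{RandomAccessReader}}'s
> {{getFilePointer()}} and {{length()}} report the current absolute file
> position and the file length, as their names suggest:
> {code:java}
> // In RandomAccessReader: skip by repositioning instead of reading and
> // discarding buffered bytes.
> @Override
> public int skipBytes(int n) throws IOException
> {
>     if (n <= 0)
>         return 0;
>     long current = getFilePointer();               // absolute position in the file
>     long target = Math.min(current + n, length()); // don't seek past end-of-file
>     seek(target);                                  // repositions without reading
>     return (int) (target - current);
> }
> {code}
> With compressed sstables, seeking presumably lets the reader jump over
> intermediate chunks entirely rather than decompressing them just to discard
> their contents, which is where the win comes from.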
> The performance difference is significant for tables with large values. It's
> straightforward to evaluate with very simple key-value tables, e.g.:
> {{CREATE TABLE testtable (key TEXT PRIMARY KEY, value BLOB);}}
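> For reference, exhausting such a query with a fixed page size is easy to
> reproduce with the DataStax Java driver. This is an illustrative sketch
> rather than our actual test harness; the contact point and keyspace name
> are placeholders, and the fetch size of 24 matches the page size used in
> the timings below:
> {code:java}
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
> import com.datastax.driver.core.SimpleStatement;
>
> public class DistinctKeysTimer
> {
>     public static void main(String[] args)
>     {
>         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>              Session session = cluster.connect("testks"))
>         {
>             SimpleStatement stmt = new SimpleStatement("SELECT DISTINCT key FROM testtable");
>             stmt.setFetchSize(24);  // page size: 24 rows per page
>             long start = System.nanoTime();
>             long count = 0;
>             for (Row row : session.execute(stmt))  // iterating fetches all pages
>                 count++;
>             System.out.printf("%d keys in %.3f s%n",
>                               count, (System.nanoTime() - start) / 1e9);
>         }
>     }
> }
> {code}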
> We did some basic experimentation with the following variations (all in a
> single-node 3.11.2 cluster with off-the-shelf settings running on a dev
> workstation):
> * small values (1 KB, 100,000 entries), somewhat larger values (25 KB,
> 10,000 entries), and much larger values (1 MB, 10,000 entries);
> * compressible data (a single byte repeated) and uncompressible data (output
> from {{openssl rand $bytes}}); and
> * with and without sstable compression. (With compression, we use
> Cassandra's defaults.)
> The difference is most conspicuous for tables with large, uncompressible data
> stored in compressed sstables (which happens to describe the use case that
> triggered our investigation). It is smaller but still readily apparent for
> tables whose data compresses effectively. For uncompressible data with
> sstable compression disabled, there is no appreciable difference.
> Here's what the performance looks like without our patch for the 1-MB entries
> (times in seconds, five consecutive runs for each data set, all exhausting
> the results from a {{SELECT DISTINCT key FROM ...}} query with a page size of
> 24):
> {noformat}
> working on compressible
> 5.21180510521
> 5.10270500183
> 5.22311806679
> 4.6732840538
> 4.84219098091
> working on uncompressible_uncompressed
> 55.0423607826
> 0.769015073776
> 0.850513935089
> 0.713396072388
> 0.62596988678
> working on uncompressible
> 413.292617083
> 231.345913887
> 449.524993896
> 425.135111094
> 243.469946861
> {noformat}
> and with the fix:
> {noformat}
> working on compressible
> 2.86733293533
> 1.24895811081
> 1.108907938
> 1.12742400169
> 1.04647302628
> working on uncompressible_uncompressed
> 56.4146180153
> 0.895509958267
> 0.922824144363
> 0.772884130478
> 0.731923818588
> working on uncompressible
> 64.4587619305
> 1.81325793266
> 1.52577018738
> 1.41769099236
> 1.60442209244
> {noformat}
> The long initial runs for the uncompressible data presumably come from
> repeatedly hitting the disk. In contrast to the runs without the fix, the
> initial runs seem to be effective at warming the page cache (as lots of data
> is skipped, so the data that's read can fit in memory), so subsequent runs
> are faster.
> For smaller data sets, {{RandomAccessReader.seek()}} and
> {{RebufferingInputStream.skipBytes()}} are approximately equivalent in their
> behavior (reducing to changing the position pointer of an in-memory buffer
> most of the time), so there isn't much difference. Here's before the fix for
> the 1-KB entries:
> {noformat}
> working on small_compressible
> 8.34115099907
> 8.57280993462
> 8.3534219265
> 8.55130696297
> 8.17362189293
> working on small_uncompressible_uncompressed
> 7.85155582428
> 7.54075288773
> 7.50106596947
> 7.39202189445
> 7.95735621452
> working on small_uncompressible
> 7.89256501198
> 7.88875198364
> 7.9013261795
> 7.76551413536
> 7.84927678108
> {noformat}
> and after:
> {noformat}
> working on small_compressible
> 8.29225707054
> 7.57822394371
> 8.10092878342
> 8.21332192421
> 8.19347810745
> working on small_uncompressible_uncompressed
> 7.74823594093
> 7.81218004227
> 7.68660092354
> 7.95432114601
> 7.77612304688
> working on small_uncompressible
> 8.18260502815
> 8.21010804176
> 8.1233921051
> 7.31543707848
> 7.91079998016
> {noformat}
> The effect is similar for the 25-KB entries, which might enjoy a slight
> performance benefit from the patch (perhaps because they're larger than the
> default buffer size defined in {{RandomAccessReader}}). Before:
> {noformat}
> working on medium_compressible
> 0.988080978394
> 1.02464294434
> 0.977658033371
> 1.02553391457
> 0.769363880157
> working on medium_uncompressible_uncompressed
> 1.07718396187
> 1.08547902107
> 1.12398791313
> 1.10300898552
> 1.08757281303
> working on medium_uncompressible
> 0.940990209579
> 0.917474985123
> 0.768013954163
> 0.871683835983
> 0.814841985703
> {noformat}
> and after:
> {noformat}
> working on medium_compressible
> 0.829009056091
> 0.705173015594
> 0.603646993637
> 0.820069074631
> 0.873830080032
> working on medium_uncompressible_uncompressed
> 0.785156965256
> 0.808106184006
> 0.848286151886
> 0.857885837555
> 0.825689077377
> working on medium_uncompressible
> 0.845101118088
> 0.913790941238
> 0.824147939682
> 0.849114894867
> 0.85981798172
> {noformat}
> In short, this looks like a pretty straightforward performance win with
> negligible cost. (It's worth noting that for our use case, disabling sstable
> compression is clearly the _best_ solution, but there's still reasonably
> clear benefit from this minor fix for data sets with larger, compressible
> values, as well as (presumably) for data sets with a mix of compressible and
> uncompressible values in environments where storage is limited.)