[ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203258#comment-15203258 ]

Robert Stupp edited comment on CASSANDRA-11206 at 3/20/16 11:59 AM:
--------------------------------------------------------------------

I just finished most of the coding for this ticket - i.e. a "shallow" 
RowIndexEntry without {{IndexInfo}} - and ran a poor man's comparison of 
current trunk against 11206 using different partition sizes, covering writes, 
a major compaction and reads. The results are really promising, especially 
with big and huge partitions (tested with partitions up to 8 GB).
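
To sketch the idea (this is illustrative only, not the actual patch - class 
and field names here are made up): a shallow entry keeps just a few fixed-size 
positions on heap and leaves the serialized {{IndexInfo}} samples in the index 
file:

{code:java}
// A minimal sketch of a "shallow" index entry, not the actual patch
// (class and field names are made up): heap usage stays flat no matter
// how large the partition grows, because the IndexInfo samples stay on disk.
public class ShallowRowIndexEntry
{
    private final long dataFilePosition;  // start of the partition in the data file
    private final long indexFilePosition; // start of this partition's IndexInfo block in the index file
    private final int indexedEntryCount;  // number of IndexInfo samples serialized on disk

    public ShallowRowIndexEntry(long dataFilePosition, long indexFilePosition, int indexedEntryCount)
    {
        this.dataFilePosition = dataFilePosition;
        this.indexFilePosition = indexFilePosition;
        this.indexedEntryCount = indexedEntryCount;
    }

    // Note what is *not* here: no List<IndexInfo> field. Individual samples
    // are read from the index file on demand when a read needs them.
    public long dataFilePosition()  { return dataFilePosition; }
    public long indexFilePosition() { return indexFilePosition; }
    public int indexedEntryCount()  { return indexedEntryCount; }
}
{code}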

Reads against big partitions really benefit from 11206. For example, with 11206 
it takes a couple of seconds to perform 5000 random reads against 8 GB 
partitions vs. many minutes (not a typo) on current trunk. On trunk, the heap 
also gets quite full during those reads and causes a lot of GC pressure.
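
The reason, in a nutshell: building on the offset map from CASSANDRA-10314, a 
read can binary-search the serialized {{IndexInfo}} samples and deserialize 
only the ~log(N) entries it actually compares against, instead of 
materializing the whole list on a cache miss. A rough sketch (reader and type 
names are illustrative):

{code:java}
import java.io.IOException;
import java.util.Comparator;

// Hedged sketch of why reads get cheaper (all names here are illustrative):
// with fixed offsets to each serialized IndexInfo - the "offset map" from
// CASSANDRA-10314 - a lookup binary-searches the on-disk samples and
// deserializes only the ~log2(N) entries it compares against, instead of
// materializing the whole List<IndexInfo> on a cache miss.
interface IndexInfoReader
{
    int entryCount();
    IndexInfoStub readIndexInfo(int i) throws IOException; // random access via the offset map
}

final class IndexInfoStub
{
    final byte[] lastName; // clustering bound of the indexed 64KB block, simplified to bytes
    final long offset;     // offset of that block within the partition

    IndexInfoStub(byte[] lastName, long offset)
    {
        this.lastName = lastName;
        this.offset = offset;
    }
}

final class IndexBinarySearch
{
    // Returns the offset of the first indexed block whose last clustering is >= key,
    // or -1 if the key sorts after every sample.
    static long search(IndexInfoReader reader, byte[] key, Comparator<byte[]> cmp) throws IOException
    {
        int low = 0;
        int high = reader.entryCount() - 1;
        long result = -1;
        while (low <= high)
        {
            int mid = (low + high) >>> 1;
            IndexInfoStub info = reader.readIndexInfo(mid); // only this one entry is deserialized
            if (cmp.compare(info.lastName, key) < 0)
            {
                low = mid + 1;
            }
            else
            {
                result = info.offset;
                high = mid - 1;
            }
        }
        return result;
    }
}
{code}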

Compactions also benefit from 11206 GC-wise - but not CPU- or I/O-wise, since 
the same amount of work still has to be done. 11206 "just" reduces GC pressure.

Flushes also benefit, since the flush writer can "forget" {{IndexInfo}} objects sooner.
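
Roughly like this (again just a sketch with made-up names, not the actual 
patch): once a sample has been serialized to the index file, nothing holds a 
reference to it anymore:

{code:java}
import java.io.DataOutput;
import java.io.IOException;

// Illustrative sketch of the flush-side benefit (names are made up): each
// IndexInfo sample is serialized to the index file as soon as its 64KB block
// is complete, instead of piling up in a List<IndexInfo> until the whole
// (possibly multi-GB) partition has been written.
final class EagerIndexWriter
{
    private final DataOutput indexOut;

    EagerIndexWriter(DataOutput indexOut)
    {
        this.indexOut = indexOut;
    }

    void onIndexBlockClosed(byte[] lastName, long offsetInPartition) throws IOException
    {
        indexOut.writeInt(lastName.length); // simplified serialization format
        indexOut.write(lastName);
        indexOut.writeLong(offsetInPartition);
        // no on-heap copy retained -> the sample is GC-eligible right away
    }
}
{code}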

This ticket will *not* raise the limit on cells.

[~doanduyhai], you're right. Having the ability to handle big partitions has a 
direct influence on data modeling. I'd not say "you are no longer limited by 
the size of your partitions". This ticket _raises_ the current limitation WRT 
GC pressure and read performance. In theory the limit went away, but as you 
say, compaction becomes even more important, and other operational tasks like 
replacing nodes or changing topology need to be considered.

My next steps are:
* fix some unit tests that no longer work since they relied on the old 
implementation (expected to have {{IndexInfo}} on heap)
* clean up the code
* run some tests on cstar

I only ran a poor man's comparison - on my laptop with a small-ish 3G heap and 
default unit-test settings. That's why I did not note exact numbers. But I'd 
like to show the GC pressure of the same test run against trunk and 11206:

!trunk-gc.png|GC on current trunk!

!11206-gc.png|GC on 11206!

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>         Attachments: 11206-gc.png, trunk-gc.png
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which not only creates a lot of 
> GC overhead (as noted in CASSANDRA-9754) but also causes non-negligible I/O 
> activity (relative to reading a single 64KB row range) as partitions get 
> truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.


