[ https://issues.apache.org/jira/browse/CASSANDRA-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932597#action_12932597 ]

Peter Schuller commented on CASSANDRA-1658:
-------------------------------------------

Ryan: Yes, good point.

Jonathan:

So originally with 1608 I looked at the size restrictions as something you'd 
set pretty high; essentially limiting sstables to "very large" instead of 
"huge" for bloom filter purposes (and probably for disk space purposes - 
avoiding spikes). Limiting sizes enough that individual sstable compactions 
are no longer an issue would imply pretty severe limits (on the order of a 
smallish subset of RAM size rather than, say, 100 GB).

My main concern is the number of sstables this would generate if the maximum 
size were e.g. 500 MB or something along those lines (e.g., a 1 TB data set 
would be spread across some 2000 sstables). That means row locality (across 
sstables) becomes significantly more important for large data sets. However, 
assuming 1608 works well enough, and coupled with rate-limited compaction 
(outside the scope of this ticket or 1608), I agree that this should 
essentially become unnecessary - effectively just a complex way to achieve 
what 1608 already provides as a side effect.

That said, I'm still not sold on how 1608 is supposed to accomplish 
sufficiently aggressive row "de-spreading" without incurring significant 
overhead by compacting too aggressively. But I am starting to think I have 
misunderstood something about 1608, so take that with a grain of salt.



> support incremental sstable switching
> -------------------------------------
>
>                 Key: CASSANDRA-1658
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1658
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Priority: Minor
>
> I have been thinking about how to minimize the impact of compaction further 
> beyond CASSANDRA-1470. 1470 deals with the impact of the compaction process 
> itself, in that it avoids going through the buffer cache; however, once 
> compaction is complete you are still switching to new sstables, which 
> implies cold reads.
> Instead of switching all at once, one could keep both the old and new 
> sstables around for a bit and incrementally switch over traffic to the new 
> sstables.
> A given request would go to the new or old sstables depending on e.g. the 
> hash of the row key coupled with the point in time relative to compaction 
> completion and to the intended sstable switch-over period (see the sketch 
> below).
> In terms of end-user configuration/mnemonics, one would specify, for a given 
> column family, something like "sstable transition period per GB of data" or 
> similar. The "per GB of data" would refer to the size of the newly written 
> sstable after a compaction. So, for a major compaction you would wait a very 
> significant period of time, since the entire database just went cold; for a 
> minor compaction, you would only wait a short period of time.
> The result should be a modest cost in e.g. disk space usage, but hopefully a 
> very significant benefit in terms of making the sstable transition as smooth 
> as possible for the node.
> I like this because it feels pretty simple, does not rely on OS-specific 
> features or on any specific support from the OS other than a "well 
> functioning cache mechanism", and does not imply something hugely 
> significant like writing our own page cache layer. The CPU overhead should 
> be very small, but the improvement in terms of disk I/O should be very 
> significant for workloads where it matters.
> The feature would be optional and per-sstable (or possibly global for the 
> node).
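
(Illustrative only - a minimal sketch of the switch-over idea in the quoted
description above, not actual Cassandra code. All class, method and setting
names below are made up; it assumes a per-column-family "transition period
per GB" setting and routes a growing fraction of row keys, by key hash, to
the new sstables as the window elapses.)

{code}
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; none of these names exist in the code base.
public class IncrementalSwitchSketch
{
    private final long compactionFinishedAtMillis;
    private final long transitionWindowMillis;

    public IncrementalSwitchSketch(long compactionFinishedAtMillis,
                                   long newSstableSizeBytes,
                                   long transitionPeriodPerGbMillis)
    {
        this.compactionFinishedAtMillis = compactionFinishedAtMillis;
        // The switch-over window is proportional to the size of the newly
        // written sstable: a major compaction gets a long window, a small
        // minor compaction a short one.
        double sizeGb = newSstableSizeBytes / (1024.0 * 1024.0 * 1024.0);
        this.transitionWindowMillis = (long) (sizeGb * transitionPeriodPerGbMillis);
    }

    /** True if a read for this row key should go to the new sstable(s). */
    public boolean readFromNew(byte[] rowKey, long nowMillis)
    {
        long elapsed = nowMillis - compactionFinishedAtMillis;
        if (elapsed >= transitionWindowMillis)
            return true; // window over; the old sstables can be dropped
        // The fraction of keys routed to the new sstables grows linearly
        // over the window; each key flips over exactly once.
        double switchedFraction = (double) elapsed / transitionWindowMillis;
        double keyBucket = (hash(rowKey) & 0x7fffffff) / (double) Integer.MAX_VALUE;
        return keyBucket < switchedFraction;
    }

    private static int hash(byte[] key)
    {
        int h = 1;
        for (byte b : key)
            h = 31 * h + b;
        return h;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        // 10 GB sstable, 60 s of transition per GB => ~10 minute window.
        IncrementalSwitchSketch s =
            new IncrementalSwitchSketch(now, 10L << 30, TimeUnit.SECONDS.toMillis(60));
        System.out.println(s.readFromNew("row-42".getBytes(), now + TimeUnit.MINUTES.toMillis(3)));
    }
}
{code}

With e.g. 60 seconds per GB, a 10 GB post-compaction sstable would transition
over roughly 10 minutes, while a small minor compaction would switch over
almost immediately - matching the "per GB of data" intent above.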

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
