[
https://issues.apache.org/jira/browse/CASSANDRA-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575153#comment-13575153
]
Jouni Hartikainen commented on CASSANDRA-4784:
----------------------------------------------
I'm not sure I understood this correctly, but wouldn't this change make
memtable flushes generate much more random I/O than before? Especially with
vnodes, wouldn't the incoming data be spread across num_tokens files per CF
instead of one file per CF? Wouldn't this affect compactions as well? E.g.,
with the default size-tiered strategy, instead of compacting 4 larger SSTables
into one even larger SSTable per CF, we would be compacting num_tokens * 4
smaller files into num_tokens larger ones per CF.
Am I missing something here?
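
A rough sketch of the arithmetic behind this question, not a measurement.
It assumes writes are spread evenly over the token ranges and that a
size-tiered compaction merges 4 similarly sized SSTables. The function name,
the 64 MiB flush size, and num_tokens = 256 are illustrative assumptions,
not values taken from the ticket or the patch.

```python
def flush_layout(num_tokens, flush_bytes, flushes_per_compaction=4):
    """Estimate per-CF file counts when each token range gets its own SSTable.

    Assumes data is spread uniformly across the node's token ranges.
    """
    return {
        # One SSTable per token range is written on every flush.
        "files_per_flush": num_tokens,
        # Each of those files holds an equal share of the flushed data.
        "bytes_per_file": flush_bytes / num_tokens,
        # Size-tiered compaction then runs once per range.
        "compaction_inputs": num_tokens * flushes_per_compaction,
        "compaction_outputs": num_tokens,
    }


# Hypothetical numbers: a 64 MiB flush on a node with 256 vnodes.
layout = flush_layout(num_tokens=256, flush_bytes=64 * 1024 * 1024)
for name, value in layout.items():
    print(name, value)
```

With num_tokens = 1 the same function describes the current layout: one file
per flush and 4 inputs merged into 1 output per CF.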
> Create separate sstables for each token range handled by a node
> ---------------------------------------------------------------
>
> Key: CASSANDRA-4784
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4784
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 1.2.0 beta 1
> Reporter: sankalp kohli
> Assignee: Benjamin Coverston
> Priority: Minor
> Labels: perfomance
> Fix For: 2.0
>
> Attachments: 4784.patch
>
>
> Currently, each sstable has data for all the ranges that node is handling. If
> we change that and rather have separate sstables for each range that node is
> handling, it can lead to some improvements.
> Improvements:
> 1) Node rebuild will be very fast as sstables can be directly copied over to
> the bootstrapping node. It will minimize any application level logic. We can
> directly use Linux native methods to transfer sstables without using CPU,
> which puts less pressure on the serving node. I think in theory it will be
> the fastest way to transfer data.
> 2) Backup can transfer only the sstables that belong to a node's primary
> key range.
> 3) ETL processes can copy just one replica of the data and will be much
> faster.
> Changes:
> We can split the writes into multiple memtables, one for each range the node
> is handling. The sstables flushed from these can record which range of data
> they cover.
> I think there will be no change for reads, as they work with interleaved
> data anyway. But maybe we can improve there as well?
> Complexities:
> The change does not look very complicated. I am not taking into account how
> it will work when ranges are being changed for nodes.
> Vnodes might make this work more complicated. We can also have a bit on each
> sstable which says whether it is primary data or not.
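
The "Changes" paragraph in the quoted description (one memtable per token
range, each flushed to its own sstable) could be sketched roughly as below.
This is a toy model, not the attached patch and not Cassandra's code: the
class name, the integer tokens, and the plain dicts standing in for memtables
are all assumptions made for illustration. Ranges are treated as
(previous_end, end], wrapping around the ring.

```python
import bisect
from collections import defaultdict


class RangePartitionedMemtables:
    """Toy model: route each write to a memtable owned by one token range."""

    def __init__(self, range_end_tokens):
        # Sorted end tokens of the ranges this node handles.
        self.ends = sorted(range_end_tokens)
        # range index -> {key: value}; a dict stands in for a memtable.
        self.memtables = defaultdict(dict)

    def range_index(self, token):
        # First range whose end token is >= token; tokens past the last
        # end wrap around to the first range.
        return bisect.bisect_left(self.ends, token) % len(self.ends)

    def write(self, token, key, value):
        self.memtables[self.range_index(token)][key] = value

    def flush(self):
        # One output per non-empty memtable, tagged with its range index.
        # Rows are sorted by key here for simplicity.
        flushed = [(idx, sorted(rows.items()))
                   for idx, rows in sorted(self.memtables.items())]
        self.memtables.clear()
        return flushed


tables = RangePartitionedMemtables([100, 200, 300])
tables.write(50, "a", 1)
tables.write(100, "b", 2)
tables.write(150, "c", 3)
tables.write(350, "d", 4)  # wraps around to the first range
flushed = tables.flush()
for idx, rows in flushed:
    print(idx, rows)
```

The sketch ignores what happens when ranges move between nodes, which the
description also leaves open.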