[jira] [Comment Edited] (CASSANDRA-12245) initial view build can be parallel

Tom van der Woerdt (JIRA) Wed, 20 Jul 2016 07:18:38 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385924#comment-15385924
 ]


Tom van der Woerdt edited comment on CASSANDRA-12245 at 7/20/16 2:17 PM:
-------------------------------------------------------------------------

Good point about the partial row updates, hadn't thought about that case. Could 
probably make it do the row retrieval conditionally based on whether all fields 
are in the sstable or not (most rows will be, thanks to compaction), but then 
you probably run into a lot of corner cases.

Doing vnodes in parallel can already give a massive boost (~25x in my case, I 
guess) so that'll do fine :)
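
For illustration only, here is a minimal sketch of what "doing vnodes in parallel" could look like. It is not Cassandra's view builder: `TokenRange`, `buildAll`, and the `buildRange` callback are hypothetical names, and the per-range work is left to the caller. It only shows one task per range being submitted to a fixed-size thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelViewBuildSketch {

    /** Hypothetical stand-in for one vnode's token range (start, end]. */
    static final class TokenRange {
        final long start;
        final long end;

        TokenRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Runs buildRange once per range on a fixed-size pool and waits for all
     * of them. buildRange is an assumption: it stands for "build the view
     * rows for this range" and must be safe to call concurrently.
     */
    static void buildAll(List<TokenRange> ranges, int threads, Consumer<TokenRange> buildRange) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (TokenRange range : ranges) {
                pending.add(pool.submit(() -> buildRange.accept(range)));
            }
            for (Future<?> task : pending) {
                task.get(); // rethrows a failure from any range
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("view build interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("view build failed for a range", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass the node's local ranges and a callback that does the per-range build. The achievable speedup depends on how many ranges there are and on disk and CPU headroom.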


was (Author: tvdw):
Good point about the partial row updates, hadn't thought about that case. Could 
probably make it do the row retrieval conditionally based on whether all fields 
are in the sstable or not (most rows will be, thanks to compaction), but then 
you probably run into a lot of corner cases.

> initial view build can be parallel
> ----------------------------------
>
>                 Key: CASSANDRA-12245
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tom van der Woerdt
>
> On a node with lots of data (~3TB) building a materialized view takes several 
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
>  * do vnodes in parallel, instead of going through the entire range in one 
> thread
>  * just iterate through sstables, not worrying about duplicates, and include 
> the timestamp of the original write in the MV mutation. Since this doesn't 
> exclude duplicates, it does increase the amount of work and could temporarily 
> surface ghost rows (yikes), but I guess that's why they call it eventual 
> consistency. Doing it this way avoids holding references to all tables on 
> disk, allows parallelization, and removes the need to check other sstables 
> for existing data. This is essentially the 'do a full repair' path
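
As a rough illustration of why carrying the original write timestamp makes duplicates harmless in the second option, the sketch below applies writes with last-write-wins semantics. Everything here is hypothetical (`TimestampedViewSketch`, `Cell`, `apply`); it models only a single value per key, with no tombstones and no changes to the view's primary key, so it does not reproduce the ghost-row case mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

final class TimestampedViewSketch {

    /** A value plus the timestamp of the original base-table write. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Cell> rows = new HashMap<>();

    /**
     * Applies a write using last-write-wins: the higher timestamp survives.
     * Ties are broken by comparing values so the result does not depend on
     * arrival order (a simplification chosen for this sketch).
     */
    void apply(String key, String value, long writeTimestamp) {
        Cell incoming = new Cell(value, writeTimestamp);
        rows.merge(key, incoming, (current, candidate) -> {
            if (candidate.timestamp != current.timestamp) {
                return candidate.timestamp > current.timestamp ? candidate : current;
            }
            return candidate.value.compareTo(current.value) > 0 ? candidate : current;
        });
    }

    /** Returns the surviving value for key, or null if none was written. */
    String get(String key) {
        Cell cell = rows.get(key);
        return cell == null ? null : cell.value;
    }
}
```

Replaying the same write twice, or replaying an older sstable's write after a newer one, leaves the stored value unchanged, which is the property that lets sstables be processed independently and in any order.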



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
