[ 
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom van der Woerdt updated CASSANDRA-12245:
-------------------------------------------
    Description: 
On a node with lots of data (~3TB), building a materialized view takes several 
weeks, which is not ideal. The build runs in a single thread.

There are several potential ways this can be optimized:
 * Do vnodes in parallel, instead of going through the entire range in one 
thread.
 * Just iterate through sstables, not worrying about duplicates, and include 
the timestamp of the original write in the MV mutation. Since this doesn't 
exclude duplicates, it does increase the amount of work and could temporarily 
surface ghost rows (yikes), but I guess that's why they call it eventual 
consistency. Doing it this way avoids holding references to all tables on 
disk, allows parallelization, and removes the need to check other sstables for 
existing data. This is essentially the 'do a full repair' path.

  was:
On a node with lots of data (~3TB), building a materialized view takes several 
weeks, which is not ideal. The build runs in a single thread.

There are several potential ways this can be optimized:
 * Do vnodes in parallel, instead of going through the entire range in one 
thread.
 * Just iterate through sstables, not worrying about duplicates, and include 
the timestamp of the original write in the MV mutation. Since this excludes 
duplicates, it does increase the amount of work and could temporarily surface 
ghost rows (yikes), but I guess that's why they call it eventual consistency. 
Doing it this way avoids holding references to all tables on disk, allows 
parallelization, and removes the need to check other sstables for existing 
data. This is essentially the 'do a full repair' path.


> initial view build can be parallel
> ----------------------------------
>
>                 Key: CASSANDRA-12245
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tom van der Woerdt
>
> On a node with lots of data (~3TB), building a materialized view takes several 
> weeks, which is not ideal. The build runs in a single thread.
> There are several potential ways this can be optimized:
>  * Do vnodes in parallel, instead of going through the entire range in one 
> thread.
>  * Just iterate through sstables, not worrying about duplicates, and include 
> the timestamp of the original write in the MV mutation. Since this doesn't 
> exclude duplicates, it does increase the amount of work and could temporarily 
> surface ghost rows (yikes), but I guess that's why they call it eventual 
> consistency. Doing it this way avoids holding references to all tables on 
> disk, allows parallelization, and removes the need to check other sstables 
> for existing data. This is essentially the 'do a full repair' path.
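
To make the second option concrete, here is a rough Python sketch of the idea, 
not Cassandra code. It assumes a toy model in which each "sstable" is just a 
list of (view_key, value, write_timestamp) tuples and the view resolves 
conflicts by keeping the cell with the highest timestamp. The names ToyView, 
build_from_sstable and build_view_parallel are made up for illustration. 
Because every mutation carries the original write timestamp, duplicates and 
out-of-order application across threads should converge to the same result.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class ToyView:
    """Hypothetical stand-in for a materialized view (not a Cassandra API)."""

    def __init__(self):
        self._rows = {}  # view_key -> (value, write_timestamp)
        self._lock = Lock()

    def apply(self, view_key, value, timestamp):
        # Last-write-wins: only a strictly newer timestamp replaces a cell,
        # so replaying a duplicate or an older write is a no-op.
        with self._lock:
            current = self._rows.get(view_key)
            if current is None or timestamp > current[1]:
                self._rows[view_key] = (value, timestamp)

    def snapshot(self):
        with self._lock:
            return dict(self._rows)


def build_from_sstable(view, sstable):
    """Replay one toy sstable into the view without consulting other sstables."""
    for view_key, value, timestamp in sstable:
        view.apply(view_key, value, timestamp)
    return len(sstable)


def build_view_parallel(sstables, workers=4):
    """Process each toy sstable on its own worker thread."""
    view = ToyView()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        applied = sum(pool.map(lambda s: build_from_sstable(view, s), sstables))
    return view.snapshot(), applied
```

Note that the sketch applies every mutation it sees, including duplicates, 
which is the extra work the description mentions. It does not model ghost 
rows, tombstones, or how ties between equal timestamps are resolved.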



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
