[
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126791#comment-16126791
]
Paulo Motta commented on CASSANDRA-12245:
-----------------------------------------
Thanks for the patch and sorry for the delay! I had an initial look at the patch;
overall it looks good. I have the following comments/questions/remarks:
bq. The newly created ViewBuilderController reads the local token ranges,
splits them to satisfy a concurrency factor, and runs a ViewBuilder for each of
them.
It would be nice to reuse the {{Splitter}} methods here if possible, so we can
also reuse their tests; if that's not straightforward, maybe move the splitting
methods to {{Splitter}} and add some tests to make sure they work correctly.
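Just to illustrate what I mean (this is only a conceptual sketch, not the patch's code, and the {{RangeSplitting}} name is made up), something along these lines is the kind of splitting logic that would fit naturally in {{Splitter}} where it can be unit tested:
{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.cassandra.dht.IPartitioner;
import org.apache.cassandra.dht.Range;
import org.apache.cassandra.dht.Token;

public final class RangeSplitting
{
    /**
     * Halve every local range until there are at least {@code parts} pieces,
     * so each piece can be handed to its own per-range view builder.
     */
    public static List<Range<Token>> split(Collection<Range<Token>> localRanges, int parts, IPartitioner partitioner)
    {
        List<Range<Token>> ranges = new ArrayList<>(localRanges);
        while (ranges.size() < parts)
        {
            List<Range<Token>> next = new ArrayList<>(ranges.size() * 2);
            for (Range<Token> range : ranges)
            {
                Token midpoint = partitioner.midpoint(range.left, range.right);
                if (midpoint.equals(range.left) || midpoint.equals(range.right))
                {
                    next.add(range); // too narrow to split any further
                    continue;
                }
                next.add(new Range<>(range.left, midpoint));
                next.add(new Range<>(midpoint, range.right));
            }
            if (next.size() == ranges.size())
                break; // nothing could be split, give up
            ranges = next;
        }
        return ranges;
    }
}
{code}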
bq. When ViewBuilderController receives the finalization signal of the last
ViewBuilder, it double-checks if there are new local ranges that weren't
considered at the beginning of the build. If there are new ranges, new
{{ViewBuilder}}s are created for them.
This will not work if the range movement that created the new local range
finishes after the view has finished building. This problem already exists and
is not really about the view build process itself, but about range movement
completion, which should ensure the views are properly built before the
operation finishes, so I created CASSANDRA-13762 to handle this properly.
bq. Given that we have a ViewBuilder per local range, the key of the table
system.views_builds_in_progress is modified to include the bounds of the token
range. So, we will have an entry in the table per each ViewBuilder. The number
of covered keys per range is also recorded in the table.
Can probably remove the generation field from the builds in progress table and
[remove this
comment|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L124]
bq. I have updated the patch to use a new separate table,
system.views_builds_in_progress_v2
{{views_builds_in_progress_v2}} sounds a bit hacky, so perhaps we should call
it {{system.view_builds_in_progress}} (dropping the extra s) and also add a
NEWS.txt entry informing users that the previous table was replaced and its
data files can be removed.
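For reference, the renamed table could look something like the sketch below, keyed by view and token range so that each per-range builder owns a single row (the column names here are just my guess, not a requirement for the patch):
{code:java}
// Sketch only: the exact columns and types are up to the patch.
static final String VIEW_BUILDS_IN_PROGRESS =
      "CREATE TABLE system.view_builds_in_progress ("
    + "  keyspace_name text,"
    + "  view_name text,"
    + "  start_token varchar,"
    + "  end_token varchar,"
    + "  last_token varchar,"
    + "  keys_built bigint,"
    + "  PRIMARY KEY ((keyspace_name, view_name), start_token, end_token))";
{code}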
bq. The downside is that pending view builds will be restarted during an
upgrade to 4.x, which seems reasonable to me.
Sounds reasonable to me too.
bq. ViewBuilder and ViewBuilderController are probably not the best names.
Maybe we could rename ViewBuilder to something like ViewBuilderTask or
ViewBuilderForRange, and rename ViewBuilderController to ViewBuilder.
{{ViewBuilder}} and {{ViewBuilderTask}} LGTM
bq. The concurrency factor is based on conf.concurrent_compactors because the
views are built on the CompactionManager, but we may be interested in a
different value.
I'm a bit concerned about starving the compaction executor for a long period
during the view build of large base tables, so we should probably have another
option like {{concurrent_view_builders}} with a conservative default and
perhaps control the concurrency at the {{ViewBuilderController}}. WDYT?
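Something along these lines is what I have in mind, as a sketch only: the {{concurrent_view_builders}} option doesn't exist yet, and in the real patch the pool size would come from {{Config}}/{{DatabaseDescriptor}} and probably use one of our own executor factories rather than a plain JDK pool.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class ViewBuildExecutor
{
    // Conservative default; would be driven by the proposed concurrent_view_builders yaml option.
    private static final int CONCURRENT_VIEW_BUILDERS =
        Integer.getInteger("cassandra.concurrent_view_builders", 1);

    // Dedicated pool so a long view build cannot starve the compaction executor.
    public static final ExecutorService instance =
        Executors.newFixedThreadPool(CONCURRENT_VIEW_BUILDERS, runnable -> {
            Thread thread = new Thread(runnable, "ViewBuildExecutor");
            thread.setDaemon(true);
            return thread;
        });
}
{code}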
bq. The patch tries to evenly split the token ranges in the minimum number of
parts to satisfy the concurrency factor, and it never merges ranges. So, with
the default 256 virtual nodes (and a lesser concurrency factor) we create 256
build tasks. We might be interested in a different planning. If we want the
number of tasks to be lesser than the number of local ranges we should modify
the ViewBuilder task to be responsible for several ranges, although it will
complicate the status tracking.
I think this is good to start with; we can improve the planning later if
necessary. I don't think there is much to gain from merging ranges into fewer,
larger tasks.
bq. Probably there is a better way of implementing
ViewBuilder.getCompactionInfo. The patch uses
keysBuilt/ColumnFamilyStore.estimatedKeysForRange to estimate the completion,
which could lead to a task completion status over 100%, depending on the
estimation.
How about using {{prevToken.size(range.right)}} (introduced by CASSANDRA-7032)?
This will not be available for BytesToken (used by ByteOrderedPartitioner), but
that partitioner is rarely used, so we could maybe fall back to the current
imprecise calculation in that case.
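Roughly what I'm suggesting, as a sketch ({{range}}, {{prevToken}} and {{keysBuilt}} are assumed to be the per-range builder's state):
{code:java}
// Estimate the total number of keys in this task's range from the fraction of
// the range already covered, so the reported completion never exceeds 100%.
private long estimatedTotalKeys()
{
    // Token.size() (CASSANDRA-7032) is not supported for BytesToken, so fall
    // back to the current, imprecise estimate in that case.
    if (prevToken == null || prevToken instanceof org.apache.cassandra.dht.ByteOrderedPartitioner.BytesToken)
        return Math.max(keysBuilt, baseCfs.estimatedKeysForRange(range));

    double rangeSize = range.left.size(range.right); // fraction of the ring in this task's range
    double covered = range.left.size(prevToken);     // fraction already processed
    if (covered <= 0 || rangeSize <= 0)
        return keysBuilt;

    return Math.max(keysBuilt, (long) (keysBuilt * (rangeSize / covered)));
}
{code}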
Other comments:
* Avoid submitting the view builder when the view is already built, instead of
checking it inside the {{ViewBuilder}}
([here|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L109])
* ViewBuilder seems to be reimplementing some of the logic of
{{PartitionRangeReadCommand}}, so I wonder if we should take this chance to
simplify and use that instead of manually constructing the commands via
ReducingKeyIterator and multiple {{SinglePartitionReadCommands}}? We can
totally do this in another ticket if you prefer.
* Perform view marking on ViewBuilderController [instead of
ViewBuilder|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L177]
* Updating the view build status [at every
key|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L163]
is a bit inefficient and unnecessary, so perhaps we should update it every
1000 keys or so (see the sketch after this list).
* Would be nice to update the {{interrupt_build_process_test}} to stop halfway
through the build (instead of at the start of the build) and verify it's being
resumed correctly with the new changes.
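For the checkpointing point above, a minimal sketch of what I mean (the constant, field names and persistence call are illustrative, not a prescription):
{code:java}
private static final int KEYS_BETWEEN_CHECKPOINTS = 1000;

// Persist the per-range build progress only every N keys instead of on every key.
private void maybeRecordProgress(Token lastToken)
{
    if (keysBuilt % KEYS_BETWEEN_CHECKPOINTS == 0)
        SystemKeyspace.updateViewBuildStatus(ksName, viewName, lastToken); // assumed persistence hook
}
{code}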
> initial view build can be parallel
> ----------------------------------
>
> Key: CASSANDRA-12245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
> Project: Cassandra
> Issue Type: Improvement
> Components: Materialized Views
> Reporter: Tom van der Woerdt
> Assignee: Andrés de la Peña
> Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
> * do vnodes in parallel, instead of going through the entire range in one
> thread
> * just iterate through sstables, not worrying about duplicates, and include
> the timestamp of the original write in the MV mutation. Since this doesn't
> exclude duplicates it does increase the amount of work and could temporarily
> surface ghost rows (yikes), but I guess that's why they call it eventual
> consistency. Doing it this way can avoid holding references to all tables on
> disk, allows parallelization, and removes the need to check other sstables
> for existing data. This is essentially the 'do a full repair' path.