[
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171536#comment-16171536
]
Paulo Motta commented on CASSANDRA-12245:
-----------------------------------------
Finally getting to this after a while, sorry for the delay. Thanks for the
update! Had another look at the patch and it's looking much better; see some
follow-up comments below:
bq. I have moved the methods to split the ranges to the Splitter, reusing its
valueForToken method. Tests here.
Awesome, looks much better now! It seems like the way the number of tokens in a
range was computed, {{abs(range.right - range.left)}}, may not work correctly
for some wrap-around cases, as shown by [this test
case|https://github.com/pauloricardomg/cassandra/blob/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237/test/unit/org/apache/cassandra/dht/SplitterTest.java#L184].
Even though this shouldn't break when local ranges are used, I fixed it in
[this
commit|https://github.com/pauloricardomg/cassandra/commit/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237]
to make sure split works correctly for wrap-around ranges. Can you confirm
this is correct?
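For reference, here's a minimal sketch of the wrap-around handling idea (not the code from that commit), assuming a full ring of 2^64 long, Murmur3-style tokens; the class and method names are just illustrative:
{code:java}
import java.math.BigInteger;

public final class RangeSizeSketch
{
    // Assumed full ring of 2^64 long tokens; the real Splitter works on the
    // partitioner's own value range.
    private static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

    /** Number of tokens covered by (left, right], treating right <= left as wrap-around. */
    static BigInteger tokensInRange(long left, long right)
    {
        BigInteger size = BigInteger.valueOf(right).subtract(BigInteger.valueOf(left));
        // abs() would be wrong here: a non-positive difference means the range wraps
        // around the ring, so the true size is the difference plus the ring size.
        return size.signum() > 0 ? size : size.add(RING_SIZE);
    }

    public static void main(String[] args)
    {
        System.out.println(tokensInRange(10, 100));  // 90
        System.out.println(tokensInRange(100, 10));  // wrap-around: 2^64 - 90, not 90
    }
}
{code}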
Other than that, it seems like you added unit tests only for
{{Murmur3Partitioner}}; would you mind extending {{testSplit()}} to cover
{{RandomPartitioner}} as well?
bq. Agree. I have added a new dedicated executor in the CompactionManager,
similar to the executors used for validation and cache cleanup. The concurrency
of this executor is determined by the new config property
concurrent_materialized_view_builders, which defaults to a perhaps too
conservative value of 1. This property can be modified through both JMX and the
new setconcurrentviewbuilders and getconcurrentviewbuilders nodetool commands.
These commands are tested here.
I think having a dedicated executor will ensure view building doesn't compete
with compactions for the compaction executor, good job! One problem I see,
though, is that if a user finds view building slow and tries to increase the
number of concurrent view builders via nodetool, it will have no effect, since
the range was already split into the previous number of concurrent view
builders. Given this will be a pretty common scenario for large datasets, how
about splitting the range into multiple smaller tasks, so that if the user
increases {{concurrent_view_builders}} the remaining tasks immediately start
executing?
We could use a simple approach of splitting the local range into, let's say,
1000 hard-coded parts, or be smarter and make each split cover ~100MB or so.
This way we can keep {{concurrent_materialized_view_builders=1}} by default,
and users with large base tables can increase it and see an immediate effect
via nodetool. WDYT?
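To make the idea concrete, here's a minimal, self-contained sketch (plain Java with hypothetical names, not the patch itself): split the work into many small tasks up front and let a resizable executor control concurrency, so raising the thread count later takes effect on the tasks still queued.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SplitBuildSketch
{
    // Hypothetical hard-coded number of splits, analogous to the "~1000 parts" suggestion.
    private static final int NUM_SPLITS = 1000;

    public static void main(String[] args) throws Exception
    {
        // The core pool size plays the role of concurrent_materialized_view_builders.
        ThreadPoolExecutor viewBuildExecutor =
            new ThreadPoolExecutor(1, 1, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        long rangeStart = 0, rangeEnd = 1_000_000;   // stand-in for a local token range
        long step = (rangeEnd - rangeStart) / NUM_SPLITS;

        List<Future<?>> tasks = new ArrayList<>();
        for (int i = 0; i < NUM_SPLITS; i++)
        {
            long left = rangeStart + i * step;
            long right = (i == NUM_SPLITS - 1) ? rangeEnd : left + step;
            tasks.add(viewBuildExecutor.submit(() -> buildRange(left, right)));
        }

        // Later, something like "nodetool setconcurrentviewbuilders 4" would translate to:
        viewBuildExecutor.setMaximumPoolSize(4);
        viewBuildExecutor.setCorePoolSize(4);        // the queued splits now run 4 at a time

        for (Future<?> task : tasks)
            task.get();
        viewBuildExecutor.shutdown();
    }

    private static void buildRange(long left, long right)
    {
        // Placeholder for building the view for the keys in (left, right].
    }
}
{code}
With this shape the default can stay at 1, and the split count (or split size) decides the task granularity rather than the configured concurrency.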
bq. I would prefer to do this in another ticket.
Agreed.
bq. I have moved the marking of system tables (and the retries in case of
failure) from the ViewBuilderTask to the ViewBuilder, using a callback to do
the marking. I think the code is clearer this way.
Great, looks much cleaner indeed! One minor thing is that if there's a failure
after some {{ViewBuildTask}}s have completed, the build will resume those
subtasks from their last token even though they already finished. Could we
maybe set {{last_token = end_token}} when a task finishes, to flag that it
already finished and avoid resuming it in that case?
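Something along these lines, with hypothetical types and field names (not the actual system table code):
{code:java}
public class ViewBuildTaskStatusSketch
{
    static class TaskStatus
    {
        final long endToken;
        Long lastToken;   // null until the first key of the subtask has been built

        TaskStatus(long endToken) { this.endToken = endToken; }
    }

    /** Called when the subtask finishes: flag completion by setting last_token = end_token. */
    static void markFinished(TaskStatus status)
    {
        status.lastToken = status.endToken;
    }

    /** On restart, resume a subtask only if it never reached its end token. */
    static boolean shouldResume(TaskStatus status)
    {
        return status.lastToken == null || status.lastToken != status.endToken;
    }
}
{code}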
bq. Updated here. It also uses a byteman script to make sure that the MV build
isn't finished before stopping the cluster, which would otherwise be likely to happen.
The dtest looks mostly good, except for the following nits:
* {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can
you set the configuration value during the cluster setup phase (instead of via
nodetool) to make sure the restarted view builds run in parallel?
* You can probably use {{self._wait_for_view("ks", "t_by_v")}}
[here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L982]
* We cannot ensure key 10000 was not built
[here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L980],
which may cause flakiness, so it's probably better to check for something like
{{self.assertNotEqual(list(session.execute("SELECT count\(*\) FROM
t_by_v;"))[0][0], 10000)}}.
* It would be nice to check that the view build was actually resumed on
restart, by checking for the log entry {{Resuming view build for range}}
bq. I'm not sure if it still makes sense for the builder task to extend
CompactionInfo.Holder. If so, I'm also not sure how to use
prevToken.size(range.right) (which returns a double) to create CompactionInfo
objects. WDYT?
I think it's still useful to show view build progress via {{nodetool
compactionstats}}, so I created a new method {{Splitter.positionInRange(Token
token, Range<Token> range)}} which gives the position of a token relative to a
range, and used it to show view build progress when a splitter is present.
When there isn't one (as with {{ByteOrderedPartitioner}}), we fall back to
progress based on the keys estimate. This is implemented in [this
commit|https://github.com/pauloricardomg/cassandra/commit/bcdbca614e4e049c3b0c965ae0985a2dea2939d1].
Please let me know what you think about this approach; the logic is roughly as
sketched below.
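In outline (hypothetical names; the real patch uses {{Splitter.positionInRange}} and {{CompactionInfo}}):
{code:java}
public class ViewBuildProgressSketch
{
    // Stand-in for Splitter.positionInRange(token, range) bound to this task's range.
    interface PositionInRange
    {
        double position();
    }

    final PositionInRange splitterPosition;   // null when the partitioner has no splitter
    final long estimatedKeys;
    long keysBuilt;                           // updated as keys are processed

    ViewBuildProgressSketch(PositionInRange splitterPosition, long estimatedKeys)
    {
        this.splitterPosition = splitterPosition;
        this.estimatedKeys = estimatedKeys;
    }

    /** Progress in [0, 1] for display in e.g. nodetool compactionstats. */
    double progress()
    {
        if (splitterPosition != null)
            return splitterPosition.position();   // token-based progress within the range

        // No splitter (e.g. ByteOrderedPartitioner): fall back to the keys estimate.
        return estimatedKeys == 0 ? 0.0 : Math.min(1.0, (double) keysBuilt / estimatedKeys);
    }
}
{code}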
In addition to the suggestion above, I also made the following improvements:
* Select only sstables that fall into the {{ViewBuildTask}} range
([commit|https://github.com/pauloricardomg/cassandra/commit/3694c63678f6872477531825914a89a072f8f67d])
* Simplify the {{ViewBuilderTask}} loop
([commit|https://github.com/pauloricardomg/cassandra/commit/3c4e2ac65b7a3ca050450400392735fb3738ecd7])
* It's pretty rare, but in the case of collisions it's possible for multiple
keys to share the same token, so I updated the {{ViewBuilderTask}} loop to
build all keys sharing the same token
([commit|https://github.com/pauloricardomg/cassandra/commit/4ad0deb1de9d5dbc414c8d83d87256481e307d9e]);
see the sketch right after this list.
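Here is a minimal sketch of that collision handling, using plain longs as tokens and hypothetical helpers (buildKey, saveLastToken), not the actual {{ViewBuilderTask}} loop:
{code:java}
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TokenCollisionSketch
{
    static class Key { final long token; Key(long token) { this.token = token; } }

    /** Builds every key, persisting last_token only once all keys sharing it are built. */
    static void buildAll(Iterator<Key> keys)
    {
        Long pendingToken = null;                  // token whose keys are currently being built
        while (keys.hasNext())
        {
            Key key = keys.next();
            if (pendingToken != null && key.token != pendingToken)
            {
                // We moved past pendingToken, so every colliding key for it was built
                // and it is now safe to record it as last_token for resume.
                saveLastToken(pendingToken);
            }
            buildKey(key);                         // build even if the token repeats
            pendingToken = key.token;
        }
        if (pendingToken != null)
            saveLastToken(pendingToken);
    }

    static void buildKey(Key key) { /* apply the base-table partition to the view */ }
    static void saveLastToken(long token) { /* persist last_token for resume */ }

    public static void main(String[] args)
    {
        List<Key> keys = Arrays.asList(new Key(1), new Key(2), new Key(2), new Key(3));
        buildAll(keys.iterator());                 // builds 4 keys, including both with token 2
    }
}
{code}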
I created a [PR|https://github.com/adelapena/cassandra/pull/1] on your branch
with the above suggestions.
Even though the patch is looking good and has some dtest coverage, I feel that
we are still missing some unit testing to give us confidence that this is
working as desired and to catch any subtle regressions, given this is critical
for correct MV functioning. With that said, it would be nice if we could test
that {{ViewBuilderTask}} correctly builds a specific range, and maybe extend
{{ViewTest.testViewBuilderResume}} to test view building/resume with different
numbers of concurrent view builders. What do you think?
> initial view build can be parallel
> ----------------------------------
>
> Key: CASSANDRA-12245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
> Project: Cassandra
> Issue Type: Improvement
> Components: Materialized Views
> Reporter: Tom van der Woerdt
> Assignee: Andrés de la Peña
> Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
> * do vnodes in parallel, instead of going through the entire range in one
> thread
> * just iterate through sstables, not worrying about duplicates, and include
> the timestamp of the original write in the MV mutation. since this doesn't
> exclude duplicates it does increase the amount of work and could temporarily
> surface ghost rows (yikes) but I guess that's why they call it eventual
> consistency. doing it this way can avoid holding references to all tables on
> disk, allows parallelization, and removes the need to check other sstables
> for existing data. this is essentially the 'do a full repair' path