[
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171536#comment-16171536
]
Paulo Motta commented on CASSANDRA-12245:
-----------------------------------------
Finally getting to this after a while, sorry for the delay. Thanks for the
update! Had another look at the patch and it's looking much better; see some
follow-up comments below:
bq. I have moved the methods to split the ranges to the Splitter, reusing its
valueForToken method. Tests here.
Awesome, looks much better now! It seems like the way the number of tokens in a
range was computed, {{abs(range.right - range.left)}}, may not work correctly
for some wrap-around cases, as shown by [this test
case|https://github.com/pauloricardomg/cassandra/blob/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237/test/unit/org/apache/cassandra/dht/SplitterTest.java#L184].
Even though this shouldn't break when local ranges are used, I fixed it in
[this
commit|https://github.com/pauloricardomg/cassandra/commit/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237]
to make sure split works correctly for wrap-around ranges. Can you confirm
this is correct?
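For reference, here's a minimal sketch of the wrap-around handling idea (not the code from that commit), assuming a full ring of 2^64 long, Murmur3-style tokens; the class and method names are just illustrative:
{code:java}
import java.math.BigInteger;

public final class RangeSizeSketch
{
    // Assumed full ring of 2^64 long tokens; the real Splitter works on the
    // partitioner's own value range.
    private static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

    /** Number of tokens covered by (left, right], treating right <= left as wrap-around. */
    static BigInteger tokensInRange(long left, long right)
    {
        BigInteger size = BigInteger.valueOf(right).subtract(BigInteger.valueOf(left));
        // abs() would be wrong here: a non-positive difference means the range wraps
        // around the ring, so the true size is the difference plus the ring size.
        return size.signum() > 0 ? size : size.add(RING_SIZE);
    }

    public static void main(String[] args)
    {
        System.out.println(tokensInRange(10, 100));  // 90
        System.out.println(tokensInRange(100, 10));  // wrap-around: 2^64 - 90, not 90
    }
}
{code}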
Other than that, it seems like you added unit tests only for
{{Murmur3Partitioner}}; would you mind extending {{testSplit()}} to cover
{{RandomPartitioner}} as well?
bq. Agree. I have added a new dedicated executor in the CompactionManager,
similar to the executors used for validation and cache cleanup. The concurrency
of this executor is determined by the new config property
concurrent_materialized_view_builders, which defaults to a perhaps too
conservative value of 1. This property can be modified through both JMX and the
new setconcurrentviewbuilders and getconcurrentviewbuilders nodetool commands.
These commands are tested here.
I think having a dedicated executor will ensure view building doesn't compete
with compactions for the compaction executor, good job! One problem I see,
though, is that if a user finds view building slow and tries to increase the
number of concurrent view builders via nodetool, it will have no effect, since
the range was already split into the previous number of concurrent view
builders. Given this will be a pretty common scenario for large datasets, how
about splitting the range into multiple smaller tasks, so that if the user
increases {{concurrent_view_builders}} the remaining tasks immediately start
executing?
We could use a simple approach of splitting the local range into, let's say,
1000 hard-coded parts, or be smarter and make each split cover ~100MB or so.
This way we can keep {{concurrent_materialized_view_builders=1}} by default,
and users with large base tables can increase it and see an immediate effect
via nodetool. WDYT?
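To make the idea concrete, here's a minimal, self-contained sketch (plain Java with hypothetical names, not the patch itself): split the work into many small tasks up front and let a resizable executor control concurrency, so raising the thread count later takes effect on the tasks still queued.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SplitBuildSketch
{
    // Hypothetical hard-coded number of splits, analogous to the "~1000 parts" suggestion.
    private static final int NUM_SPLITS = 1000;

    public static void main(String[] args) throws Exception
    {
        // The core pool size plays the role of concurrent_materialized_view_builders.
        ThreadPoolExecutor viewBuildExecutor =
            new ThreadPoolExecutor(1, 1, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        long rangeStart = 0, rangeEnd = 1_000_000;   // stand-in for a local token range
        long step = (rangeEnd - rangeStart) / NUM_SPLITS;

        List<Future<?>> tasks = new ArrayList<>();
        for (int i = 0; i < NUM_SPLITS; i++)
        {
            long left = rangeStart + i * step;
            long right = (i == NUM_SPLITS - 1) ? rangeEnd : left + step;
            tasks.add(viewBuildExecutor.submit(() -> buildRange(left, right)));
        }

        // Later, something like "nodetool setconcurrentviewbuilders 4" would translate to:
        viewBuildExecutor.setMaximumPoolSize(4);
        viewBuildExecutor.setCorePoolSize(4);        // the queued splits now run 4 at a time

        for (Future<?> task : tasks)
            task.get();
        viewBuildExecutor.shutdown();
    }

    private static void buildRange(long left, long right)
    {
        // Placeholder for building the view for the keys in (left, right].
    }
}
{code}
With this shape the default can stay at 1, and the split count (or split size) decides the task granularity rather than the configured concurrency.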
bq. I would prefer to do this in another ticket.
Agreed.
bq. I have moved the marking of system tables (and the retries in case of
failure) from the ViewBuilderTask to the ViewBuilder, using a callback to do
the marking. I think the code is clearer this way.
Great, looks much cleaner indeed! One minor thing is that if there's a failure
after some {{ViewBuildTask}}s have completed, the build will resume those
subtasks from their last token even though they already finished. Could we
maybe set {{last_token = end_token}} when a task finishes, to flag that it
already finished and avoid resuming it in that case?
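Something along these lines, with hypothetical types and field names (not the actual system table code):
{code:java}
public class ViewBuildTaskStatusSketch
{
    static class TaskStatus
    {
        final long endToken;
        Long lastToken;   // null until the first key of the subtask has been built

        TaskStatus(long endToken) { this.endToken = endToken; }
    }

    /** Called when the subtask finishes: flag completion by setting last_token = end_token. */
    static void markFinished(TaskStatus status)
    {
        status.lastToken = status.endToken;
    }

    /** On restart, resume a subtask only if it never reached its end token. */
    static boolean shouldResume(TaskStatus status)
    {
        return status.lastToken == null || status.lastToken != status.endToken;
    }
}
{code}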
bq. Updated here. It also uses a byteman script to make sure that the MV build
isn't finished before stopping the cluster, which would otherwise be likely to happen.
The dtest looks mostly good, except for the following nits:
* {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can
you set the configuration value during the cluster setup phase (instead of via
nodetool) to make sure the restarted view builds run in parallel?
* You can probably use {{self._wait_for_view("ks", "t_by_v")}}
[here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L982]
* We cannot ensure key 10000 was not built
[here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L980],
which may cause flakiness, so it's probably better to check for something like
{{self.assertNotEqual(list(session.execute("SELECT count\(*\) FROM
t_by_v;"))[0][0], 10000)}}.
* It would be nice to check that the view build was actually resumed on
restart, by checking for the log entry {{Resuming view build for range}}
bq. I'm not sure if it still makes sense for the builder task to extend
CompactionInfo.Holder. If so, I'm also not sure how to use
prevToken.size(range.right) (which returns a double) to create CompactionInfo
objects. WDYT?
I think it's still useful to show view build progress via {{nodetool
compactionstats}}, so I created a new method {{Splitter.positionInRange(Token
token, Range<Token> range)}} which gives the position of a token relative to a
range, and used it to show view build progress when a splitter is present.
When there isn't one (as with {{ByteOrderedPartitioner}}), we fall back to
progress based on the keys estimate. This is implemented in [this
commit|https://github.com/pauloricardomg/cassandra/commit/bcdbca614e4e049c3b0c965ae0985a2dea2939d1].
Please let me know what you think about this approach; the logic is roughly as
sketched below.
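In outline (hypothetical names; the real patch uses {{Splitter.positionInRange}} and {{CompactionInfo}}):
{code:java}
public class ViewBuildProgressSketch
{
    // Stand-in for Splitter.positionInRange(token, range) bound to this task's range.
    interface PositionInRange
    {
        double position();
    }

    final PositionInRange splitterPosition;   // null when the partitioner has no splitter
    final long estimatedKeys;
    long keysBuilt;                           // updated as keys are processed

    ViewBuildProgressSketch(PositionInRange splitterPosition, long estimatedKeys)
    {
        this.splitterPosition = splitterPosition;
        this.estimatedKeys = estimatedKeys;
    }

    /** Progress in [0, 1] for display in e.g. nodetool compactionstats. */
    double progress()
    {
        if (splitterPosition != null)
            return splitterPosition.position();   // token-based progress within the range

        // No splitter (e.g. ByteOrderedPartitioner): fall back to the keys estimate.
        return estimatedKeys == 0 ? 0.0 : Math.min(1.0, (double) keysBuilt / estimatedKeys);
    }
}
{code}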
In addition to the suggestion above, I also made the following improvements:
* Select only sstables that fall into the {{ViewBuildTask}} range
([commit|https://github.com/pauloricardomg/cassandra/commit/3694c63678f6872477531825914a89a072f8f67d])
* Simplify the {{ViewBuilderTask}} loop
([commit|https://github.com/pauloricardomg/cassandra/commit/3c4e2ac65b7a3ca050450400392735fb3738ecd7])
* It's pretty rare, but in the case of collisions it's possible for multiple
keys to share the same token, so I updated the {{ViewBuilderTask}} loop to
build all keys sharing the same token
([commit|https://github.com/pauloricardomg/cassandra/commit/4ad0deb1de9d5dbc414c8d83d87256481e307d9e]);
see the sketch right after this list.
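Here is a minimal sketch of that collision handling, using plain longs as tokens and hypothetical helpers (buildKey, saveLastToken), not the actual {{ViewBuilderTask}} loop:
{code:java}
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TokenCollisionSketch
{
    static class Key { final long token; Key(long token) { this.token = token; } }

    /** Builds every key, persisting last_token only once all keys sharing it are built. */
    static void buildAll(Iterator<Key> keys)
    {
        Long pendingToken = null;                  // token whose keys are currently being built
        while (keys.hasNext())
        {
            Key key = keys.next();
            if (pendingToken != null && key.token != pendingToken)
            {
                // We moved past pendingToken, so every colliding key for it was built
                // and it is now safe to record it as last_token for resume.
                saveLastToken(pendingToken);
            }
            buildKey(key);                         // build even if the token repeats
            pendingToken = key.token;
        }
        if (pendingToken != null)
            saveLastToken(pendingToken);
    }

    static void buildKey(Key key) { /* apply the base-table partition to the view */ }
    static void saveLastToken(long token) { /* persist last_token for resume */ }

    public static void main(String[] args)
    {
        List<Key> keys = Arrays.asList(new Key(1), new Key(2), new Key(2), new Key(3));
        buildAll(keys.iterator());                 // builds 4 keys, including both with token 2
    }
}
{code}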
I created a [PR|https://github.com/adelapena/cassandra/pull/1] on your branch
with the above suggestions.
Even though the patch is looking good and has some dtest coverage, I feel that
we are still missing some unit testing to give us confidence that this is
working as desired and to catch any subtle regressions, given this is critical
for correct MV functioning. With that said, it would be nice if we could test
that {{ViewBuilderTask}} correctly builds a specific range, and maybe extend
{{ViewTest.testViewBuilderResume}} to test view building/resume with different
numbers of concurrent view builders. What do you think?
> initial view build can be parallel
> ----------------------------------
>
> Key: CASSANDRA-12245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
> Project: Cassandra
> Issue Type: Improvement
> Components: Materialized Views
> Reporter: Tom van der Woerdt
> Assignee: Andrés de la Peña
> Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
> * do vnodes in parallel, instead of going through the entire range in one
> thread
> * just iterate through sstables, not worrying about duplicates, and include
> the timestamp of the original write in the MV mutation. since this doesn't
> exclude duplicates it does increase the amount of work and could temporarily
> surface ghost rows (yikes) but I guess that's why they call it eventual
> consistency. doing it this way can avoid holding references to all tables on
> disk, allows parallelization, and removes the need to check other sstables
> for existing data. this is essentially the 'do a full repair' path