[
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126791#comment-16126791
]
Paulo Motta commented on CASSANDRA-12245:
-----------------------------------------
Thanks for the patch and sorry for the delay! I had an initial look at the patch;
overall it looks good. I have the following comments/questions/remarks:
bq. The newly created ViewBuilderController reads the local token ranges,
splits them to satisfy a concurrency factor, and runs a ViewBuilder for each of
them.
It would be nice to reuse the {{Splitter}} methods here if possible, so we can
also reuse their tests; if that's not straightforward, maybe move the splitting
methods to {{Splitter}} and add some tests to make sure they work correctly.
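Just to illustrate what I mean (this is only a conceptual sketch, not the patch's code, and the {{RangeSplitting}} name is made up), something along these lines is the kind of splitting logic that would fit naturally in {{Splitter}} where it can be unit tested:
{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.cassandra.dht.IPartitioner;
import org.apache.cassandra.dht.Range;
import org.apache.cassandra.dht.Token;

public final class RangeSplitting
{
    /**
     * Halve every local range until there are at least {@code parts} pieces,
     * so each piece can be handed to its own per-range view builder.
     */
    public static List<Range<Token>> split(Collection<Range<Token>> localRanges, int parts, IPartitioner partitioner)
    {
        List<Range<Token>> ranges = new ArrayList<>(localRanges);
        while (ranges.size() < parts)
        {
            List<Range<Token>> next = new ArrayList<>(ranges.size() * 2);
            for (Range<Token> range : ranges)
            {
                Token midpoint = partitioner.midpoint(range.left, range.right);
                if (midpoint.equals(range.left) || midpoint.equals(range.right))
                {
                    next.add(range); // too narrow to split any further
                    continue;
                }
                next.add(new Range<>(range.left, midpoint));
                next.add(new Range<>(midpoint, range.right));
            }
            if (next.size() == ranges.size())
                break; // nothing could be split, give up
            ranges = next;
        }
        return ranges;
    }
}
{code}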
bq. When ViewBuilderController receives the finalization signal of the last
ViewBuilder, it double-checks if there are new local ranges that weren't
considered at the beginning of the build. If there are new ranges, new
{{ViewBuilder}}s are created for them.
This will not work if the range movement that created the new local range
finishes after the view has finished building. This problem already exists and
is not really about the view build process itself, but about range movement
completion, which should ensure the views are properly built before the
operation finishes, so I created CASSANDRA-13762 to handle this properly.
bq. Given that we have a ViewBuilder per local range, the key of the table
system.views_builds_in_progress is modified to include the bounds of the token
range. So, we will have an entry in the table per each ViewBuilder. The number
of covered keys per range is also recorded in the table.
Can probably remove the generation field from the builds in progress table and
[remove this
comment|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L124]
bq. I have updated the patch to use a new separate table,
system.views_builds_in_progress_v2
{{views_builds_in_progress_v2}} sounds a bit hacky, so perhaps we should call
it {{system.view_builds_in_progress}} (dropping the extra s) and also add a
NEWS.txt entry informing users that the previous table was replaced and its
data files can be removed.
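For reference, the renamed table could look something like the sketch below, keyed by view and token range so that each per-range builder owns a single row (the column names here are just my guess, not a requirement for the patch):
{code:java}
// Sketch only: the exact columns and types are up to the patch.
static final String VIEW_BUILDS_IN_PROGRESS =
      "CREATE TABLE system.view_builds_in_progress ("
    + "  keyspace_name text,"
    + "  view_name text,"
    + "  start_token varchar,"
    + "  end_token varchar,"
    + "  last_token varchar,"
    + "  keys_built bigint,"
    + "  PRIMARY KEY ((keyspace_name, view_name), start_token, end_token))";
{code}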
bq. The downside is that pending view builds will be restarted during an
upgrade to 4.x, which seems reasonable to me.
Sounds reasonable to me too.
bq. ViewBuilder and ViewBuilderController are probably not the best names.
Maybe we could rename ViewBuilder to something like ViewBuilderTask or
ViewBuilderForRange, and rename ViewBuilderController to ViewBuilder.
{{ViewBuilder}} and {{ViewBuilderTask}} LGTM
bq. The concurrency factor is based on conf.concurrent_compactors because the
views are built on the CompactionManager, but we may be interested in a
different value.
I'm a bit concerned about starving the compaction executor for a long period
during the view build of large base tables, so we should probably have another
option like {{concurrent_view_builders}} with a conservative default and
perhaps control the concurrency at the {{ViewBuilderController}}. WDYT?
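Something along these lines is what I have in mind, as a sketch only: the {{concurrent_view_builders}} option doesn't exist yet, and in the real patch the pool size would come from {{Config}}/{{DatabaseDescriptor}} and probably use one of our own executor factories rather than a plain JDK pool.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class ViewBuildExecutor
{
    // Conservative default; would be driven by the proposed concurrent_view_builders yaml option.
    private static final int CONCURRENT_VIEW_BUILDERS =
        Integer.getInteger("cassandra.concurrent_view_builders", 1);

    // Dedicated pool so a long view build cannot starve the compaction executor.
    public static final ExecutorService instance =
        Executors.newFixedThreadPool(CONCURRENT_VIEW_BUILDERS, runnable -> {
            Thread thread = new Thread(runnable, "ViewBuildExecutor");
            thread.setDaemon(true);
            return thread;
        });
}
{code}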
bq. The patch tries to evenly split the token ranges in the minimum number of
parts to satisfy the concurrency factor, and it never merges ranges. So, with
the default 256 virtual nodes (and a lesser concurrency factor) we create 256
build tasks. We might be interested in a different planning. If we want the
number of tasks to be lesser than the number of local ranges we should modify
the ViewBuilder task to be responsible for several ranges, although it will
complicate the status tracking.
I think this is good to start with; we can improve the planning later if
necessary. I don't think there is much to gain from merging ranges into fewer,
larger tasks.
bq. Probably there is a better way of implementing
ViewBuilder.getCompactionInfo. The patch uses
keysBuilt/ColumnFamilyStore.estimatedKeysForRange to estimate the completion,
which could lead to a task completion status over 100%, depending on the
estimation.
How about using {{prevToken.size(range.right)}} (introduced by CASSANDRA-7032)?
This will not be available for BytesToken (used by ByteOrderedPartitioner), but
that partitioner is rarely used, so we could maybe fall back to the current
imprecise calculation in that case.
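Roughly what I'm suggesting, as a sketch ({{range}}, {{prevToken}} and {{keysBuilt}} are assumed to be the per-range builder's state):
{code:java}
// Estimate the total number of keys in this task's range from the fraction of
// the range already covered, so the reported completion never exceeds 100%.
private long estimatedTotalKeys()
{
    // Token.size() (CASSANDRA-7032) is not supported for BytesToken, so fall
    // back to the current, imprecise estimate in that case.
    if (prevToken == null || prevToken instanceof org.apache.cassandra.dht.ByteOrderedPartitioner.BytesToken)
        return Math.max(keysBuilt, baseCfs.estimatedKeysForRange(range));

    double rangeSize = range.left.size(range.right); // fraction of the ring in this task's range
    double covered = range.left.size(prevToken);     // fraction already processed
    if (covered <= 0 || rangeSize <= 0)
        return keysBuilt;

    return Math.max(keysBuilt, (long) (keysBuilt * (rangeSize / covered)));
}
{code}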
Other comments:
* Avoid submitting the view builder when the view is already built, instead of
checking it inside the {{ViewBuilder}}
([here|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L109])
* ViewBuilder seems to be reimplementing some of the logic of
{{PartitionRangeReadCommand}}, so I wonder if we should take this chance to
simplify and use that instead of manually constructing the commands via
ReducingKeyIterator and multiple {{SinglePartitionReadCommands}}? We can
totally do this in another ticket if you prefer.
* Perform view marking on ViewBuilderController [instead of
ViewBuilder|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L177]
* Updating the view build status [at every
key|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L163]
is a bit inefficient and unnecessary, so perhaps we should update it every
1000 keys or so (see the sketch after this list).
* Would be nice to update the {{interrupt_build_process_test}} to stop halfway
through the build (instead of at the start of the build) and verify it's being
resumed correctly with the new changes.
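For the checkpointing point above, a minimal sketch of what I mean (the constant, field names and persistence call are illustrative, not a prescription):
{code:java}
private static final int KEYS_BETWEEN_CHECKPOINTS = 1000;

// Persist the per-range build progress only every N keys instead of on every key.
private void maybeRecordProgress(Token lastToken)
{
    if (keysBuilt % KEYS_BETWEEN_CHECKPOINTS == 0)
        SystemKeyspace.updateViewBuildStatus(ksName, viewName, lastToken); // assumed persistence hook
}
{code}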
> initial view build can be parallel
> ----------------------------------
>
> Key: CASSANDRA-12245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
> Project: Cassandra
> Issue Type: Improvement
> Components: Materialized Views
> Reporter: Tom van der Woerdt
> Assignee: Andrés de la Peña
> Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
> * do vnodes in parallel, instead of going through the entire range in one
> thread
> * just iterate through sstables, not worrying about duplicates, and include
> the timestamp of the original write in the MV mutation. Since this doesn't
> exclude duplicates it does increase the amount of work and could temporarily
> surface ghost rows (yikes), but I guess that's why they call it eventual
> consistency. Doing it this way can avoid holding references to all tables on
> disk, allows parallelization, and removes the need to check other sstables
> for existing data. This is essentially the 'do a full repair' path.