[
https://issues.apache.org/jira/browse/CASSANDRA-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Tunnicliffe updated CASSANDRA-11729:
----------------------------------------
Attachment: node3_debug.log.gz
node2_debug.log.gz
node1_debug.log.gz
This isn't actually related to indexes, but is highlighting a race condition
which is pretty pervasive. You can see from the stacktrace that the assertion
error is actually being thrown from a lambda defined in
{{CassandraDaemon::setup}}, which only runs when a node is started. From
inspection of the code and logs, what seems to be happening is this:
* At startup node1 creates a task to submit rebuilds of all MVs in all
keyspaces and submits it to the {{OptionalTasks}} executor to run after
{{RING_DELAY}}.
* While this is still pending, all 3 nodes finish startup and proceed with the
test, creating and then dropping the {{ks}} keyspace.
* It so happens that all of the "DROP KEYSPACE" statements hit node3 as the
coordinator. From its log, we can see that the 4th of these executes at
{{00:33:56,585}}, so shortly after that point, it pushes a defs change to node1
and node2.
* Back on node1, the MV building runnable is executed where it calls
{{Keyspace::all}} and begins to iterate the keyspaces, submitting MV builds.
This is where the race occurs. {{Keyspace::all}} provides an iterable of
{{Keyspace}} instances by transforming the key set of
{{Schema.instance.keyspaces}}, using {{Keyspace::open}} as the transformation
function. Concurrently, processing the schema update pushed by node3 follows
the path
{code}
SchemaKeyspace::mergeSchema
-> Schema.instance.dropKeyspace
-> Schema.instance.clearKeyspaceMetadata
-> Schema.instance.keyspaces.remove
{code}
If the removal from {{Schema.instance.keyspaces}} happens after the
transforming iterable has read the keyspace name from the key set, but before it
attempts to open the {{Keyspace}}, the assertion error is thrown.
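The interleaving described above can be replayed deterministically on a single thread. The following is only a minimal sketch, not the Cassandra code: {{KeyspaceRaceSketch}}, its {{keyspaces}} map, {{open}}, {{all}} and {{reproduceRace}} are hypothetical stand-ins, and a {{ConcurrentHashMap}} is assumed in place of whatever map type backs {{Schema.instance.keyspaces}}.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the race; every name here is hypothetical, not Cassandra's actual class. */
final class KeyspaceRaceSketch {
    // Stand-in for Schema.instance.keyspaces: keyspace name -> metadata.
    static final Map<String, String> keyspaces = new ConcurrentHashMap<>();

    // Stand-in for Keyspace::open: fails if the schema no longer knows the name.
    static String open(String name) {
        if (keyspaces.get(name) == null)
            throw new AssertionError("Unknown keyspace " + name);
        return "Keyspace(" + name + ")";
    }

    // Stand-in for Keyspace::all: a lazy view over the key set.
    // open() only runs when each element is actually consumed.
    static Iterable<String> all() {
        return () -> {
            Iterator<String> names = keyspaces.keySet().iterator();
            return new Iterator<String>() {
                @Override public boolean hasNext() { return names.hasNext(); }
                @Override public String next() { return open(names.next()); }
            };
        };
    }

    // Normal case: nothing is dropped while iterating.
    static List<String> openAll() {
        List<String> opened = new ArrayList<>();
        for (String ks : all())
            opened.add(ks);
        return opened;
    }

    // Replays the bad interleaving; returns true if the assertion error fires.
    static boolean reproduceRace() {
        keyspaces.clear();
        keyspaces.put("ks", "metadata");
        Iterator<String> it = all().iterator(); // MV build task starts iterating
        boolean sawName = it.hasNext();         // the name "ks" is visible in the key set
        keyspaces.remove("ks");                 // schema merge drops the keyspace
        try {
            it.next();                          // open("ks") now finds no metadata
            return false;
        } catch (AssertionError expected) {
            return sawName;
        }
    }
}
```

Because the key set iterator is weakly consistent, it still hands out the name it has already seen after the entry is removed, so the failure depends only on whether the removal lands between reading the name and opening the keyspace.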
This is really a deep-rooted problem: schema is not properly safe under
any level of concurrency. {{Keyspace::all}} has many callsites, all of which
are potentially vulnerable to this, and fixing that properly should be done as a
subtask of CASSANDRA-9424.
[~iamaleksey], I don't think that any of the existing subtasks fully capture
this. Do you think it may fit in CASSANDRA-9425, or do you think a new ticket
is called for?
[~philipthompson], is the best thing to do here just to mark the test as flaky
for now?
> dtest failure in
> secondary_indexes_test.TestSecondaryIndexes.test_6924_dropping_ks
> ----------------------------------------------------------------------------------
>
> Key: CASSANDRA-11729
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11729
> Project: Cassandra
> Issue Type: Bug
> Reporter: Russ Hatch
> Assignee: Sam Tunnicliffe
> Labels: dtest
> Fix For: 3.x
>
> Attachments: node1_debug.log.gz, node2_debug.log.gz,
> node3_debug.log.gz
>
>
> looks to be a single flap. might be worth trying to reproduce. example
> failure:
> http://cassci.datastax.com/job/trunk_dtest/1204/testReport/secondary_indexes_test/TestSecondaryIndexes/test_6924_dropping_ks
> Failed on CassCI build trunk_dtest #1204