[
https://issues.apache.org/jira/browse/CASSANDRA-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516806#comment-17516806
]
Alex Petrov edited comment on CASSANDRA-17140 at 4/4/22 2:42 PM:
-----------------------------------------------------------------
bq. Unfortunately it is not the same amount of work on both side.
I think it may be less work to write a new test that does more than to keep
fixing this one. I'm not suggesting everyone do this; I'm saying I think it's
the more productive thing to do. In other words, the next dtest I have to fix,
I'll port right away. I realise my wording did not reflect that precisely;
sorry for not being specific.
bq. What did I miss?
I've mentioned above that this error should be transient. Nodes switch to the
new hash version asynchronously: there's no single moment when all nodes agree
on the version; instead, every node [flips
behaviour|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L636]
at its own time and starts returning the new hash. The first query with the
old hash (for example, immediately after a bounce) will fail with a digest
mismatch. However, while failing, it will also [prepare both
hashes|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L680-L683],
and the subsequent query will succeed. We also explained this [while working
on the
patch|https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#diff-3ce9f6c4626262d6a4be26ecf3a57ef11fdb55537448def0c736115417487115R438].
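To make that sequence concrete, here is a minimal client-side sketch using the
Python driver (the address, keyspace, table, and the broad {{except}} are
illustrative assumptions, and the timeline is compressed; in the test, the
statement is prepared before the bounce):
{code:python}
# Sketch: a statement prepared before the bounce carries the old hash.
# The first execution after a node flips may fail with a digest mismatch,
# but that failing request also prepares both hashes server-side, so
# re-executing the same statement succeeds.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")           # "ks" is a placeholder keyspace

stmt = session.prepare("SELECT * FROM t WHERE pk = ?")

try:
    session.execute(stmt, [1])            # may fail once with a digest mismatch
except Exception:                         # exact error type depends on the driver
    session.execute(stmt, [1])            # second attempt should succeed
{code}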
So I can see two ways out of this:
1. *Avoid mismatches.* Because [this
case|https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#diff-3ce9f6c4626262d6a4be26ecf3a57ef11fdb55537448def0c736115417487115R487]
is the only one where we won't change the hash, I was suggesting removing
{{USE}} (this would mean that during prepare and after upgrade
{{qualifiedWithKeyspace}} will be null, and we'll return
{{qualifiedWithoutKeyspace}} in both cases) and adding the keyspace to the
executed statements (see the first sketch after this list).
2. *Retry on mismatches.* Simply add a retry that tolerates failures while the
gossip state is diverged, and retry the failed query (see the second sketch
after this list).
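For option 1, the dtest change would look roughly like this (just a sketch;
the keyspace and table names are placeholders, not taken from the test):
{code:python}
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Before (sketch): relies on USE, so statements are prepared unqualified
session.execute("USE upgrade_ks")
old_style = session.prepare("SELECT * FROM cf WHERE pk = ?")

# After (sketch): drop USE and qualify the table instead, so the statement
# hashes the same way before and after the upgrade
new_style = session.prepare("SELECT * FROM upgrade_ks.cf WHERE pk = ?")
{code}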
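And option 2 as a helper in the dtest (again a sketch; the attempt count and
delay are arbitrary, and the broad {{except}} is an assumption since the exact
error type depends on the driver):
{code:python}
import time

def execute_with_retry(session, statement, parameters=None, attempts=3, delay=0.5):
    """Tolerate the transient digest mismatch during a rolling upgrade:
    a failing attempt re-prepares both hashes server-side, so a follow-up
    attempt is expected to succeed."""
    last_error = None
    for _ in range(attempts):
        try:
            return session.execute(statement, parameters)
        except Exception as exc:   # exact exception type depends on the driver
            last_error = exc
            time.sleep(delay)
    raise last_error
{code}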
In short, I believe there's no bug in the code: we validated this behaviour
and made sure it's transitory. It's just that at the moment I cannot allocate
time to fix this dtest, for which I'm sorry and hope you'll understand. I hope
you find this explanation satisfactory.
> Broken test_rolling_upgrade -
> upgrade_tests.upgrade_through_versions_test.TestUpgrade_indev_3_0_x_To_indev_4_0_x
> ----------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17140
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17140
> Project: Cassandra
> Issue Type: Bug
> Components: CI
> Reporter: Yifan Cai
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 4.0.x
>
>
> The test "test_rolling_upgrade" fails with the error below.
>
> [https://app.circleci.com/pipelines/github/yifan-c/cassandra/279/workflows/6340cd42-0b27-42c2-8418-9f8b56c57bea/jobs/1990]
>
> I can always reproduce it by running the test locally too.
> {{$ pytest --execute-upgrade-tests-only --upgrade-target-version-only
> --upgrade-version-selection all --cassandra-version=4.0
> upgrade_tests/upgrade_through_versions_test.py::TestUpgrade_indev_3_11_x_To_indev_4_0_x::test_rolling_upgrade}}
>
> {code:java}
> self = <upgrade_tests.upgrade_through_versions_test.TestUpgrade_indev_3_0_x_To_indev_4_0_x object at 0x7ffba4242fd0>
> subprocs = [<Process(Process-1, stopped[SIGKILL] daemon)>, <Process(Process-2, stopped[1] daemon)>]
>
>     def _check_on_subprocs(self, subprocs):
>         """
>         Check on given subprocesses.
>
>         If any are not alive, we'll go ahead and terminate any remaining
>         alive subprocesses since this test is going to fail.
>         """
>         subproc_statuses = [s.is_alive() for s in subprocs]
>         if not all(subproc_statuses):
>             message = "A subprocess has terminated early. Subprocess statuses: "
>             for s in subprocs:
>                 message += "{name} (is_alive: {aliveness}), ".format(name=s.name, aliveness=s.is_alive())
>             message += "attempting to terminate remaining subprocesses now."
>             self._terminate_subprocs()
> >           raise RuntimeError(message)
> E           RuntimeError: A subprocess has terminated early. Subprocess statuses: Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to terminate remaining subprocesses now.{code}