[
https://issues.apache.org/jira/browse/CASSANDRA-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516806#comment-17516806
]
Alex Petrov edited comment on CASSANDRA-17140 at 4/4/22 2:42 PM:
-----------------------------------------------------------------
bq. Unfortunately it is not the same amount of work on both side.
I think it may be less work to write a new test that does more than to keep
fixing this one. I'm not suggesting everyone do this; I'm saying I think it's
the more productive thing to do. In other words, the next dtest I have to fix,
I'll port right away. I realise my wording did not reflect that precisely;
sorry for not being specific.
bq. What did I miss?
I've mentioned above that this error should be transient. Nodes switch to the
new hash version asynchronously: there's no single moment when all nodes agree
on the version; instead, every node [flips
behaviour|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L636]
at its own time and starts returning the new hash. The first query with the
old hash (for example, immediately after a bounce) will fail with a digest
mismatch. However, while failing, it will also [prepare both
hashes|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L680-L683],
and the subsequent query will succeed. We also explained this [while working
on the
patch|https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#diff-3ce9f6c4626262d6a4be26ecf3a57ef11fdb55537448def0c736115417487115R438].
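To make that sequence concrete, here is a minimal client-side sketch using the
Python driver (the address, keyspace, table, and the broad {{except}} are
illustrative assumptions, and the timeline is compressed; in the test, the
statement is prepared before the bounce):
{code:python}
# Sketch: a statement prepared before the bounce carries the old hash.
# The first execution after a node flips may fail with a digest mismatch,
# but that failing request also prepares both hashes server-side, so
# re-executing the same statement succeeds.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")           # "ks" is a placeholder keyspace

stmt = session.prepare("SELECT * FROM t WHERE pk = ?")

try:
    session.execute(stmt, [1])            # may fail once with a digest mismatch
except Exception:                         # exact error type depends on the driver
    session.execute(stmt, [1])            # second attempt should succeed
{code}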
So I can see two ways out of this:
1. *Avoid mismatches.* Because [this
case|https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#diff-3ce9f6c4626262d6a4be26ecf3a57ef11fdb55537448def0c736115417487115R487]
is the only one where we won't change the hash, I was suggesting removing
{{USE}} (this would mean that during prepare and after upgrade
{{qualifiedWithKeyspace}} will be null, and we'll return
{{qualifiedWithoutKeyspace}} in both cases) and adding the keyspace to the
executed statements (see the first sketch after this list).
2. *Retry on mismatches.* Simply add a retry that tolerates failures while the
gossip state is diverged, and retry the failed query (see the second sketch
after this list).
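For option 1, the dtest change would look roughly like this (just a sketch;
the keyspace and table names are placeholders, not taken from the test):
{code:python}
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Before (sketch): relies on USE, so statements are prepared unqualified
session.execute("USE upgrade_ks")
old_style = session.prepare("SELECT * FROM cf WHERE pk = ?")

# After (sketch): drop USE and qualify the table instead, so the statement
# hashes the same way before and after the upgrade
new_style = session.prepare("SELECT * FROM upgrade_ks.cf WHERE pk = ?")
{code}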
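And option 2 as a helper in the dtest (again a sketch; the attempt count and
delay are arbitrary, and the broad {{except}} is an assumption since the exact
error type depends on the driver):
{code:python}
import time

def execute_with_retry(session, statement, parameters=None, attempts=3, delay=0.5):
    """Tolerate the transient digest mismatch during a rolling upgrade:
    a failing attempt re-prepares both hashes server-side, so a follow-up
    attempt is expected to succeed."""
    last_error = None
    for _ in range(attempts):
        try:
            return session.execute(statement, parameters)
        except Exception as exc:   # exact exception type depends on the driver
            last_error = exc
            time.sleep(delay)
    raise last_error
{code}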
In short, I believe there's no bug in the code: we validated this behaviour
and made sure it's transitory. It's just that at the moment I cannot allocate
time to fix this dtest, for which I'm sorry and hope you'll understand. I hope
you find this explanation satisfactory.
> Broken test_rolling_upgrade -
> upgrade_tests.upgrade_through_versions_test.TestUpgrade_indev_3_0_x_To_indev_4_0_x
> ----------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17140
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17140
> Project: Cassandra
> Issue Type: Bug
> Components: CI
> Reporter: Yifan Cai
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 4.0.x
>
>
> The test "test_rolling_upgrade" fails with the error below.
>
> [https://app.circleci.com/pipelines/github/yifan-c/cassandra/279/workflows/6340cd42-0b27-42c2-8418-9f8b56c57bea/jobs/1990]
>
> I can always reproduce it by running the test locally too.
> {{$ pytest --execute-upgrade-tests-only --upgrade-target-version-only
> --upgrade-version-selection all --cassandra-version=4.0
> upgrade_tests/upgrade_through_versions_test.py::TestUpgrade_indev_3_11_x_To_indev_4_0_x::test_rolling_upgrade}}
>
> {code:java}
> self = <upgrade_tests.upgrade_through_versions_test.TestUpgrade_indev_3_0_x_To_indev_4_0_x object at 0x7ffba4242fd0>
> subprocs = [<Process(Process-1, stopped[SIGKILL] daemon)>, <Process(Process-2, stopped[1] daemon)>]
>
>     def _check_on_subprocs(self, subprocs):
>         """
>         Check on given subprocesses.
>
>         If any are not alive, we'll go ahead and terminate any remaining
>         alive subprocesses since this test is going to fail.
>         """
>         subproc_statuses = [s.is_alive() for s in subprocs]
>         if not all(subproc_statuses):
>             message = "A subprocess has terminated early. Subprocess statuses: "
>             for s in subprocs:
>                 message += "{name} (is_alive: {aliveness}), ".format(name=s.name, aliveness=s.is_alive())
>             message += "attempting to terminate remaining subprocesses now."
>             self._terminate_subprocs()
> >           raise RuntimeError(message)
> E           RuntimeError: A subprocess has terminated early. Subprocess statuses: Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to terminate remaining subprocesses now.{code}