[jira] [Commented] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080555#comment-17080555 ] Kevin Gallardo commented on CASSANDRA-15712:

Note that it would be better to solve https://issues.apache.org/jira/browse/CASSANDRA-15686 before this issue.

> Introduce MIDRES config in CircleCI
> ---
>
> Key: CASSANDRA-15712
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15712
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest, Test/unit
> Reporter: Kevin Gallardo
> Priority: Normal
>
> From document: [https://gist.github.com/newkek/bb79dccbe7d2f5e41b2a3daac3858fde]
> We have identified that the current HIGHRES configuration seems to require resources that might not bring the best cost efficiency to the build.
> We have also identified several "good compromise" configurations that allow us to get decent performance out of the test suites without going all out on the big config.
> It seems it would be useful for a lot of people to have this available as a {{config.yml.MIDRES}} configuration in the {{.circleci}} folder. This way we do not need to argue about modifying the {{HIGHRES}} configuration, so as to not impact the people already using it, but can still have easy access to the "compromise" config.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080556#comment-17080556 ] Kevin Gallardo commented on CASSANDRA-15712:

[~dcapwell] FYI, if you have any feedback.
[jira] [Updated] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15712:
---
Description:
From document: [https://gist.github.com/newkek/bb79dccbe7d2f5e41b2a3daac3858fde]
We have identified that the current HIGHRES configuration seems to require resources that might not bring the best cost efficiency to the build.
We have also identified several "good compromise" configurations that allow us to get decent performance out of the test suites without going all out on the big config.
It seems it would be useful for a lot of people to have this available as a {{config.yml.MIDRES}} configuration in the {{.circleci}} folder. This way we do not need to argue about modifying the {{HIGHRES}} configuration, so as to not impact the people already using it, but can still have easy access to the "compromise" config.
[jira] [Updated] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15712:
---
Workflow: Cassandra Default Workflow (was: Cassandra Bug Workflow)
Issue Type: Improvement (was: Bug)
[jira] [Created] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
Kevin Gallardo created CASSANDRA-15712:
---
Summary: Introduce MIDRES config in CircleCI
Key: CASSANDRA-15712
URL: https://issues.apache.org/jira/browse/CASSANDRA-15712
Project: Cassandra
Issue Type: Bug
Components: Test/dtest, Test/unit
Reporter: Kevin Gallardo
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080546#comment-17080546 ] Kevin Gallardo commented on CASSANDRA-15686:

{quote}and saw more failures than normal (normally ~1-3 failures, had 9; most known flaky tests)
{quote}
Hm, that's a bummer. I think I saw similar failures when testing too, but did not investigate further and thought they were just flaky. Do you think these could be fixed, or put in another bucket that runs in a separate non-parallel suite?

For reference, these are the tests I ran a little while ago (about 2 weeks) for the *unit* tests only, parallelizing the tests on the current LOWRES config:
- [Run with p=4 and i=medium with no change to the current build (which means it uses 1 runner by default)|https://app.circleci.com/pipelines/github/newkek/cassandra/4/workflows/16897387-cbf0-49e8-82bc-327d75885256]
- [Same hardware, but hardcoding {{-Dtest.runners=2}}|https://app.circleci.com/pipelines/github/newkek/cassandra/5/workflows/56b341b5-8f74-414e-852e-52cba8acd28a]

You can see the runtime goes from 19 min to 12 min.
I am looking at old experiments I ran, and it seems I was able to run with 8 runners on the unit tests; everything reported looks like failures from flaky tests: [run|https://app.circleci.com/pipelines/github/newkek/cassandra/22/workflows/68bf0582-2b43-4b71-a610-e1e76c6ff418/jobs/90] and [corresponding config|https://github.com/newkek/cassandra/blob/00103c51d3c6b1fd0c534b649e146bbe45bcdf11/.circleci/config.yml#L975]

> Improvements in circle CI default config
>
> Key: CASSANDRA-15686
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15686
> Project: Cassandra
> Issue Type: Bug
> Components: Build
> Reporter: Kevin Gallardo
> Priority: Normal
>
> I have looked at and played around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml]; a few comments/questions on the following topics:
> * Python dtests do not run successfully (200-300 failures) on {{medium}} instances; they seem to run with only small flaky failures on {{large}} instances or higher
> * Python upgrade tests:
> ** Do not seem to run without many failures on any instance type / any parallelism setting
> ** Do not seem to parallelize well; it seems each container downloads multiple C* versions
> ** Additionally, the configuration seems out of date, as we currently get errors because {{JAVA8_HOME}} is not set
> * Unit tests do not seem to parallelize optimally: the number of test runners does not reflect the available CPUs on the container. Ideally, if # of runners == # of CPUs, build time is improved on any type of instance.
> ** For instance, with the current configuration running on medium instances, the build uses 1 junit test runner while 2 CPUs are available. Using 2 runners reduces the build time from 19 min (at the current main config of parallelism=4) to 12 min.
> * There are some typos in the file: some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386])
>
> Some ways to process these would be:
> * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully?
> * Does anybody ever run the Python upgrade tests on CircleCI, and what is the configuration that makes them work?
> * Would it make sense to either hardcode the number of test runners for the unit tests with {{-Dtest.runners}} in the config file to reflect the number of CPUs on the instances, or change the build so that it can detect the appropriate number of cores available automatically?
>
> Additionally, it seems this default config file (config.yml) is not as well maintained as the [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] (+ its lowres/highres versions) in the same folder (from CASSANDRA-14806). What is the reasoning for maintaining these 2 versions of the build? Could the better-maintained version be used as the default? We could generate a lowres version of the new config-2_1.yml and rename it {{config.yml}} so that it gets picked up by CircleCI automatically instead of the current default.
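The runner-count suggestion above can be sketched in a few lines. This is illustrative only: the {{RunnerCount}} class is hypothetical, and {{test.runners}} is simply the property mentioned in the discussion, not an API this snippet knows anything about.

```java
// Hypothetical sketch: derive the junit runner count from the CPUs actually
// available on the container, instead of hardcoding -Dtest.runners.
public class RunnerCount {
    public static int testRunners() {
        // On a CircleCI "medium" container this would return 2, so two
        // runners would be used instead of the single default runner.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // Print the flag that a build script could append to the ant invocation.
        System.out.println("-Dtest.runners=" + testRunners());
    }
}
```

A build step could then pass that flag to the test target, matching the {{-Dtest.runners=2}} experiment mentioned earlier, without hardcoding a value per instance size.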
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078645#comment-17078645 ] Kevin Gallardo commented on CASSANDRA-15642:

I see; the differentiation "waiting until we can guarantee we can never succeed" makes more sense presented this way. Although when testing for CASSANDRA-15543, {{blockFor()}} and {{cassandraReplicaCount()}} were both the same, which means we would still fail as soon as n >= 1, bringing it back to the same conclusion.

bq. if you want to file a ticket for it I think I can make us both happy: we should always fail a query as soon as we know it cannot succeed [...]

I think we still miss out on potential information that could be returned to the user and improve usability, but I have presented my arguments already, so I won't keep insisting. The case of the schema agreement error is still, to me, a clear situation where things could be improved. But if anything, I would hope there were a place where this sort of behavior is documented and explained, rather than users having to discover it by themselves in unfortunate circumstances, or having to go through the code.

Also, the "speculative read" (or, put more simply, a "retry", iiuc?) seems like a good idea from my POV. It would be an improvement, though I'm thinking the drivers already provide this sort of utility, well customizable by users, compared to a server-side solution, so I suppose it has upsides and downsides. But something that makes completing a request more robust seems like a good idea regardless.
> Inconsistent failure messages on distributed queries
>
> Key: CASSANDRA-15642
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15642
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Coordination
> Reporter: Kevin Gallardo
> Priority: Normal
>
> As a follow-up to some exploration I did for CASSANDRA-15543, I noticed the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}:
> - await responses
> - when the required number of responses has come back: unblock the wait
> - when a single failure happens: unblock the wait
> - when unblocked, check whether the counter of failures is > 1 and, if so, return an error message based on the {{failures}} map that has been filled
> Error messages that can result from this behavior are a ReadTimeout, a ReadFailure, a WriteTimeout, or a WriteFailure.
> In case of a Write/ReadFailure, the user gets back an error looking like the following:
> "Failure: Received X responses, and Y failures"
> (if the behavior I describe is incorrect, please correct me)
> This causes a usability problem. Since the handler fails and throws an exception as soon as 1 failure happens, the error message returned to the user may not be accurate.
> (note: I am not entirely sure of the behavior in case of timeouts for now)
> For example, for a request at CL = QUORUM = 3, a failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say
> "Failure: Received 0 responses, and 1 failure at CL = 3"
> Which:
> 1. doesn't make a lot of sense, because the CL doesn't match the number of results in the message, so you end up wondering "what happened with the rest of the required CL?"
> 2. is incorrect. We did receive a successful response; it only came after the initial failure.
> From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. The only information users should extract from it is that at least 1 node has failed.
> For a big improvement in usability, {{ReadCallback}} and {{AbstractWriteResponseHandler}} could instead wait for all responses to come back before unblocking the wait, or let it time out. This way, the users will be able to place some trust in the information returned to them.
> Additionally, an error that happens first prevents a timeout from happening, because the request fails immediately, and so it potentially hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we would also be able to tell whether failures happened *before* that timeout, and have a more complete diagnostic where currently you can't detect both errors at the same time.
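The fail-fast behavior described above, and the proposed wait-for-all alternative, can be contrasted with a small sketch. This is not Cassandra's actual {{ReadCallback}} code; the class, fields, and the single-latch design are illustrative assumptions, and timeout handling is omitted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not the real ReadCallback): compares the current
// fail-fast unblocking with the proposed "wait for every replica" behavior.
class ResponseTracker {
    final int blockFor;        // responses required by the consistency level
    final int totalReplicas;   // replicas the request was sent to
    final boolean failFast;    // true = current behavior, false = proposed
    final AtomicInteger successes = new AtomicInteger();
    final AtomicInteger failures = new AtomicInteger();
    final CountDownLatch done = new CountDownLatch(1);

    ResponseTracker(int blockFor, int totalReplicas, boolean failFast) {
        this.blockFor = blockFor;
        this.totalReplicas = totalReplicas;
        this.failFast = failFast;
    }

    void onSuccess() { successes.incrementAndGet(); maybeUnblock(); }
    void onFailure() { failures.incrementAndGet(); maybeUnblock(); }

    private void maybeUnblock() {
        int s = successes.get(), f = failures.get();
        if (s >= blockFor)
            done.countDown();                 // consistency level met
        else if (failFast && f > 0)
            done.countDown();                 // current: unblock on first failure
        else if (s + f == totalReplicas)
            done.countDown();                 // proposed: heard from everyone
    }

    String summary() {
        return "Received " + successes.get() + " responses and "
             + failures.get() + " failures (blockFor=" + blockFor + ")";
    }
}
```

With blockFor=2 over 3 replicas and a failure arriving first, the fail-fast tracker unblocks immediately and reports "Received 0 responses and 1 failures", while the wait-for-all tracker stays blocked until the remaining replicas answer (or a timeout fires, which this sketch omits), so its summary can reflect every replica's outcome.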
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078372#comment-17078372 ] Kevin Gallardo commented on CASSANDRA-15686:

bq. Unit tests should be able to complete on Circle CI with medium instances. The unit tests are run automatically, so that's the default.

JVM dtests also run by default FWIW; they are able to complete on medium instances, as opposed to the Python dtests.

bq. Pretty much all others won't due to limited resources. You shouldn't try running them, but if you do, then we can't guarantee they may complete unless high resource settings are used.

Right, that was my understanding initially. Those tests are available in the lowres config, though, and there is no guidance or document saying they are unreliable on the low config, so people get confused as to why the Python dtests, for example, run with 200-300 failures when using the default config. I was suggesting here to either bump up the default config so that they can be run successfully, or remove them from the config since they can't be run (and only keep them in the HIGHRES file?), or maybe document this better. I understood David recommended fixing the tests so they run on medium instances too; that sounds reasonable to me as well if it makes things less confusing in the long run.
Thanks for the input.
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077652#comment-17077652 ] Kevin Gallardo commented on CASSANDRA-15642:

Hm, maybe I am not being clear on the proposition. Currently the code waits for {{blockFor}} responses, unless there's a failure. The proposed behavior in case of successes doesn't change: you would wait for the CL required, and if there are only successes, it's a success. If there are failures and successes, it's a failure, like now, except it would provide the complete list of who failed and who succeeded. Right now the server doesn't wait to learn which requests fail and which succeed; it fails fast on at least 1 failure, and if the failure happens first, the rest is ignored.

bq. Is that an argument, or a belief?

Well, it was probably both, now that you point it out.

bq. I'm personally not sure how I would utilise this extra information. [..]

How the user processes that information is up to the user. There are plenty of things you can do in case of failures and timeouts, as you mention, but I think that is outside the scope of this conversation. The point is that they would get more information with the proposed changes, so they can investigate with more information. Indeed, one downside is a delayed answer in one scenario: when there are failures and 1 or more nodes don't respond at all.
I can't really tell how frequently this would happen, or how impactful it is to delay the response by some seconds in that very specific scenario, but it seems to me that the benefit of providing more information outweighs that special case's downside 🤷. I agree it's a change in behavior; introducing it as optional makes sense too, then, if it is too impactful.
[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077518#comment-17077518 ] Kevin Gallardo edited comment on CASSANDRA-15642 at 4/7/20, 8:52 PM:

{quote}So what do you propose? Some questions to consider
{quote}
I propose waiting for {{blockFor}} (with a {{CountDownLatch}}, for example, in {{ReadCallback}}, as detailed on CASSANDRA-15543) until the timeout, if it happens. In case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ failed, and _these_ haven't responded before the timeout." If all required responses have come back but some failed, detail who failed and who succeeded. This provides better usability than the current incomplete error messages.

{quote}How long until you're sure you've received all the responses you might ever receive?
{quote}
The trap! Theoretically you can't know; it's an asynchronous distributed system. I am well aware of things like the FLP impossibility result. More seriously, what you can say is "we haven't received anything from X within the given timeframe, which might indicate an issue". This is my personal POV, but I think it's valuable to also provide that information for ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case where some sub-requests fail AND some don't respond at all.

A good example of the usability improvement is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement, they would get an error message like "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y, doesn't have the required schema". Usability would improve vastly if you were able to give the complete view of the problem.
ISTM that in this scenario this information can in most cases be returned without a timeout (that cannot be strongly guaranteed, because again, distributed async system, but usability would be improved in a larger number of situations).

{quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures?
{quote}
I would argue that the user wouldn't be able to act optimally anyway with only partial information. In the case of "failures + timeouts", there would be more value in them knowing that there are also timeouts with some nodes, instead of just the failure.

Thanks for the discussion
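The "cohesive message" proposed in this comment could look something like the following sketch. The {{ReplicaReport}} class and its {{Outcome}} names are hypothetical, not an existing Cassandra API; the point is only that per-host outcomes, including silence, can be bucketed into one readable error.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed error message: once the coordinator
// has waited for every replica or timed out, report all three buckets.
class ReplicaReport {
    enum Outcome { SUCCEEDED, FAILED, NO_RESPONSE }

    static String describe(Map<String, Outcome> outcomes) {
        // Group host names by what happened to their sub-request.
        Map<Outcome, List<String>> byOutcome = new EnumMap<>(Outcome.class);
        outcomes.forEach((host, o) ->
            byOutcome.computeIfAbsent(o, k -> new ArrayList<>()).add(host));
        return "Succeeded: " + byOutcome.getOrDefault(Outcome.SUCCEEDED, List.of())
             + "; failed: " + byOutcome.getOrDefault(Outcome.FAILED, List.of())
             + "; no response before timeout: "
             + byOutcome.getOrDefault(Outcome.NO_RESPONSE, List.of());
    }
}
```

For the schema-disagreement case above, this shape of message would let the error say which hosts had the correct schema and which did not, instead of reporting only the first failure observed.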
[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077518#comment-17077518 ] Kevin Gallardo edited comment on CASSANDRA-15642 at 4/7/20, 8:37 PM: - {quote}So what do you propose? Some questions to consider {quote} I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in {{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it happens. And in case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ that failed, _these_ haven't responded before timeout." If all required responses have come back but there were failures, detail who failed and who succeeded. That provides better usability than the current incomplete error messages. {quote}How long until you're sure you've received all the responses you might ever receive? {quote} The trap! Theoretically, you can't know, it's an asynchronous distributed system. I am well aware of things like the FLP impossibility. More seriously, what you can say is "we haven't received anything from X within the given timeframe, this might indicate an issue". Now, this is my personal POV, but I think it's valuable to also provide that information for Ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case of [some sub-requests fail AND some don't respond at all]. A good example of usability improvement I can mention is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement they would get in the error message "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y don't have the required schema". Usability would be vastly improved if you were able to give the complete view of the problem. 
ISTM that in this scenario, this information can in most cases be returned without timeout. That cannot be strongly guaranteed, because, again, it's a distributed async system, but usability would be improved in a larger number of situations. {quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures? {quote} I would argue that the user wouldn't be able to act optimally anyway with only partial information. In the case of "failures + timeouts" there would be more value in them knowing that there are also timeouts on some nodes, instead of just the failure. Thanks for the discussion was (Author: newkek): {quote}So what do you propose? Some questions to consider {quote} I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in {{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it happens. And in case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ that failed, _these_ haven't responded before timeout." If there is no timeout and all required responses have come back but there were failures, explain who failed and who succeeded. That provides better usability than the current incomplete error messages. {quote}How long until you're sure you've received all the responses you might ever receive? {quote} The trap! Theoretically, you can't know, it's an asynchronous distributed system. I am well aware of things like the FLP impossibility. More seriously, what you can say is "we haven't received anything from X within the given timeframe, this might indicate an issue". Now, this is my personal POV, but I think it's valuable to also provide that information for Ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case of [some sub-requests fail AND some don't respond at all]. 
A good example of usability improvement I can mention is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement they would get in the error message "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y don't have the required schema". Usability would be vastly improved if you were able to give the complete view of the problem. ISTM that in this scenario, this information can in most cases be returned without timeout. That cannot be strongly guaranteed, because, again, it's a distributed async system, but usability would be improved in a larger number of situations. {quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures? {quote} I would argue that the user wouldn't be able to act optimally anyway with only partial
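The waiting behavior proposed above (waiting for {{blockFor}} responses with a {{CountDownLatch}}, then reporting a cohesive per-host summary) could be sketched roughly as follows. This is a hypothetical sketch with made-up names, not Cassandra's actual {{ReadCallback}} code; the key idea is that failed responses also count toward the latch, so a failure does not short-circuit the wait.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposal: count both successes and failures
// toward blockFor, and classify every contacted host as succeeded,
// failed, or not-responded when the wait ends.
class ResponseTracker {
    enum Outcome { SUCCESS, FAILURE }

    private final Map<String, Outcome> outcomes = new ConcurrentHashMap<>();
    private final CountDownLatch latch;

    ResponseTracker(int blockFor) {
        this.latch = new CountDownLatch(blockFor);
    }

    void onResponse(String host, Outcome outcome) {
        // Failures also count down: we keep waiting for the remaining
        // responses instead of failing fast on the first error.
        if (outcomes.putIfAbsent(host, outcome) == null)
            latch.countDown();
    }

    // Builds a cohesive summary: hosts we contacted but never heard from
    // are reported as NO_RESPONSE (i.e. they timed out).
    String await(Iterable<String> contacted, long timeout, TimeUnit unit) {
        boolean completed;
        try {
            completed = latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            completed = false;
        }
        StringBuilder sb = new StringBuilder(completed ? "completed:" : "timed out:");
        for (String host : contacted) {
            Outcome o = outcomes.get(host);
            sb.append(' ').append(host).append('=')
              .append(o == null ? "NO_RESPONSE" : o.name());
        }
        return sb.toString();
    }
}
```

With blockFor=2, a success from host a, a failure from host b, and silence from host c would yield "completed: a=SUCCESS b=FAILURE c=NO_RESPONSE", which is the kind of cohesive message described above.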
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077529#comment-17077529 ] Kevin Gallardo commented on CASSANDRA-15686: bq. I was told that if you have a build that requests > medium on a free tier, CircleCI would downgrade to medium automatically So that was incorrect. Instead you get an error saying: _This job has been blocked because resource-class:xlarge is not available on your plan. Please upgrade to continue building._ So it doesn't build, but as mentioned earlier, the dtests don't build successfully either on medium, so I would think it wouldn't make much difference for free tier users to bump to large or more for the dtests, and it would allow people with {{large}} instances access to build without having to switch their config. Does that make sense? > Improvements in circle CI default config > > > Key: CASSANDRA-15686 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15686 > Project: Cassandra > Issue Type: Bug > Components: Build >Reporter: Kevin Gallardo >Priority: Normal > > I have been looking at and played around with the [default CircleCI > config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml], > a few comments/questions regarding the following topics: > * Python dtests do not run successfully (200-300 failures) on {{medium}} > instances, they seem to only run with small flaky failures on {{large}} > instances or higher > * Python Upgrade tests: > ** Do not seem to run without many failures on any instance types / any > parallelism setting > ** Do not seem to parallelize well, it seems each container is going to > download multiple C* versions > ** Additionally it seems the configuration is not up to date, as currently > we get errors because {{JAVA8_HOME}} is not set > * Unit tests do not seem to parallelize optimally, number of test runners do > not reflect the available CPUs on the container. 
Ideally if # of runners == # > of CPUs, build time is improved, on any type of instances. > ** For instance when using the current configuration, running on medium > instances, build will use 1 junit test runner, but 2 CPUs are available. If > using 2 runners, the build time is reduced from 19min (at the current main > config of parallelism=4) to 12min. > * There are some typos in the file, some dtests say "Run Unit Tests" but > they are JVM dtests (see > [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], > > [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) > So some ways to process these would be: > * Do the Python dtests run successfully for anyone on {{medium}} instances? > If not, would it make sense to bump them to {{large}} so that they can be run > successfully? > * Does anybody ever run the python upgrade tests on CircleCI and what is the > configuration that makes it work? > * Would it make sense to either hardcode the number of test runners in the > unit tests with `-Dtest.runners` in the config file to reflect the number of > CPUs on the instances, or change the build so that it is able to detect the > appropriate number of cores available automatically? > Additionally, it seems this default config file (config.yml) is not as well > maintained as the > [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] > (+its lowres/highres) version in the same folder (from CASSANDRA-14806). > What is the reasoning for maintaining these 2 versions of the build? Could > the better maintained version be used as the default? We could generate a > lowres version of the new config-2_1.yml, and rename it {{config.yml}} so > that it gets picked up by CircleCI automatically instead of the current > default. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077518#comment-17077518 ] Kevin Gallardo commented on CASSANDRA-15642: {quote}So what do you propose? Some questions to consider {quote} I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in {{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it happens. And in case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ that failed, _these_ haven't responded before timeout." If there is no timeout and all required responses have come back but there were failures, explain who failed and who succeeded. That provides better usability than the current incomplete error messages. {quote}How long until you're sure you've received all the responses you might ever receive? {quote} The trap! Theoretically, you can't know, it's an asynchronous distributed system. I am well aware of things like the FLP impossibility. More seriously, what you can say is "we haven't received anything from X within the given timeframe, this might indicate an issue". Now, this is my personal POV, but I think it's valuable to also provide that information for Ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case of [some sub-requests fail AND some don't respond at all]. A good example of usability improvement I can mention is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement they would get in the error message "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y don't have the required schema". Usability would be vastly improved if you were able to give the complete view of the problem. 
ISTM that in this scenario, this information can in most cases be returned without timeout. That cannot be strongly guaranteed, because, again, it's a distributed async system, but usability would be improved in a larger number of situations. {quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures? {quote} I would argue that the user wouldn't be able to act optimally anyway with only partial information. In the case of "failures + timeouts" there would be more value in them knowing that there are also timeouts on some nodes, instead of just the failure. Thanks for the discussion > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. 
> (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. Only information > users should extract out of it is that at least 1 node has failed. > For a
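The race described above, where the fail-fast error message snapshots the counters before slower replicas reply, can be illustrated with a toy sketch (a hypothetical class for illustration only, not the actual handler code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of the fail-fast behavior described above: the error
// message is built the moment the first failure arrives, so responses
// that come back afterwards are not reflected in it.
class FailFastSnapshot {
    private final AtomicInteger successes = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();

    void onSuccess() { successes.incrementAndGet(); }

    // Fail fast: snapshot the counters immediately and build the message,
    // even though more responses may still be in flight.
    String onFailure() {
        failures.incrementAndGet();
        return "Failure: Received " + successes.get()
             + " responses, and " + failures.get() + " failures";
    }
}
```

If the failure lands first at CL=3, the message reads "Received 0 responses, and 1 failures" even though a success may arrive moments later, which is exactly the misleading case in the ticket description.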
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076590#comment-17076590 ] Kevin Gallardo commented on CASSANDRA-15686: bq. I am only speculating but I thought there were tests that spin up network and there are many tests which share disk, and I don't know if we isolate the paths or not. If this is not true then it should be a simple change to bump the number of runners for unit tests (definitely not jvm dtests). Hm, AFAICT running with multiple runners is fine for the unit tests; I have tested different configurations and it didn't seem to cause problems as long as # runners <= # CPUs
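The runner-count point above (keep # runners <= # CPUs, ideally detected automatically rather than hardcoded via `-Dtest.runners`) could be sketched like this; the helper name is hypothetical and not part of the actual build:

```java
// Hypothetical helper sketching automatic detection of the junit runner
// count from the CPUs visible to the JVM, instead of hardcoding
// -Dtest.runners in the config file.
class RunnerCount {
    static int forAvailableCpus() {
        // Per the discussion above, keep # runners <= # CPUs: fewer
        // runners than CPUs leaves cores idle, and more did not help.
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }
}
```

One caveat: inside a container, {{availableProcessors()}} only reflects the container's CPU limit on JVMs with container support (JDK 10+, or 8u191+); older JVMs may report the host's core count instead.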
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076459#comment-17076459 ] Kevin Gallardo commented on CASSANDRA-15686: Confirmed that the build runs as normal using {{config-2_1.yml}} directly, renamed as {{config.yml}}; see build #42 here: https://app.circleci.com/pipelines/github/newkek/cassandra?branch=chg
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076386#comment-17076386 ] Kevin Gallardo commented on CASSANDRA-15686: bq. I was under the impression the larger instances were not in the free tier; Correct, it is not. I was told that if you have a build that requests > medium on a free tier, CircleCI would downgrade to medium automatically, but I have not confirmed that, need to double check. I have seen [other OSS projects|https://github.com/envoyproxy/envoy/blob/master/.circleci/config.yml#L9] using {{xlarge}} resources on their default build config. The problem is that the tests do not run well at all on medium: contributors see the python dtest build available, launch it, and then get confused by the resulting 200-300+ failures. If these don't build successfully at all by default, I am wondering if it is worth keeping them in the "lowres" build at all, given the confusion it causes? bq. Sorry I don't follow. [...] So the only file we should be maintaining is config-2_1.yml. As we discussed offline, instead of using the generated version of the {{config-2_1.yml}} file, we could use {{config-2_1.yml}} directly by default. This would cause less confusion for contributors too; we could also keep a HIGHRES version, as discussed, so that it doesn't change people's workflows. 
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076374#comment-17076374 ] Kevin Gallardo commented on CASSANDRA-15642: bq. What is your definition of complete/reliable? To me it would mean that the user gets a complete view of the states of the other requests necessary to complete a request at a certain CL. Instead, in the current state of things: * say for CL=3 * first response that comes back is a failure * then later the 2 other responses are successful * the error message may say "1 failure, 0 successful response" The "0 successful response" cannot be trusted because some successful responses actually came back, but after the failure. And the "1 failure" cannot be trusted either, because there may have been more failures that would not be reported because of the current fail-fast behavior. The alternate solution you mention to save the state first doesn't provide a complete view of the situation either, as far as I understand. 
The state is saved and things are less inconsistent, but the errors returned to the user may still be misleading as explained above > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. 
We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. The only information > users should extract out of it is that at least 1 node has failed. > For a big improvement in usability, the {{ReadCallback}} and > {{AbstractWriteResponseHandler}} could instead wait for all responses to come > back before unblocking the wait, or let it timeout. This way, the users > will be able to have some trust around the information returned to them. > Additionally, an error that happens first prevents a timeout from happening > because it fails immediately, and so potentially it hides problems with other > replicas. If we were to wait for all responses, we might get a timeout; in > that case we'd also be able to tell whether failures have happened *before* > that timeout, and have a more complete diagnostic where currently you can't detect both > errors at the same time.
[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074827#comment-17074827 ] Kevin Gallardo edited comment on CASSANDRA-15642 at 4/3/20, 7:10 PM: - [~benedict] thanks for having a look at the ticket! bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information I would argue that right now a user wouldn't even be aware that the information is not reliable, as it's only by digging into the server code that I realized that this is happening. The error message tells you "received X errors and X responses" and there is no indication externally to the user/client that the information returned is not complete/reliable. Additionally, given these findings I don't know how the client would be able to properly action on the information, quickly or not, if the info is not reliable? I am not sure that giving partial info quickly is better than cohesive info bq. It is possible to simply report consistent information I am not sure how that would be possible without waiting for all the responses to come back or time out, but happy to be explained if I am missing something was (Author: newkek): [~benedict] thanks for having a look at the ticket! bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information I would argue that right now a user wouldn't even be aware that the information is not reliable, as it's only by digging into the server code that I realized that this is happening. The error message tells you "received X errors and X responses" and there is no indication externally to the user/client that the information returned is not reliable. Additionally, given these findings I don't know how the client would be able to properly action on the information, quickly or not, if the info is not reliable? 
I am not sure that giving partial info quickly is better than cohesive info bq. It is possible to simply report consistent information I am not sure how that would be possible without waiting all the responses come back or timeout, but happy to be explained if I am missing something > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. 
doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. The only information > users should extract from it is that at least 1 node has failed. > For a big improvement in usability, the {{ReadCallback}} and > {{AbstractWriteResponseHandler}} could instead wait for all responses to come > back before unblocking the wait, or let it time out. This way, the users > will be able to have some trust in the information returned to them. > Additionally, an error that happens first prevents a timeout from happening > because
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074827#comment-17074827 ] Kevin Gallardo commented on CASSANDRA-15642: [~benedict] thanks for having a look at the ticket! bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information I would argue that right now a user wouldn't even be aware that the information is not reliable, as it's only by digging into the server code that I realized this is happening. The error message tells you "received X errors and X responses" and there is no external indication to the user/client that the returned information is not reliable. Additionally, given these findings, I don't know how the client would be able to properly act on the information, quickly or not, if the info is not reliable. I am not sure that giving partial info quickly is better than cohesive info. bq. It is possible to simply report consistent information I am not sure how that would be possible without waiting for all the responses to come back or time out, but happy to be corrected if I am missing something. > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > 
ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. The only information > users should extract from it is that at least 1 node has failed. > For a big improvement in usability, the {{ReadCallback}} and > {{AbstractWriteResponseHandler}} could instead wait for all responses to come > back before unblocking the wait, or let it time out. This way, the users > will be able to have some trust in the information returned to them. > Additionally, an error that happens first prevents a timeout from happening > because it fails immediately, and so potentially it hides problems with other > replicas. 
If we were to wait for all responses, we might get a timeout; in > that case we'd also be able to tell whether failures have happened *before* > that timeout, and have a more complete diagnostic, whereas currently you can't detect both > errors at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
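The race described in this ticket can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual {{ReadCallback}} code: the waiter is released either when enough responses arrive or on the very first failure, so counters read after release can miss replies still in flight.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplification of the unblock-on-first-failure logic.
class ResponseHandlerSketch {
    private final int required;                        // e.g. QUORUM = 3
    private final AtomicInteger responses = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private final Object condition = new Object();
    private boolean signaled;

    ResponseHandlerSketch(int required) { this.required = required; }

    void onResponse() {
        if (responses.incrementAndGet() >= required)
            signal();                                  // all required replies arrived
    }

    void onFailure() {
        failures.incrementAndGet();
        signal();                                      // a single failure unblocks the wait
    }

    private void signal() {
        synchronized (condition) { signaled = true; condition.notifyAll(); }
    }

    /** Blocks until signaled; returns an error string if any failure was seen. */
    String await() {
        synchronized (condition) {
            while (!signaled) {
                try { condition.wait(); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); return null; }
            }
        }
        // Counters are read at unblock time: a success arriving just after the
        // first failure is not reflected in the message.
        if (failures.get() > 0)
            return "Failure: Received " + responses.get() + " responses, and "
                   + failures.get() + " failures at CL = " + required;
        return null;
    }
}
```

With {{required = 3}}, a failure arriving first yields "Received 0 responses, and 1 failures" even if successful replies land a moment later, which is exactly the inconsistency described above.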
[jira] [Updated] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15642: --- Description: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when all required number of responses have come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate. (note: I am not entirely sure of the behavior in case of timeouts for now) For example, say a request at CL = QUORUM = 3, a failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which: 1. doesn't make a lot of sense because the CL doesn't match the number of results in the message, so you end up thinking "what happened with the rest of the required CL?" 2. the information is incorrect. We did receive a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. The only information users should extract from it is that at least 1 node has failed. 
For a big improvement in usability, the {{ReadCallback}} and {{AbstractWriteResponseHandler}} could instead wait for all responses to come back before unblocking the wait, or let it time out. This way, the users will be able to have some trust in the information returned to them. Additionally, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we'd also be able to tell whether failures have happened *before* that timeout, and have a more complete diagnostic, whereas currently you can't detect both errors at the same time. was: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when all required number of responses have come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate. (note: I am not entirely sure of the behavior in case of timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. 
doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful. I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it timeout. This is way, the users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents from a timeout to happen because it fails immediately, and so potentially it hides problems with other
[jira] [Updated] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15685: --- Resolution: Duplicate Status: Resolved (was: Open) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074648#comment-17074648 ] Kevin Gallardo commented on CASSANDRA-15685: Agreed > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074077#comment-17074077 ] Kevin Gallardo commented on CASSANDRA-15685: Hey [~dcapwell] afaict I have only seen this test failing on this run, I have not seen it fail on other more recent builds happening after CASSANDRA-15650 indeed. Opened this ticket since I couldn't find a report of this failure. I'm ok closing it if it's been fixed in CASSANDRA-15650 > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074062#comment-17074062 ] Kevin Gallardo commented on CASSANDRA-15631: Awesome, thank you! So for future reference, as mentioned on the proposal thread, the first test suite using AssertJ, to use as an example is now located at [{{org.apache.cassandra.repair.RepairJobTest}}|https://github.com/apache/cassandra/blob/6d343c59f275b1165be08e2256be878ac9b95fec/test/unit/org/apache/cassandra/repair/RepairJobTest.java#L73] > Add AssertJ test dependency > --- > > Key: CASSANDRA-15631 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15631 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest, Test/unit >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > See > [proposal|https://lists.apache.org/thread.html/rc562ec47578d0ae6f346ba9e3d7469c1cd3f8b521a72ddcb2accc47b%40%3Cdev.cassandra.apache.org%3E]. > The goal is to add [AssertJ|https://assertj.github.io/doc/] to the test > framework to allow for more comprehensible and easier to write tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15686: --- Description: I have been looking at and playing around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml]; a few comments/questions regarding the following topics: * Python dtests do not run successfully (200-300 failures) on {{medium}} instances; they seem to run with only small flaky failures on {{large}} instances or higher * Python Upgrade tests: ** Do not seem to run without many failures on any instance type / any parallelism setting ** Do not seem to parallelize well; it seems each container is going to download multiple C* versions ** Additionally it seems the configuration is not up to date, as currently we get errors because {{JAVA8_HOME}} is not set * Unit tests do not seem to parallelize optimally; the number of test runners does not reflect the available CPUs on the container. Ideally if # of runners == # of CPUs, build time is improved, on any type of instance. ** For instance when using the current configuration, running on medium instances, the build will use 1 junit test runner, but 2 CPUs are available. If using 2 runners, the build time is reduced from 19min (at the current main config of parallelism=4) to 12min. * There are some typos in the file: some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) So some ways to process these would be: * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully? * Does anybody ever run the python upgrade tests on CircleCI and what is the configuration that makes it work? 
* Would it make sense to either hardcode the number of test runners in the unit tests with `-Dtest.runners` in the config file to reflect the number of CPUs on the instances, or change the build so that it is able to detect the appropriate number of cores available automatically? Additionally, it seems this default config file (config.yml) is not as well maintained as the [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] (+its lowres/highres) version in the same folder (from CASSANDRA-14806). What is the reasoning for maintaining these 2 versions of the build? Could the better-maintained version be used as the default? We could generate a lowres version of the new config-2_1.yml, and rename it {{config.yml}} so that it gets picked up by CircleCI automatically instead of the current default. was: I have been looking at and played around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml], a few comments/questions regarding the following topics: * Python dtests do not run successfully (200-300 failures) on {{medium}} instances, they seem to only run with small flaky failures on {{large}} instances or higher * Python Upgrade tests: ** Do not seem to run without many failures on any instance types / any parallelism setting ** Do not seem to parallelize well, it seems each container is going to download multiple C* versions ** Additionally it seems the configuration is not up to date, as currently we get errors because {{JAVA8_HOME}} is not set * Unit tests do not seem to parallelize optimally, number of test runners do not reflect the available CPUs on the container. Ideally if # of runners == # of CPUs, build time is improved, on any type of instances. ** For instance when using the current configuration, running on medium instances, build will use 1 junit test runner, but 2 CPUs are available. 
If using 2 runners, the build time is reduced from 19min (at the current main config of parallelism=4) to 12min. * There are some typos in the file, some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) So some ways to process these would be: * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully? * Does anybody ever run the python upgrade tests on CircleCI and what is the configuration that makes it work? * Would it make sense to either hardcode the number of test runners in the unit tests with `-Dtest.runners` in the config file to reflect the number of CPUs on the instances, or change the build so that it is able to detect the appropriate number of core available automatically? Additionally, it seems this default
[jira] [Created] (CASSANDRA-15686) Improvements in circle CI default config
Kevin Gallardo created CASSANDRA-15686: -- Summary: Improvements in circle CI default config Key: CASSANDRA-15686 URL: https://issues.apache.org/jira/browse/CASSANDRA-15686 Project: Cassandra Issue Type: Bug Reporter: Kevin Gallardo I have been looking at and playing around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml]; a few comments/questions regarding the following topics: * Python dtests do not run successfully (200-300 failures) on {{medium}} instances; they seem to run with only small flaky failures on {{large}} instances or higher * Python Upgrade tests: ** Do not seem to run without many failures on any instance type / any parallelism setting ** Do not seem to parallelize well; it seems each container is going to download multiple C* versions ** Additionally it seems the configuration is not up to date, as currently we get errors because {{JAVA8_HOME}} is not set * Unit tests do not seem to parallelize optimally; the number of test runners does not reflect the available CPUs on the container. Ideally if # of runners == # of CPUs, build time is improved, on any type of instance. ** For instance when using the current configuration, running on medium instances, the build will use 1 junit test runner, but 2 CPUs are available. If using 2 runners, the build time is reduced from 19min (at the current main config of parallelism=4) to 12min. * There are some typos in the file: some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) So some ways to process these would be: * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully? * Does anybody ever run the python upgrade tests on CircleCI and what is the configuration that makes it work? 
* Would it make sense to either hardcode the number of test runners in the unit tests with `-Dtest.runners` in the config file to reflect the number of CPUs on the instances, or change the build so that it is able to detect the appropriate number of cores available automatically? Additionally, it seems this default config file (config.yml) is not as well maintained as the [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] (+its lowres/highres) version in the same folder. What is the reasoning for maintaining these 2 versions of the build? Could the better-maintained version be used as the default? We could generate a lowres version of the new config-2_1.yml, and rename it {{config.yml}} so that it gets picked up by CircleCI automatically instead of the current default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
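The second option above (detecting cores instead of hardcoding `-Dtest.runners`) can be sketched as follows. This is an illustration only, not the actual Cassandra build logic; the class name is hypothetical:

```java
// Illustration: derive the junit runner count from the CPUs visible to the
// container, falling back on an explicit -Dtest.runners when one is given.
public class RunnerCount {
    static int runners() {
        // An explicit -Dtest.runners still wins; otherwise match the CPU count.
        // Note: depending on the JVM version, availableProcessors() may report
        // the host's CPUs rather than the container's cgroup limit.
        Integer explicit = Integer.getInteger("test.runners");
        return explicit != null ? explicit
                                : Runtime.getRuntime().availableProcessors();
    }
}
```

On a CircleCI {{medium}} container this would return 2 rather than the hardcoded 1 runner mentioned above, assuming the JVM sees the container's CPU limit.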
[jira] [Updated] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15685: --- Bug Category: Parent values: Correctness(12982) Complexity: Low Hanging Fruit Discovered By: Unit Test Severity: Low Status: Open (was: Triage Needed) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15685: --- Fix Version/s: 4.0-beta > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
Kevin Gallardo created CASSANDRA-15685: -- Summary: flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest Key: CASSANDRA-15685 URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 Project: Cassandra Issue Type: Bug Components: Test/dtest Reporter: Kevin Gallardo Observed in: https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests Failure: {noformat} testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest junit.framework.AssertionFailedError at org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Test and Documentation Plan: no doc changes Status: Patch Available (was: In Progress) > Fix NPE in SimpleReadWriteTest after test framework behavior change > --- > > Key: CASSANDRA-15683 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15683 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > In JVM dtests, all exceptions thrown in a request execution used to be always > wrapped in a RuntimeException. > After CASSANDRA-15650 changes, the behavior has been changed to: If the > exception is a RuntimeException, just rethrow it, otherwise wrap in > RuntimeException, as you can see in: > [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] > This causes the tests that were always extracting the cause from the wrapping > RuntimeException before, to check the root cause of the error, to throw a NPE > when they call {{getCause()}}, tests such as > {{SimpleReadWriteTest#readWithSchemaDisagreement}} and > {{SimpleReadWriteTest#writeWithSchemaDisagreement}}. > Can be fixed by simply not unwrapping the cause in those tests, use the > thrown exception directly, if the behavior of "not always wrapping in > RuntimeException" is agreed to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073863#comment-17073863 ] Kevin Gallardo commented on CASSANDRA-15683: Changes available at: https://github.com/newkek/cassandra/pull/1 > Fix NPE in SimpleReadWriteTest after test framework behavior change > --- > > Key: CASSANDRA-15683 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15683 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > In JVM dtests, all exceptions thrown in a request execution used to be always > wrapped in a RuntimeException. > After CASSANDRA-15650 changes, the behavior has been changed to: If the > exception is a RuntimeException, just rethrow it, otherwise wrap in > RuntimeException, as you can see in: > [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] > This causes the tests that were always extracting the cause from the wrapping > RuntimeException before, to check the root cause of the error, to throw a NPE > when they call {{getCause()}}, tests such as > {{SimpleReadWriteTest#readWithSchemaDisagreement}} and > {{SimpleReadWriteTest#writeWithSchemaDisagreement}}. > Can be fixed by simply not unwrapping the cause in those tests, use the > thrown exception directly, if the behavior of "not always wrapping in > RuntimeException" is agreed to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Description: In JVM dtests, all exceptions thrown during request execution used to always be wrapped in a RuntimeException. After the CASSANDRA-15650 changes, the behavior is: if the exception is a RuntimeException, rethrow it as-is; otherwise, wrap it in a RuntimeException, as you can see in: [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] This causes tests that previously extracted the cause from the wrapping RuntimeException to check the root cause of the error, such as {{SimpleReadWriteTest#readWithSchemaDisagreement}} and {{SimpleReadWriteTest#writeWithSchemaDisagreement}}, to throw an NPE when they call {{getCause()}}. This can be fixed by simply not unwrapping the cause in those tests and using the thrown exception directly, if the behavior of "not always wrapping in a RuntimeException" is agreed to be correct. was: In JVM dtests, all exceptions thrown during request execution used to always be wrapped in a RuntimeException. After the CASSANDRA-15650 changes, the behavior is: if the exception is a RuntimeException, rethrow it as-is; otherwise, wrap it in a RuntimeException, as you can see in: [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] This causes tests that previously extracted the cause from the wrapping RuntimeException to check the root cause of the error, such as {{SimpleReadWriteTest#readWithSchemaDisagreement}} and {{SimpleReadWriteTest#writeWithSchemaDisagreement}}, to throw an NPE when they call {{getCause()}}. This can be fixed by simply not unwrapping the cause in those tests and using the thrown exception directly, if the behavior of "not always wrapping in a RuntimeException" is expected. 
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Bug Category: Parent values: Correctness(12982)Level 1 values: Test Failure(12990) Complexity: Low Hanging Fruit Discovered By: Unit Test Severity: Low Status: Open (was: Triage Needed)
[jira] [Assigned] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo reassigned CASSANDRA-15683: -- Assignee: Kevin Gallardo
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Fix Version/s: (was: 4.0)
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Fix Version/s: 4.0-alpha 4.0
[jira] [Commented] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073839#comment-17073839 ] Kevin Gallardo commented on CASSANDRA-15683: [~dcapwell] [~blerer] [~e.dimitrova] ping since you were author/reviewers on https://issues.apache.org/jira/browse/CASSANDRA-15650
[jira] [Created] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
Kevin Gallardo created CASSANDRA-15683: -- Summary: Fix NPE in SimpleReadWriteTest after test framework behavior change Key: CASSANDRA-15683 URL: https://issues.apache.org/jira/browse/CASSANDRA-15683 Project: Cassandra Issue Type: Bug Components: Test/dtest Reporter: Kevin Gallardo In JVM dtests, all exceptions thrown during request execution used to always be wrapped in a RuntimeException. After the CASSANDRA-15650 changes, the behavior is: if the exception is a RuntimeException, rethrow it as-is; otherwise, wrap it in a RuntimeException, as you can see in: [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] This causes tests that previously extracted the cause from the wrapping RuntimeException to check the root cause of the error, such as {{SimpleReadWriteTest#readWithSchemaDisagreement}} and {{SimpleReadWriteTest#writeWithSchemaDisagreement}}, to throw an NPE when they call {{getCause()}}. This can be fixed by simply not unwrapping the cause in those tests and using the thrown exception directly, if the behavior of "not always wrapping in a RuntimeException" is expected.
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072930#comment-17072930 ] Kevin Gallardo commented on CASSANDRA-15631: Great, also FWIW the tests are run on each commit at https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15631-trunk > Add AssertJ test dependency > --- > > Key: CASSANDRA-15631 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15631 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest, Test/unit >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > See > [proposal|https://lists.apache.org/thread.html/rc562ec47578d0ae6f346ba9e3d7469c1cd3f8b521a72ddcb2accc47b%40%3Cdev.cassandra.apache.org%3E]. > The goal is to add [AssertJ|https://assertj.github.io/doc/] to the test framework to allow for more comprehensible and easier-to-write tests.
[jira] [Updated] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-14520: --- Resolution: Cannot Reproduce Status: Resolved (was: Open) > ClosedChannelException handled as FSError > - > > Key: CASSANDRA-14520 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14520 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Blake Eggleston >Assignee: Kevin Gallardo >Priority: Urgent > Fix For: 4.0 > > > After the messaging service netty refactor, I’ve seen a few instances where a closed socket causes a ClosedChannelException (an IOException subclass) to be thrown. The exception is caught by ChannelProxy, interpreted as a disk error, and is then re-thrown as an FSError, causing the node to be shut down.
[jira] [Issue Comment Deleted] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-14520: --- Comment: was deleted (was: Discussed this with [~blerer] in detail; it seems the second {{ClosedByInterruptException}} appearing in the stack trace was caused by the first exception. It is hard to track exactly how that was happening because of the large refactoring around streaming and other parts of the code since the ticket was opened. It is safer to keep the current behavior in {{ChannelProxy}}, which is to throw an Error in case of an IOException. Will close the issue; if it happens again with the current state of the code, we can take a closer look and figure out whether the {{ClosedByInterruptException}} is still problematic.)
[jira] [Commented] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072904#comment-17072904 ] Kevin Gallardo commented on CASSANDRA-14520: Discussed this with [~blerer] in detail; it seems the second {{ClosedByInterruptException}} appearing in the stack trace was caused by the first exception. It is hard to track exactly how that was happening because of the large refactoring around streaming and other parts of the code since the ticket was opened. It is safer to keep the current behavior in {{ChannelProxy}}, which is to throw an Error in case of an IOException. Will close the issue; if it happens again with the current state of the code, we can take a closer look and figure out whether the {{ClosedByInterruptException}} is still problematic.
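The behavior being kept can be sketched in a self-contained way (hypothetical names; {{FSReadError}} below is only a stand-in for Cassandra's real class, not its API): any IOException surfacing from the channel, including a ClosedChannelException caused by a closed socket rather than a disk fault, is rethrown as a disk-error-style Error.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

public class ProxySketch {
    // stand-in for the real org.apache.cassandra.io.FSReadError
    static class FSReadError extends Error {
        FSReadError(Throwable cause) { super(cause); }
    }

    // ChannelProxy-style wrapper: every IOException becomes a fatal Error,
    // even when the cause (a closed channel) is network-side, not disk-side
    static void read() {
        try {
            throw new ClosedChannelException(); // simulate a closed socket/channel
        } catch (IOException e) {
            throw new FSReadError(e);
        }
    }

    static String describeFailure() {
        try {
            read();
            return "no error";
        } catch (FSReadError e) {
            return "FSReadError caused by " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure());
    }
}
```

This illustrates why the original report was concerning: ClosedChannelException extends IOException, so the disk-error path fires and shuts the node down.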
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072859#comment-17072859 ] Kevin Gallardo commented on CASSANDRA-15631: Yeah, I have been trying to figure out what was wrong with this dependency. It seems a transitive dependency is being pulled from Maven Central via http rather than https, and consequently the download fails because the repository no longer supports HTTP. I.e. I added the assertj-core dependency, and its parent, assertj-parent-pom, pulls in a further dependency, but ant tries to fetch it over http. It seems related to what's described here: https://clutcher.github.io/post/hybris/maven_central_repository_http_issue/ {noformat} Starting from January 15, 2020 maven central repository is not supporting access via http://repo1.maven.org/maven2 due to security reasons. Unfortunately under the hood hybris uses ant-maven-task, which is not maintained since 2011 and has hardcoded link to http version of maven central. 
{noformat} Without the explicit dependency we get an error like: {noformat} /Users/kevingallardo/Documents/workspace/cassandra-trunk/build.xml:905: Unable to resolve artifact: Unable to get dependency information: Unable to read the metadata file for artifact 'org.assertj:assertj-core:jar': POM 'org.junit:junit-bom' not found in repository: Unable to download the artifact from any repository org.junit:junit-bom:pom:5.6.0 from the specified remote repositories: central (http://repo1.maven.org/maven2) for project org.junit:junit-bom org.assertj:assertj-core:jar:3.15.0 from the specified remote repositories: apache (https://repo.maven.apache.org/maven2), central (https://repo1.maven.org/maven2) Path to dependency: 1) org.apache.cassandra:cassandra-test-deps:jar:4.0-alpha4-SNAPSHOT {noformat}
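For reference, one common workaround for tooling that hardcodes the http URL of Maven Central (an assumption on my side, not something verified against ant-maven-tasks here) is to declare an https mirror of {{central}} in {{~/.m2/settings.xml}}, so legacy resolvers are redirected:

```xml
<!-- Hypothetical workaround sketch: mirror the legacy http "central" repo
     to the https endpoint so tools hardcoding
     http://repo1.maven.org/maven2 still resolve. -->
<settings>
  <mirrors>
    <mirror>
      <id>central-https</id>
      <mirrorOf>central</mirrorOf>
      <url>https://repo1.maven.org/maven2</url>
    </mirror>
  </mirrors>
</settings>
```

Whether this applies depends on the resolver honoring settings.xml mirrors; declaring the explicit dependency in the build, as done on the branch, side-steps the question entirely.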
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071854#comment-17071854 ] Kevin Gallardo commented on CASSANDRA-15631: [~dcapwell] I have added a new commit on the same branch that fixes the issues you mentioned. Thanks
[jira] [Updated] (CASSANDRA-15676) flaky test testWriteUnknownResult- org.apache.cassandra.distributed.test.CasWriteTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15676: --- Fix Version/s: 4.0-beta > flaky test testWriteUnknownResult- > org.apache.cassandra.distributed.test.CasWriteTest > - > > Key: CASSANDRA-15676 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15676 > Project: Cassandra > Issue Type: Task > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Failure observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/33/workflows/54007cf7-4424-4ec1-9655-665f6044e6d1/jobs/187/tests > {noformat} > testWriteUnknownResult - org.apache.cassandra.distributed.test.CasWriteTest > junit.framework.AssertionFailedError: Expecting cause to be > CasWriteUncertainException > at > org.apache.cassandra.distributed.test.CasWriteTest.testWriteUnknownResult(CasWriteTest.java:257) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {noformat}
[jira] [Created] (CASSANDRA-15676) flaky test testWriteUnknownResult- org.apache.cassandra.distributed.test.CasWriteTest
Kevin Gallardo created CASSANDRA-15676: -- Summary: flaky test testWriteUnknownResult- org.apache.cassandra.distributed.test.CasWriteTest Key: CASSANDRA-15676 URL: https://issues.apache.org/jira/browse/CASSANDRA-15676 Project: Cassandra Issue Type: Task Components: Test/dtest Reporter: Kevin Gallardo Failure observed in: https://app.circleci.com/pipelines/github/newkek/cassandra/33/workflows/54007cf7-4424-4ec1-9655-665f6044e6d1/jobs/187/tests {noformat} testWriteUnknownResult - org.apache.cassandra.distributed.test.CasWriteTest junit.framework.AssertionFailedError: Expecting cause to be CasWriteUncertainException at org.apache.cassandra.distributed.test.CasWriteTest.testWriteUnknownResult(CasWriteTest.java:257) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) {noformat}
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067816#comment-17067816 ] Kevin Gallardo commented on CASSANDRA-15631: Hey [~dcapwell], thank you very much for looking into it. For some reason I did not get a notification of your comment and have done some more work on the branch since then. Notably, I bumped the AssertJ dependency to 3.15. bq. at line is there a reason you added the round? It slightly changes the original test behavior; I don't mind, but it is a semantic difference. Right, I had rounded because {{measureDeep()}} returns a long and the signature of {{isLessThan()}} doesn't accept a double, but instead I should change the compared value to a double so I don't have to round. bq. at line you dropped this assert Not sure which assert you were referring to since the code has changed since your comment, but tracing back to the line number you linked on trunk points to the assert on {{session.getSyncingTasks().size()}}. In my branch it was changed to {{assertThat(session.getSyncingTasks()).isEmpty()}}, so the functionality should be equivalent. Please let me know if you were referring to another assert. bq. at line should we change the assert message to be more clear? This will only print out the numbers, may be nice to comment that they are object sizes. Makes sense, will change that. For CI, you can see the runs at https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15631-trunk . Latest run (#17) had a failure that didn't seem related to these changes, and the test passed when I ran it locally. 
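The rounding point can be illustrated in plain Java (hypothetical numbers, not values from the actual test): rounding the double bound so it fits a long comparison is not equivalent to widening the measured long to a double, so the two checks can disagree near the boundary.

```java
public class BoundSketch {
    // rounds the bound to fit a long comparison -- the rounded variant
    static boolean roundedCheck(long measured, double bound) {
        return measured < Math.round(bound);
    }

    // widens the measured long to a double instead -- no precision loss in the bound
    static boolean wideningCheck(long measured, double bound) {
        return (double) measured < bound;
    }

    public static void main(String[] args) {
        long measuredBytes = 1049L;   // stand-in for a measureDeep()-style result
        double bound = 1049.4;        // expected size with a fractional tolerance
        // Math.round(1049.4) == 1049, so the rounded check rejects a value
        // the double comparison accepts -- a small but real semantic difference.
        System.out.println(roundedCheck(measuredBytes, bound));  // false
        System.out.println(wideningCheck(measuredBytes, bound)); // true
    }
}
```

This is why changing the compared value to a double, rather than rounding, preserves the original test semantics.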
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067780#comment-17067780 ] Kevin Gallardo commented on CASSANDRA-15543: Thanks [~blerer]! > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code}
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066164#comment-17066164 ] Kevin Gallardo commented on CASSANDRA-15631: Changes proposed in https://github.com/newkek/cassandra/tree/15631-trunk. As proposed on the ML, I added the test dependency and also went ahead and rewrote one randomly chosen test to show what a full test written with AssertJ assertions would look like. Also included a custom assertion to show how it helps make the intent of tests clearer.
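For illustration, the custom-assertion pattern looks roughly like the following (a library-free sketch with hypothetical names, not the class from the branch): a small fluent wrapper around a domain value that makes the test read as its intent rather than as raw comparisons.

```java
// Hypothetical fluent assertion in the AssertJ style, self-contained
// (the real thing would extend org.assertj.core.api.AbstractAssert).
public class ClusterAssert {
    private final int liveNodes;

    private ClusterAssert(int liveNodes) { this.liveNodes = liveNodes; }

    // entry point, mirroring AssertJ's assertThat(...) naming
    static ClusterAssert assertThatCluster(int liveNodes) {
        return new ClusterAssert(liveNodes);
    }

    // intention-revealing check with a descriptive failure message
    ClusterAssert hasLiveNodes(int expected) {
        if (liveNodes != expected)
            throw new AssertionError("expected " + expected + " live nodes but was " + liveNodes);
        return this; // return this for fluent chaining
    }

    public static void main(String[] args) {
        assertThatCluster(3).hasLiveNodes(3); // reads as the test's intent
        System.out.println("ok");
    }
}
```

The payoff is in the test body: a failing check explains itself, and chained calls document the scenario being asserted.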
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066073#comment-17066073 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/24/20, 7:34 PM: -- After review from [~blerer] I have made some changes and the latest CI run is [here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] as #11. All the unit and JVM dtests ran successfully. was (Author: newkek): After review from [~blerer] I have made some changes and the latest CI run is [here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] as #11. All the unit and JVM dtests run successfully.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066110#comment-17066110 ] Kevin Gallardo commented on CASSANDRA-15543: Just pushed a small commit that removes some unused imports I forgot to clean up: [https://github.com/newkek/cassandra/commit/55e94d94168ec62b01bdb764e1ea74d3d217ec41] Feel free to squash > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066073#comment-17066073 ] Kevin Gallardo commented on CASSANDRA-15543: After review from [~blerer] I have made some changes and the latest CI run is [here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] as #11. All the unit and JVM dtests run successfully. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063456#comment-17063456 ] Kevin Gallardo commented on CASSANDRA-14520: This exception is caused when a read of a local SSTable gets interrupted, typically by another thread. When this interruption happens, the error is caught by the {{ChannelProxy}} and rethrown as a {{FsReadError}}. My understanding then is that the {{FsReadError}} is not caught, but handled at a higher level in the Cassandra daemon by the {{DefaultFsErrorHandler}}. The {{DefaultFsErrorHandler}} has multiple ways of treating an FsError, decided by how the {{disk_failure_policy}} is configured; if configured to {{stop}}, it will indeed stop the daemon/node. It seems that by default the {{disk_failure_policy}} is {{ignore}} though, which would mean it doesn't stop the node. I am not 100% sure that, since the CASSANDRA-15066 changes, long reads can no longer be interrupted by another thread, i.e. it seems this exception could still be thrown at any time. In any case, it seems a long read of a local SSTable being interrupted shouldn't ever trigger the node to shut down? In which case I would suggest catching the {{ClosedByInterruptException}} in {{ChannelProxy}} and rethrowing it as a runtime exception, which should be caught in {{NettyStreamingMessageSender.FileStreamTask#run}} for more graceful exception handling. Does that make sense? [~bdeggleston] [~benedict] > ClosedChannelException handled as FSError > - > > Key: CASSANDRA-14520 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14520 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Blake Eggleston >Assignee: Kevin Gallardo >Priority: Urgent > Fix For: 4.0 > > > After the messaging service netty refactor, I’ve seen a few instances where a > closed socket causes a ClosedChannelException (an IOException subclass) to be > thrown.
The exception is caught by ChannelProxy, interpreted as a disk error, > and is then re-thrown as an FSError, causing the node to be shutdown. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
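The crux of the bug is the exception hierarchy: {{ClosedByInterruptException}} extends {{ClosedChannelException}}, which extends {{IOException}}, so a blanket {{IOException}} catch in {{ChannelProxy}} treats an interrupted or closed channel as a disk fault. A hedged sketch (the {{classify}} helper is hypothetical, not Cassandra code) of distinguishing the cases before wrapping anything as an FS error:

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;

// Hypothetical helper illustrating the proposed fix: separate
// interrupt/close-related I/O errors from genuine disk errors, instead
// of wrapping every IOException as an FSError.
public class ChannelErrorClassifier
{
    static String classify(IOException e)
    {
        // Order matters: ClosedByInterruptException IS-A ClosedChannelException,
        // so the most specific type must be checked first.
        if (e instanceof ClosedByInterruptException)
            return "interrupted";     // a thread interrupt closed the channel; not a disk fault
        if (e instanceof ClosedChannelException)
            return "channel-closed";  // socket/channel closed; still not a disk fault
        return "disk-error";          // anything else may be a real filesystem error
    }

    public static void main(String[] args)
    {
        System.out.println(classify(new ClosedByInterruptException())); // interrupted
        System.out.println(classify(new ClosedChannelException()));     // channel-closed
        System.out.println(classify(new IOException("read failed")));   // disk-error
    }
}
```

Only the last category would then feed into the {{disk_failure_policy}} machinery; the first two would be rethrown as runtime exceptions for the streaming code to handle.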
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061801#comment-17061801 ] Kevin Gallardo commented on CASSANDRA-15543: I have launched builds of this branch, you can see the results [on CircleCI|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061740#comment-17061740 ] Kevin Gallardo commented on CASSANDRA-15543: Great, thanks Benedict and Benjamin. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061159#comment-17061159 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/17/20, 8:07 PM: -- Hi [~benedict], I've seen you have done quite a bit of work in {{ReadCallback}}, I was wondering if you had some cycles to take a look at the patch above (and optionally the discussion going on in this ticket)? Thanks was (Author: newkek): Hi [~benedict], I've seen you have done quite a bit of work in {{ReadCallback}}, I was wondering if you had some cycles to take a look at the patch above? Thanks > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061159#comment-17061159 ] Kevin Gallardo commented on CASSANDRA-15543: Hi [~benedict], I've seen you have done quite a bit of work in {{ReadCallback}}, I was wondering if you had some cycles to take a look at the patch above? Thanks > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059049#comment-17059049 ] Kevin Gallardo commented on CASSANDRA-15642: [~e.dimitrova] fyi > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > At, say, CL = QUORUM = 3, the failed request may complete first, then a > successful one completes, and another fails. If the exception is thrown fast > enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which 1. doesn't make a lot of sense because the CL doesn't match the > previous information, but 2. the information is incorrect. 
We received a > successful response, only it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. We can only know > that at least 1 node has failed, or not if the response is successful. > I am suggesting that for a big improvement in usability, the ReadCallback and > AbstractWriteResponseHandler wait for all responses to come back before > unblocking the wait, or let it time out. This way, users will be able > to have some trust around the numbers returned to them. Also we would be able > to return more information this way. > Right now, an error that happens first prevents a timeout from happening > because it fails immediately, and so potentially it hides problems with other > replicas. If we were to wait for all responses, we might get a timeout; in > that case we'd also be able to tell whether failures have happened *before* > that timeout, and have a more complete view, instead of the current one where > you can't detect both situations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15642: --- Description: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when the required number of responses has come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate. (note: I am not entirely sure of the behavior in case of timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful.
I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it time out. This way, users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we'd also be able to tell whether failures have happened *before* that timeout, and have a more complete view, instead of the current one where you can't detect both situations. was: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when the required number of responses has come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user is not accurate. (note: I am not entirely sure of the behavior around timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails.
If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful. I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it time out. This way, users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout,
[jira] [Created] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
Kevin Gallardo created CASSANDRA-15642: -- Summary: Inconsistent failure messages on distributed queries Key: CASSANDRA-15642 URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 Project: Cassandra Issue Type: Improvement Components: Consistency/Coordination Reporter: Kevin Gallardo As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when the required number of responses has come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user is not accurate. (note: I am not entirely sure of the behavior around timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful.
I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it time out. This way, users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we'd also be able to tell whether failures have happened *before* that timeout, and have a more complete view, instead of the current one where you can't detect both situations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
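The discrepancy described above — the failure counters mutating while the error message is being built — can be sketched with an immutable-snapshot approach (the class and method names below are hypothetical illustrations, not the actual {{ReadCallback}} code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: take a defensive, effectively-immutable snapshot of
// the shared failures map at the moment the error is built, so the reported
// numbers cannot drift as more responses arrive concurrently.
public class FailureSnapshot
{
    static String buildErrorMessage(int received, Map<String, String> failures)
    {
        // Copy once; later mutations of `failures` no longer affect the message.
        Map<String, String> snapshot = new HashMap<>(failures);
        return "Received " + received + " responses, and " + snapshot.size() + " failures";
    }

    public static void main(String[] args)
    {
        Map<String, String> failures = new ConcurrentHashMap<>();
        failures.put("node1", "READ_FAILURE");

        String msg = buildErrorMessage(0, failures);

        // A late failure arriving after the message was built does not
        // retroactively change what the user is told.
        failures.put("node2", "TIMEOUT");
        System.out.println(msg); // Received 0 responses, and 1 failures
    }
}
```

This only makes the message internally consistent at the instant it is produced; the larger proposal in the ticket (waiting for all responses or the timeout) is what would make the numbers complete.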
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058756#comment-17058756 ] Kevin Gallardo commented on CASSANDRA-15543: [~ifesdjeen] since you were the author of the test originally, would you have some time to review the fixes above? Thank you > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15543: --- Test and Documentation Plan: No doc change needed Status: Patch Available (was: In Progress) > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058206#comment-17058206 ] Kevin Gallardo commented on CASSANDRA-15543: Changes available at https://github.com/newkek/cassandra/tree/15543-trunk/ > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057135#comment-17057135 ] Kevin Gallardo commented on CASSANDRA-15543: bq. Also, RequestFailureException is not used only here in the codebase... Right, I was planning on making the immutable copy changes for calls to WriteFailureException too > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/11/20, 3:23 PM: -- Thanks for the reply, Ekaterina, and for confirming the behavior described. Got some more time to think about this, and it seems the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can properly advocate/discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. was (Author: newkek): Thanks for the reply, Ekaterina. Got some more time to think about this, and I think the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. 
> flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/11/20, 3:23 PM: -- Thanks for the reply, Ekaterina, and for confirming the behavior described. Got some more time to think about this, and it seems the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can properly advocate/discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. Does that make sense? was (Author: newkek): Thanks for the reply, Ekaterina, and for confirming the behavior described. Got some more time to think about this, and it seems the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can properly advocate/discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. 
> flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091 ] Kevin Gallardo commented on CASSANDRA-15543: Thanks for the reply, Ekaterina. Got some more time to think about this, and I think the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15631) Add AssertJ test dependency
Kevin Gallardo created CASSANDRA-15631: -- Summary: Add AssertJ test dependency Key: CASSANDRA-15631 URL: https://issues.apache.org/jira/browse/CASSANDRA-15631 Project: Cassandra Issue Type: Improvement Components: Test/dtest, Test/unit Reporter: Kevin Gallardo Assignee: Kevin Gallardo See [proposal|https://lists.apache.org/thread.html/rc562ec47578d0ae6f346ba9e3d7469c1cd3f8b521a72ddcb2accc47b%40%3Cdev.cassandra.apache.org%3E]. The goal is to add [AssertJ|https://assertj.github.io/doc/] to the test framework to allow for tests that are more comprehensible and easier to write.
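As an illustration of the readability goal, here is a tiny hand-rolled fluent helper in the AssertJ style. This is not AssertJ itself, just a plain-JDK sketch of the chained {{assertThat(...)}} idiom the proposal is about:

```java
import java.util.List;
import java.util.function.Predicate;

// Not AssertJ: a minimal hand-rolled helper that mimics the
// assertThat(...).hasSize(...).contains(...) chaining style, so the
// readability gain can be shown without the dependency.
public class FluentAssertSketch {
    public static <T> ListAssert<T> assertThat(List<T> actual) {
        return new ListAssert<>(actual);
    }

    public static class ListAssert<T> {
        private final List<T> actual;
        ListAssert(List<T> actual) { this.actual = actual; }

        public ListAssert<T> hasSize(int expected) {
            if (actual.size() != expected)
                throw new AssertionError("expected size " + expected + " but was " + actual.size());
            return this; // returning `this` is what makes the chain read like prose
        }

        public ListAssert<T> contains(T element) {
            if (!actual.contains(element))
                throw new AssertionError("expected to contain " + element);
            return this;
        }

        public ListAssert<T> allMatch(Predicate<T> predicate) {
            if (!actual.stream().allMatch(predicate))
                throw new AssertionError("an element did not match the predicate");
            return this;
        }
    }

    public static void main(String[] args) {
        List<String> endpoints = List.of("127.0.0.1:7012", "127.0.0.2:7012");
        // Reads close to prose; each step reports a descriptive failure message.
        assertThat(endpoints).hasSize(2).contains("127.0.0.1:7012").allMatch(e -> e.endsWith(":7012"));
        System.out.println("ok");
    }
}
```

The real AssertJ library provides this chaining (plus rich diffs and exception assertions) out of the box; the sketch only shows why the style is easier to read and write than bare JUnit asserts.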
[jira] [Assigned] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo reassigned CASSANDRA-14520: -- Assignee: Kevin Gallardo > ClosedChannelException handled as FSError > - > > Key: CASSANDRA-14520 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14520 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Blake Eggleston >Assignee: Kevin Gallardo >Priority: Urgent > Fix For: 4.0 > > > After the messaging service netty refactor, I’ve seen a few instances where a > closed socket causes a ClosedChannelException (an IOException subclass) to be > thrown. The exception is caught by ChannelProxy, interpreted as a disk error, > and is then re-thrown as an FSError, causing the node to be shutdown.
[jira] [Commented] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056029#comment-17056029 ] Kevin Gallardo commented on CASSANDRA-14520: Will start investigating this.
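As a sketch of the distinction this ticket hinges on (hypothetical classifier, not Cassandra's actual ChannelProxy code): a {{ClosedChannelException}} signals a socket lifecycle event, not a disk fault, so an error handler should separate it out before escalating an {{IOException}} to a fatal filesystem error:

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

// Hypothetical sketch: classify IOExceptions before escalating them as
// fatal filesystem errors, instead of treating every IOException alike.
public class ErrorClassificationSketch {
    public enum Severity { FATAL_FS_ERROR, RECOVERABLE_CHANNEL_CLOSE }

    public static Severity classify(IOException e) {
        // A closed socket/channel is expected during disconnects and should
        // not shut the node down the way a genuine disk error would.
        if (e instanceof ClosedChannelException)
            return Severity.RECOVERABLE_CHANNEL_CLOSE;
        return Severity.FATAL_FS_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(classify(new ClosedChannelException()));       // RECOVERABLE_CHANNEL_CLOSE
        System.out.println(classify(new IOException("disk read error"))); // FATAL_FS_ERROR
    }
}
```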
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/9/20, 2:08 PM: - Sounds good, thanks, hope you had a good weekend :) As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. was (Author: newkek): Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. 
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/9/20, 2:07 PM: - Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. was (Author: newkek): Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. 
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990 ] Kevin Gallardo commented on CASSANDRA-15543: Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053733#comment-17053733 ] Kevin Gallardo commented on CASSANDRA-15543: {{RequestFailureException}} uses the map's size to report the number of failures: {code} protected RequestFailureException(ExceptionCode code, ConsistencyLevel consistency, int received, int blockFor, Map<InetAddressAndPort, RequestFailureReason> failureReasonByEndpoint) { super(code, String.format("Operation failed - received %d responses and %d failures: %s", received, failureReasonByEndpoint.size(), buildFailureString(failureReasonByEndpoint))); {code} It makes a copy of the map later in the constructor, but I suppose there is still a chance the map's contents differ between the call to {{size()}} and the entries iterated in {{buildFailureString()}}.
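The window described here can be sketched outside Cassandra with plain JDK types (hypothetical names): a message built from a live map reads it twice, once for {{size()}} and once for the entries, so a concurrent writer can make the two reads disagree, while a snapshot taken up front cannot:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the inconsistency: building a message from a live map reads it
// twice (size, then entries), so a concurrent writer can change it between
// the two reads. Copying first makes both reads see the same snapshot.
public class SnapshotSketch {
    public static String buildMessage(Map<String, String> failureReasonByEndpoint) {
        return String.format("%d failures: %s",
                             failureReasonByEndpoint.size(),
                             String.join(", ", failureReasonByEndpoint.values()));
    }

    public static void main(String[] args) {
        Map<String, String> live = new ConcurrentHashMap<>();
        live.put("127.0.0.2:7012", "INCOMPATIBLE_SCHEMA");

        // Defensive copy taken once, before any read: size and contents
        // can no longer diverge, even if `live` keeps being mutated.
        Map<String, String> snapshot = new HashMap<>(live);
        live.put("127.0.0.3:7012", "INCOMPATIBLE_SCHEMA"); // late concurrent update

        System.out.println(buildMessage(snapshot)); // 1 failures: INCOMPATIBLE_SCHEMA
        System.out.println(buildMessage(live));     // 2 failures: ...
    }
}
```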
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:45 PM: - Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. EDIT: I confused the number of received and the number of failures, I think what is below is incorrect. -Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with- {code:java} new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code} -Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with:- {code:java} Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} -"2" failures but only 1 error message. 
(number of {{received}} is irrelevant)- -Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario:- - -1 error triggers awaitResults() to unblock- - -call to {{new ReadFailureException(received, failureReasonByEndpoint)}}- - -value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given- - -concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map- - -{{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message- -We end up with the error message:- {code:java} Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} -"1" failure but 2 error messages. (number of {{received}} is irrelevant)- -One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.- -If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think.- was (Author: newkek): Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053724#comment-17053724 ] Kevin Gallardo commented on CASSANDRA-15543: Actually I just realized that the value passed in the {{ReadFailureException}} constructor is {{received}}; I mistakenly thought it was the number of failures. I need to re-think the comment above.
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:04 PM: - Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. 
(number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think. was (Author: newkek): Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve this second issue would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647 ] Kevin Gallardo commented on CASSANDRA-15543: Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. 
(number of {{received}} is irrelevant) One way to solve this second issue would be to pass a copy of the {{failureReasonByEndpoint}} map to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timed out, the 2 issues wouldn't be problematic, I think.
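A possible shape for that fix, sketched with plain JDK types (hypothetical class, not Cassandra's actual {{RequestFailureException}}): take the copy once, up front, and derive both the count and the message from the same snapshot:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed fix: the factory snapshots the map
// before anything reads it, so the failure count and the failure list in
// the message can never disagree, even if the live map keeps mutating.
public class FailureExceptionSketch extends RuntimeException {
    private final Map<String, String> failureReasonByEndpoint;

    private FailureExceptionSketch(int received, Map<String, String> snapshot) {
        super(String.format("Operation failed - received %d responses and %d failures: %s",
                            received, snapshot.size(), String.join(", ", snapshot.values())));
        this.failureReasonByEndpoint = snapshot;
    }

    // Copy happens exactly once, here; everything downstream sees the snapshot.
    public static FailureExceptionSketch create(int received, Map<String, String> liveFailures) {
        return new FailureExceptionSketch(received, Map.copyOf(liveFailures));
    }

    public Map<String, String> failures() { return failureReasonByEndpoint; }

    public static void main(String[] args) {
        Map<String, String> live = new ConcurrentHashMap<>();
        live.put("127.0.0.2:7012", "INCOMPATIBLE_SCHEMA");
        FailureExceptionSketch e = create(1, live);
        live.put("127.0.0.3:7012", "INCOMPATIBLE_SCHEMA"); // late failure, after snapshot
        System.out.println(e.getMessage()); // count and reason list agree: 1 failure
    }
}
```

Cassandra's real exception classes differ in types and details; the point here is only the ordering: copy first, then read the size and the entries.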
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053578#comment-17053578 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 4:36 PM: - I have been looking into this issue and trying to reproduce it locally. Did not have much luck, re-executed the test about 2500 times (let it run overnight) and only managed to reproduce something ~inconsistent in one case. The test setup is: execute a schema alteration request on 1 node, don't let the node propagate the schema changes to other nodes, then execute a select at Consistency = ALL on that same node and expect a failure, because the 2 other nodes did not get the schema alteration request. So the message we expect from the Select request would be the following: {code:java} Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} Node1's response is successful (because it has the schema changes), but all the others fail because they have a stale schema. The closest I got to reproducing something inconsistent to this was: {code:java} Operation failed - received 0 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} But also let's consider the ones Brandon had observed. I have looked into some of the places where {{ReadFailureException}} are generated, and landed in the [{{ReadCallback}}|https://github.com/apache/cassandra/blob/aed15bfb01e753f9bac8b149b382c7a7c8d33183/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L162] class. The logic of the code is: - call {{awaitResults()}} - in {{awaitResults()}}: -- block on {{await()}} -- when {{await()}} exits, check what responses we got and whether errors came back Also looked into {{onFailure()}} method, the logic is the following: {code:java} int n = waitingFor(from) ? 
failuresUpdater.incrementAndGet(this) : failures; failureReasonByEndpoint.put(from, failureReason); if (blockFor + n > replicaPlan().contacts().size()) { condition.signalAll(); // this signals as soon as 1 failure happens } {code} {{waitingFor(from)}} is supposed to filter out errors coming from non-local nodes, it seems. In this test it always returns true, so for simplicity we'll assume it does. Same for {{blockFor}} and {{replicaPlan().contacts().size()}}: these are both {{3}} in this test, though I suppose they may differ in other use cases. As we can see above, once 1 failure happens, the {{condition}} will be signaled to unblock waiting callers. When {{awaitResults()}} unblocks, it checks the value of {{failures}} to see how many errors happened and reports it as a {{ReadFailureException}}. The code is designed so that the {{condition}} is freed when either all responses are successful, or one failure happened. Now my understanding is that, in certain circumstances, we end up returning an error message as soon as 1 error happens, without waiting for all responses to come back, and the values of {{failures}} and {{received}} are not guaranteed to be at a certain value, except that in case of a failure we know {{failures}} will be > 0. I am unclear on what the expected behavior should be here. From my POV, the most reasonable logic would be that {{await()}} only exits once all responses have come back, failed or successful, or it times out. Therefore I would replace the {{SimpleCondition condition}} with a {{CountDownLatch}} initialized to the number of expected responses, {{countDown()}} on {{onReceived()}} and {{onFailure()}}, and have {{await()}} use that CountDownLatch. If the behavior of failing as soon as 1 failure happens is preferred, then the logic can stay that way, but the test needs to be modified to account for the possibility that only 1 error and no successful response may be returned in the {{ReadFailureException}}. 
[~brandon.williams] [~e.dimitrova] got any insights? was (Author: newkek): I have been looking into this issue and trying to reproduce it locally. Did not have much luck, re-executed the test about 2500 times (let it run overnight) and only managed to reproduce something ~inconsistent in one case. The test setup is: execute a schema alteration request on 1 node, don't let the node propagate the schema changes to other nodes, then execute a select at Consistency = ALL on that same node and expect a failure, because the 2 other nodes did not get the schema alteration request. So the message we expect from the Select request would be the following: {code:java} Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} Node1's response is successful (because it has the schema changes), but all the others fail because
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053578#comment-17053578 ] Kevin Gallardo commented on CASSANDRA-15543:

I have been looking into this issue and trying to reproduce it locally. I did not have much luck: I re-executed the test about 2500 times (letting it run overnight) and only managed to reproduce something inconsistent in one case.

The test setup is: execute a schema alteration request on one node, prevent that node from propagating the schema change to the other nodes, then execute a SELECT at consistency level ALL on that same node and expect a failure, because the two other nodes did not receive the schema alteration. The message we expect from the SELECT request is the following:

{code:java}
Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}

Node1's response is successful (because it has the schema changes), but the two others fail because they have a stale schema. The closest I got to reproducing something inconsistent with this was:

{code:java}
Operation failed - received 0 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}

But let's also consider the variants Brandon had observed. I have looked into some of the places where {{ReadFailureException}}s are generated, and landed in the [{{ReadCallback}}|https://github.com/apache/cassandra/blob/aed15bfb01e753f9bac8b149b382c7a7c8d33183/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L162] class. The logic of the code is:
- call {{awaitResults()}}
- in {{awaitResults()}}:
-- block on {{await()}}
-- when {{await()}} exits, check what responses we got and whether errors came back

I also looked into the {{onFailure()}} method; its logic is the following:

{code:java}
int n = waitingFor(from) ? failuresUpdater.incrementAndGet(this) : failures;
failureReasonByEndpoint.put(from, failureReason);
if (blockFor + n > replicaPlan().contacts().size())
{
    condition.signalAll(); // this signals as soon as 1 failure happens
}
{code}

{{waitingFor(from)}} appears to filter out errors coming from non-local nodes; in this test it always returns true, so for simplicity we can assume it does. Likewise, {{blockFor}} and {{replicaPlan().contacts().size()}} are both {{3}} in this test, although I suppose they may differ in other use cases.

As we can see above, as soon as one failure happens the {{condition}} is signaled to unblock waiting callers. When {{awaitResults()}} unblocks, it checks the value of {{failures}} to see how many errors happened and reports them as a {{ReadFailureException}}. The code is designed so that the {{condition}} is freed either when all responses are successful, or when one failure has happened.

My understanding is that, under certain circumstances, we end up returning an error message as soon as one error happens rather than waiting for all responses to come back, so the values of {{failures}} and {{received}} are not guaranteed to be at a certain value; all we know in case of a failure is that {{failures}} will be > 0.

I am unclear on what the expected behavior should be here. From my point of view, the most reasonable logic would be for {{await()}} to exit only once all responses have come back, failed or successful, or once it times out. I would therefore replace the {{SimpleCondition condition}} with a {{CountDownLatch}} initialized to the number of expected responses, call {{countDown()}} in both {{onReceived}} and {{onFailure}}, and have {{await()}} use that latch.

If the behavior of failing as soon as one failure happens is preferred, then the logic can stay as it is, but the test needs to be modified to account for the possibility that only one error and no successful response may be reported in the {{ReadFailureException}}.
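To make the proposal concrete, here is a minimal, standalone sketch of the latch-based approach — wait for every replica response (success or failure) before unblocking, instead of signaling on the first failure. The class and method names ({{AllResponsesCallback}}, etc.) are illustrative only, not the actual {{ReadCallback}} API:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed change: a CountDownLatch initialized
// to the number of expected responses, counted down on both success and
// failure, so await() only unblocks once every replica has answered
// (or the wait times out).
class AllResponsesCallback
{
    private final CountDownLatch latch;
    private final AtomicInteger received = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();

    AllResponsesCallback(int expectedResponses)
    {
        this.latch = new CountDownLatch(expectedResponses);
    }

    void onReceived()
    {
        received.incrementAndGet();
        latch.countDown(); // successful response counts toward completion
    }

    void onFailure()
    {
        failures.incrementAndGet();
        latch.countDown(); // failed response also counts toward completion
    }

    // Returns true if all expected responses arrived before the timeout.
    boolean await(long timeout, TimeUnit unit) throws InterruptedException
    {
        return latch.await(timeout, unit);
    }

    int received() { return received.get(); }
    int failures() { return failures.get(); }
}
```

With this shape, whenever {{await()}} returns true the caller is guaranteed that {{received + failures}} equals the expected response count (3 in this test), so the failure message is deterministic — unlike the current signal-on-first-failure logic, where {{received}} may be 0, 1, or 2 depending on timing.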
[~brandon.williams] [~e.dimitrova] got any insights? > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo reassigned CASSANDRA-15543: -- Assignee: Kevin Gallardo - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org