[jira] [Commented] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080555#comment-17080555 ] Kevin Gallardo commented on CASSANDRA-15712:

Note that it would be better to solve https://issues.apache.org/jira/browse/CASSANDRA-15686 before this issue.

> Introduce MIDRES config in CircleCI
> ---
>
> Key: CASSANDRA-15712
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15712
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest, Test/unit
> Reporter: Kevin Gallardo
> Priority: Normal
>
> From document: [https://gist.github.com/newkek/bb79dccbe7d2f5e41b2a3daac3858fde]
> We have identified that the current HIGHRES configuration seems to require resources that might not bring the best cost efficiency to the build.
> We have also identified several "good compromise" configurations that allow us to get decent performance out of the test suites without going all out on the big config.
> It seems it would be useful for a lot of people to have this available as a {{config.yml.MIDRES}} configuration in the {{.circleci}} folder. This way we do not need to argue about modifying the {{HIGHRES}} configuration, so as to not impact the people already using it, but can still have easy access to the "compromise" config.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080556#comment-17080556 ] Kevin Gallardo commented on CASSANDRA-15712:

[~dcapwell] FYI, if you have any feedback.
[jira] [Updated] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15712:
---
Description:
From document: [https://gist.github.com/newkek/bb79dccbe7d2f5e41b2a3daac3858fde]
We have identified that the current HIGHRES configuration seems to require resources that might not bring the best cost efficiency to the build.
We have also identified several "good compromise" configurations that allow us to get decent performance out of the test suites without going all out on the big config.
It seems it would be useful for a lot of people to have this available as a {{config.yml.MIDRES}} configuration in the {{.circleci}} folder. This way we do not need to argue about modifying the {{HIGHRES}} configuration, so as to not impact the people already using it, but can still have easy access to the "compromise" config.
[jira] [Updated] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
[ https://issues.apache.org/jira/browse/CASSANDRA-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15712:
---
Workflow: Cassandra Default Workflow (was: Cassandra Bug Workflow)
Issue Type: Improvement (was: Bug)
[jira] [Created] (CASSANDRA-15712) Introduce MIDRES config in CircleCI
Kevin Gallardo created CASSANDRA-15712:
---
Summary: Introduce MIDRES config in CircleCI
Key: CASSANDRA-15712
URL: https://issues.apache.org/jira/browse/CASSANDRA-15712
Project: Cassandra
Issue Type: Bug
Components: Test/dtest, Test/unit
Reporter: Kevin Gallardo
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080546#comment-17080546 ] Kevin Gallardo commented on CASSANDRA-15686:

{quote}and saw more failures than normal (normally ~1-3 failures, had 9; most known flaky tests)
{quote}
Hm, that's a bummer. I think I saw similar failures when testing too, but did not investigate further and thought they were just flaky. Do you think these could be fixed, or put in another bucket that runs in a separate non-parallel suite?

For reference, these are the tests I ran a little while ago (about 2 weeks) for the *unit* tests only, parallelizing the tests on the current LOWRES config:
- [Run with p=4 and i=medium with no change to the current build (which means it uses 1 runner by default)|https://app.circleci.com/pipelines/github/newkek/cassandra/4/workflows/16897387-cbf0-49e8-82bc-327d75885256]
- [Same hardware, but hardcoding {{-Dtest.runners=2}}|https://app.circleci.com/pipelines/github/newkek/cassandra/5/workflows/56b341b5-8f74-414e-852e-52cba8acd28a]

You can see the runtime goes from 19 min to 12 min.
I am looking at old experiments I ran, and it seems I was able to run with 8 runners on the unit tests; everything reported looks like failures from flaky tests: [run|https://app.circleci.com/pipelines/github/newkek/cassandra/22/workflows/68bf0582-2b43-4b71-a610-e1e76c6ff418/jobs/90] and [corresponding config|https://github.com/newkek/cassandra/blob/00103c51d3c6b1fd0c534b649e146bbe45bcdf11/.circleci/config.yml#L975]

> Improvements in circle CI default config
>
> Key: CASSANDRA-15686
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15686
> Project: Cassandra
> Issue Type: Bug
> Components: Build
> Reporter: Kevin Gallardo
> Priority: Normal
>
> I have looked at and played around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml]; a few comments/questions on the following topics:
> * Python dtests do not run successfully (200-300 failures) on {{medium}} instances; they seem to run with only small flaky failures on {{large}} instances or higher
> * Python upgrade tests:
> ** Do not seem to run without many failures on any instance type / any parallelism setting
> ** Do not seem to parallelize well; it seems each container downloads multiple C* versions
> ** Additionally, the configuration seems out of date, as we currently get errors because {{JAVA8_HOME}} is not set
> * Unit tests do not seem to parallelize optimally: the number of test runners does not reflect the available CPUs on the container. Ideally, if # of runners == # of CPUs, build time is improved on any type of instance.
> ** For instance, with the current configuration running on medium instances, the build uses 1 junit test runner while 2 CPUs are available. Using 2 runners reduces the build time from 19 min (at the current main config of parallelism=4) to 12 min.
> * There are some typos in the file: some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386])
>
> Some ways to process these would be:
> * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully?
> * Does anybody ever run the Python upgrade tests on CircleCI, and what is the configuration that makes them work?
> * Would it make sense to either hardcode the number of test runners for the unit tests with {{-Dtest.runners}} in the config file to reflect the number of CPUs on the instances, or change the build so that it can detect the appropriate number of cores available automatically?
>
> Additionally, it seems this default config file (config.yml) is not as well maintained as the [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] (+ its lowres/highres versions) in the same folder (from CASSANDRA-14806). What is the reasoning for maintaining these 2 versions of the build? Could the better-maintained version be used as the default? We could generate a lowres version of the new config-2_1.yml and rename it {{config.yml}} so that it gets picked up by CircleCI automatically instead of the current default.
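The runner-count suggestion above can be sketched in a few lines. This is illustrative only: the {{RunnerCount}} class is hypothetical, and {{test.runners}} is simply the property mentioned in the discussion, not an API this snippet knows anything about.

```java
// Hypothetical sketch: derive the junit runner count from the CPUs actually
// available on the container, instead of hardcoding -Dtest.runners.
public class RunnerCount {
    public static int testRunners() {
        // On a CircleCI "medium" container this would return 2, so two
        // runners would be used instead of the single default runner.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // Print the flag that a build script could append to the ant invocation.
        System.out.println("-Dtest.runners=" + testRunners());
    }
}
```

A build step could then pass that flag to the test target, matching the {{-Dtest.runners=2}} experiment mentioned earlier, without hardcoding a value per instance size.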
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078645#comment-17078645 ] Kevin Gallardo commented on CASSANDRA-15642:

I see; the differentiation "waiting until we can guarantee we can never succeed" makes more sense presented this way. Although when testing for CASSANDRA-15543, {{blockFor()}} and {{cassandraReplicaCount()}} were both the same, which means we would still fail as soon as n >= 1, bringing it back to the same conclusion.

bq. if you want to file a ticket for it I think I can make us both happy: we should always fail a query as soon as we know it cannot succeed [...]

I think we still miss out on potential information that could be returned to the user and improve usability, but I have presented my arguments already, so I won't keep insisting. The case of the schema agreement error is still, to me, a clear situation where things could be improved. But if anything, I would hope there were a place where this sort of behavior is documented and explained, rather than users having to discover it by themselves in unfortunate circumstances, or having to go through the code.

Also, the "speculative read" (or, put more simply, a "retry", iiuc?) seems like a good idea from my POV. It would be an improvement, though I'm thinking the drivers already provide this sort of utility, well customizable by users, compared to a server-side solution, so I suppose it has upsides and downsides. But something that makes completing a request more robust seems like a good idea regardless.
> Inconsistent failure messages on distributed queries
>
> Key: CASSANDRA-15642
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15642
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Coordination
> Reporter: Kevin Gallardo
> Priority: Normal
>
> As a follow-up to some exploration I did for CASSANDRA-15543, I noticed the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}:
> - await responses
> - when the required number of responses has come back: unblock the wait
> - when a single failure happens: unblock the wait
> - when unblocked, check whether the counter of failures is > 1 and, if so, return an error message based on the {{failures}} map that has been filled
> Error messages that can result from this behavior are a ReadTimeout, a ReadFailure, a WriteTimeout, or a WriteFailure.
> In case of a Write/ReadFailure, the user gets back an error looking like the following:
> "Failure: Received X responses, and Y failures"
> (if the behavior I describe is incorrect, please correct me)
> This causes a usability problem. Since the handler fails and throws an exception as soon as 1 failure happens, the error message returned to the user may not be accurate.
> (note: I am not entirely sure of the behavior in case of timeouts for now)
> For example, for a request at CL = QUORUM = 3, a failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say
> "Failure: Received 0 responses, and 1 failure at CL = 3"
> Which:
> 1. doesn't make a lot of sense, because the CL doesn't match the number of results in the message, so you end up wondering "what happened with the rest of the required CL?"
> 2. is incorrect. We did receive a successful response; it only came after the initial failure.
> From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. The only information users should extract from it is that at least 1 node has failed.
> For a big improvement in usability, {{ReadCallback}} and {{AbstractWriteResponseHandler}} could instead wait for all responses to come back before unblocking the wait, or let it time out. This way, the users will be able to place some trust in the information returned to them.
> Additionally, an error that happens first prevents a timeout from happening, because the request fails immediately, and so it potentially hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we would also be able to tell whether failures happened *before* that timeout, and have a more complete diagnostic where currently you can't detect both errors at the same time.
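The fail-fast behavior described above, and the proposed wait-for-all alternative, can be contrasted with a small sketch. This is not Cassandra's actual {{ReadCallback}} code; the class, fields, and the single-latch design are illustrative assumptions, and timeout handling is omitted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not the real ReadCallback): compares the current
// fail-fast unblocking with the proposed "wait for every replica" behavior.
class ResponseTracker {
    final int blockFor;        // responses required by the consistency level
    final int totalReplicas;   // replicas the request was sent to
    final boolean failFast;    // true = current behavior, false = proposed
    final AtomicInteger successes = new AtomicInteger();
    final AtomicInteger failures = new AtomicInteger();
    final CountDownLatch done = new CountDownLatch(1);

    ResponseTracker(int blockFor, int totalReplicas, boolean failFast) {
        this.blockFor = blockFor;
        this.totalReplicas = totalReplicas;
        this.failFast = failFast;
    }

    void onSuccess() { successes.incrementAndGet(); maybeUnblock(); }
    void onFailure() { failures.incrementAndGet(); maybeUnblock(); }

    private void maybeUnblock() {
        int s = successes.get(), f = failures.get();
        if (s >= blockFor)
            done.countDown();                 // consistency level met
        else if (failFast && f > 0)
            done.countDown();                 // current: unblock on first failure
        else if (s + f == totalReplicas)
            done.countDown();                 // proposed: heard from everyone
    }

    String summary() {
        return "Received " + successes.get() + " responses and "
             + failures.get() + " failures (blockFor=" + blockFor + ")";
    }
}
```

With blockFor=2 over 3 replicas and a failure arriving first, the fail-fast tracker unblocks immediately and reports "Received 0 responses and 1 failures", while the wait-for-all tracker stays blocked until the remaining replicas answer (or a timeout fires, which this sketch omits), so its summary can reflect every replica's outcome.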
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078372#comment-17078372 ] Kevin Gallardo commented on CASSANDRA-15686:

bq. Unit tests should be able to complete on Circle CI with medium instances. The unit tests are run automatically, so that's the default.

JVM dtests also run by default FWIW; they are able to complete on medium instances, as opposed to the Python dtests.

bq. Pretty much all others won't due to limited resources. You shouldn't try running them, but if you do, then we can't guarantee they may complete unless high resource settings are used.

Right, that was my understanding initially. Those tests are available in the lowres config, though, and there is no guidance or document saying they are unreliable on the low config, so people get confused as to why the Python dtests, for example, run with 200-300 failures when using the default config. I was suggesting here to either bump up the default config so that they can be run successfully, or remove them from the config since they can't be run (and only keep them in the HIGHRES file?), or maybe document this better. I understood David recommended fixing the tests so they run on medium instances too; that sounds reasonable to me as well if it makes things less confusing in the long run.
Thanks for the input.
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077652#comment-17077652 ] Kevin Gallardo commented on CASSANDRA-15642:

Hm, maybe I am not being clear on the proposition. Currently the code waits for {{blockFor}} responses, unless there's a failure. The proposed behavior in case of successes doesn't change: you would wait for the CL required, and if there are only successes, it's a success. If there are failures and successes, it's a failure, like now, except it would provide the complete list of who failed and who succeeded. Right now the server doesn't wait to learn which requests fail and which succeed; it fails fast on at least 1 failure, and if the failure happens first, the rest is ignored.

bq. Is that an argument, or a belief?

Well, it was probably both, now that you point it out.

bq. I'm personally not sure how I would utilise this extra information. [..]

How the user processes that information is up to the user. There are plenty of things you can do in case of failures and timeouts, as you mention, but I think that is outside the scope of this conversation. The point is that they would get more information with the proposed changes, so they can investigate with more information. Indeed, one downside is a delayed answer in one scenario: when there are failures and 1 or more nodes don't respond at all.
I can't really tell how frequently this would happen, or how impactful it is to delay the response by some seconds in that very specific scenario, but it seems to me that the benefit of providing more information outweighs that special case's downside 🤷. I agree it's a change in behavior; introducing it as optional makes sense too, then, if it is too impactful.
[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077518#comment-17077518 ] Kevin Gallardo edited comment on CASSANDRA-15642 at 4/7/20, 8:52 PM:

{quote}So what do you propose? Some questions to consider
{quote}
I propose waiting for {{blockFor}} (with a {{CountDownLatch}}, for example, in {{ReadCallback}}, as detailed on CASSANDRA-15543) until the timeout, if it happens. In case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ failed, and _these_ haven't responded before the timeout." If all required responses have come back but some failed, detail who failed and who succeeded. This provides better usability than the current incomplete error messages.

{quote}How long until you're sure you've received all the responses you might ever receive?
{quote}
The trap! Theoretically you can't know; it's an asynchronous distributed system. I am well aware of things like the FLP impossibility result. More seriously, what you can say is "we haven't received anything from X within the given timeframe, which might indicate an issue". This is my personal POV, but I think it's valuable to also provide that information for ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case where some sub-requests fail AND some don't respond at all.

A good example of the usability improvement is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement, they would get an error message like "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y, doesn't have the required schema". Usability would improve vastly if you were able to give the complete view of the problem.
ISTM that in this scenario this information can in most cases be returned without a timeout (that cannot be strongly guaranteed, because again, distributed async system, but usability would be improved in a larger number of situations).

{quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures?
{quote}
I would argue that the user wouldn't be able to act optimally anyway with only partial information. In the case of "failures + timeouts", there would be more value in them knowing that there are also timeouts with some nodes, instead of just the failure.

Thanks for the discussion
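The "cohesive message" proposed in this comment could look something like the following sketch. The {{ReplicaReport}} class and its {{Outcome}} names are hypothetical, not an existing Cassandra API; the point is only that per-host outcomes, including silence, can be bucketed into one readable error.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed error message: once the coordinator
// has waited for every replica or timed out, report all three buckets.
class ReplicaReport {
    enum Outcome { SUCCEEDED, FAILED, NO_RESPONSE }

    static String describe(Map<String, Outcome> outcomes) {
        // Group host names by what happened to their sub-request.
        Map<Outcome, List<String>> byOutcome = new EnumMap<>(Outcome.class);
        outcomes.forEach((host, o) ->
            byOutcome.computeIfAbsent(o, k -> new ArrayList<>()).add(host));
        return "Succeeded: " + byOutcome.getOrDefault(Outcome.SUCCEEDED, List.of())
             + "; failed: " + byOutcome.getOrDefault(Outcome.FAILED, List.of())
             + "; no response before timeout: "
             + byOutcome.getOrDefault(Outcome.NO_RESPONSE, List.of());
    }
}
```

For the schema-disagreement case above, this shape of message would let the error say which hosts had the correct schema and which did not, instead of reporting only the first failure observed.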
[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077518#comment-17077518 ] Kevin Gallardo edited comment on CASSANDRA-15642 at 4/7/20, 8:37 PM: - {quote}So what do you propose? Some questions to consider {quote} I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in {{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it happens. And in case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ that failed, _these_ haven't responded before timeout." If all required responses have come back but there were failures, detail who failed and who succeeded. That provides better usability than the current incomplete error messages. {quote}How long until you're sure you've received all the responses you might ever receive? {quote} The trap! Theoretically, you can't know, it's an asynchronous distributed system. I am well aware of things like the FLP impossibility. More seriously, what you can say is "we haven't received anything from X within the given timeframe, this might indicate an issue". Now, this is my personal POV, but I think it's valuable to also provide that information for Ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case of [some sub-requests fail AND some don't respond at all]. A good example of usability improvement I can mention is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement they would get in the error message "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y don't have the required schema". Usability would be vastly improved if you were able to give the complete view of the problem. 
ISTM that in this scenario, this information can in most cases be returned without timeout. That cannot be strongly guaranteed, because, again, it's a distributed async system, but usability would be improved in a larger number of situations. {quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures? {quote} I would argue that the user wouldn't be able to act optimally anyway with only partial information. In the case of "failures + timeouts" there would be more value in them knowing that there are also timeouts on some nodes, instead of just the failure. Thanks for the discussion was (Author: newkek): {quote}So what do you propose? Some questions to consider {quote} I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in {{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it happens. And in case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ that failed, _these_ haven't responded before timeout." If there is no timeout and all required responses have come back but there were failures, explain who failed and who succeeded. That provides better usability than the current incomplete error messages. {quote}How long until you're sure you've received all the responses you might ever receive? {quote} The trap! Theoretically, you can't know, it's an asynchronous distributed system. I am well aware of things like the FLP impossibility. More seriously, what you can say is "we haven't received anything from X within the given timeframe, this might indicate an issue". Now, this is my personal POV, but I think it's valuable to also provide that information for Ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case of [some sub-requests fail AND some don't respond at all]. 
A good example of usability improvement I can mention is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement they would get in the error message "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y don't have the required schema". Usability would be vastly improved if you were able to give the complete view of the problem. ISTM that in this scenario, this information can in most cases be returned without timeout. That cannot be strongly guaranteed, because, again, it's a distributed async system, but usability would be improved in a larger number of situations. {quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures? {quote} I would argue that the user wouldn't be able to act optimally anyway with only partial
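The waiting behavior proposed above (waiting for {{blockFor}} responses with a {{CountDownLatch}}, then reporting a cohesive per-host summary) could be sketched roughly as follows. This is a hypothetical sketch with made-up names, not Cassandra's actual {{ReadCallback}} code; the key idea is that failed responses also count toward the latch, so a failure does not short-circuit the wait.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposal: count both successes and failures
// toward blockFor, and classify every contacted host as succeeded,
// failed, or not-responded when the wait ends.
class ResponseTracker {
    enum Outcome { SUCCESS, FAILURE }

    private final Map<String, Outcome> outcomes = new ConcurrentHashMap<>();
    private final CountDownLatch latch;

    ResponseTracker(int blockFor) {
        this.latch = new CountDownLatch(blockFor);
    }

    void onResponse(String host, Outcome outcome) {
        // Failures also count down: we keep waiting for the remaining
        // responses instead of failing fast on the first error.
        if (outcomes.putIfAbsent(host, outcome) == null)
            latch.countDown();
    }

    // Builds a cohesive summary: hosts we contacted but never heard from
    // are reported as NO_RESPONSE (i.e. they timed out).
    String await(Iterable<String> contacted, long timeout, TimeUnit unit) {
        boolean completed;
        try {
            completed = latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            completed = false;
        }
        StringBuilder sb = new StringBuilder(completed ? "completed:" : "timed out:");
        for (String host : contacted) {
            Outcome o = outcomes.get(host);
            sb.append(' ').append(host).append('=')
              .append(o == null ? "NO_RESPONSE" : o.name());
        }
        return sb.toString();
    }
}
```

With blockFor=2, a success from host a, a failure from host b, and silence from host c would yield "completed: a=SUCCESS b=FAILURE c=NO_RESPONSE", which is the kind of cohesive message described above.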
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077529#comment-17077529 ] Kevin Gallardo commented on CASSANDRA-15686: bq. I was told that if you have a build that requests > medium on a free tier, CircleCI would downgrade to medium automatically So that was incorrect. Instead you get an error saying: _This job has been blocked because resource-class:xlarge is not available on your plan. Please upgrade to continue building._ So it doesn't build, but as mentioned earlier, the dtests don't build successfully either on medium, so I would think it wouldn't make much difference for free tier users to bump to large or more for the dtests, and it would allow people with {{large}} instances access to build without having to switch their config. Does that make sense? > Improvements in circle CI default config > > > Key: CASSANDRA-15686 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15686 > Project: Cassandra > Issue Type: Bug > Components: Build >Reporter: Kevin Gallardo >Priority: Normal > > I have been looking at and played around with the [default CircleCI > config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml], > a few comments/questions regarding the following topics: > * Python dtests do not run successfully (200-300 failures) on {{medium}} > instances, they seem to only run with small flaky failures on {{large}} > instances or higher > * Python Upgrade tests: > ** Do not seem to run without many failures on any instance types / any > parallelism setting > ** Do not seem to parallelize well, it seems each container is going to > download multiple C* versions > ** Additionally it seems the configuration is not up to date, as currently > we get errors because {{JAVA8_HOME}} is not set > * Unit tests do not seem to parallelize optimally, number of test runners do > not reflect the available CPUs on the container. 
Ideally if # of runners == # > of CPUs, build time is improved, on any type of instances. > ** For instance when using the current configuration, running on medium > instances, build will use 1 junit test runner, but 2 CPUs are available. If > using 2 runners, the build time is reduced from 19min (at the current main > config of parallelism=4) to 12min. > * There are some typos in the file, some dtests say "Run Unit Tests" but > they are JVM dtests (see > [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], > > [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) > So some ways to process these would be: > * Do the Python dtests run successfully for anyone on {{medium}} instances? > If not, would it make sense to bump them to {{large}} so that they can be run > successfully? > * Does anybody ever run the python upgrade tests on CircleCI and what is the > configuration that makes it work? > * Would it make sense to either hardcode the number of test runners in the > unit tests with `-Dtest.runners` in the config file to reflect the number of > CPUs on the instances, or change the build so that it is able to detect the > appropriate number of cores available automatically? > Additionally, it seems this default config file (config.yml) is not as well > maintained as the > [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] > (+its lowres/highres) version in the same folder (from CASSANDRA-14806). > What is the reasoning for maintaining these 2 versions of the build? Could > the better maintained version be used as the default? We could generate a > lowres version of the new config-2_1.yml, and rename it {{config.yml}} so > that it gets picked up by CircleCI automatically instead of the current > default. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077518#comment-17077518 ] Kevin Gallardo commented on CASSANDRA-15642: {quote}So what do you propose? Some questions to consider {quote} I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in {{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it happens. And in case of a timeout, provide a cohesive message saying: "_These_ are the hosts that responded successfully, _these_ that failed, _these_ haven't responded before timeout." If there is no timeout and all required responses have come back but there were failures, explain who failed and who succeeded. That provides better usability than the current incomplete error messages. {quote}How long until you're sure you've received all the responses you might ever receive? {quote} The trap! Theoretically, you can't know, it's an asynchronous distributed system. I am well aware of things like the FLP impossibility. More seriously, what you can say is "we haven't received anything from X within the given timeframe, this might indicate an issue". Now, this is my personal POV, but I think it's valuable to also provide that information for Ops to have a look into the nodes that haven't responded in time. Keep in mind that this is a difference in behavior only in the case of [some sub-requests fail AND some don't respond at all]. A good example of usability improvement I can mention is the case I saw on CASSANDRA-15543. The test author initially assumed that for the read with the schema disagreement they would get in the error message "hosts x and y don't have the correct schema, but host z does". But you can't expect that from the error at the moment; all you can be sure to get is "x, or y don't have the required schema". Usability would be vastly improved if you were able to give the complete view of the problem. 
ISTM that in this scenario, this information can in most cases be returned without timeout. That cannot be strongly guaranteed, because, again, it's a distributed async system, but usability would be improved in a larger number of situations. {quote}How would you balance the delayed responses with user requirements to take corrective action promptly in response to failures? {quote} I would argue that the user wouldn't be able to act optimally anyway with only partial information. In the case of "failures + timeouts" there would be more value in them knowing that there are also timeouts on some nodes, instead of just the failure. Thanks for the discussion > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. 
> (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. Only information > users should extract out of it is that at least 1 node has failed. > For a
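The race described above, where the fail-fast error message snapshots the counters before slower replicas reply, can be illustrated with a toy sketch (a hypothetical class for illustration only, not the actual handler code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of the fail-fast behavior described above: the error
// message is built the moment the first failure arrives, so responses
// that come back afterwards are not reflected in it.
class FailFastSnapshot {
    private final AtomicInteger successes = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();

    void onSuccess() { successes.incrementAndGet(); }

    // Fail fast: snapshot the counters immediately and build the message,
    // even though more responses may still be in flight.
    String onFailure() {
        failures.incrementAndGet();
        return "Failure: Received " + successes.get()
             + " responses, and " + failures.get() + " failures";
    }
}
```

If the failure lands first at CL=3, the message reads "Received 0 responses, and 1 failures" even though a success may arrive moments later, which is exactly the misleading case in the ticket description.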
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076590#comment-17076590 ] Kevin Gallardo commented on CASSANDRA-15686: bq. I am only speculating but I thought there were tests that spin up network and there are many tests which share disk, and I don't know if we isolate the paths or not. If this is not true then it should be a simple change to bump the number of runners for unit tests (definitely not jvm dtests). Hm, AFAICT running with multiple runners is fine for the unit tests; I have tested different configurations and it didn't seem to cause problems as long as # runners <= # CPUs
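The runner-count point above (keep # runners <= # CPUs, ideally detected automatically rather than hardcoded via `-Dtest.runners`) could be sketched like this; the helper name is hypothetical and not part of the actual build:

```java
// Hypothetical helper sketching automatic detection of the junit runner
// count from the CPUs visible to the JVM, instead of hardcoding
// -Dtest.runners in the config file.
class RunnerCount {
    static int forAvailableCpus() {
        // Per the discussion above, keep # runners <= # CPUs: fewer
        // runners than CPUs leaves cores idle, and more did not help.
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }
}
```

One caveat: inside a container, {{availableProcessors()}} only reflects the container's CPU limit on JVMs with container support (JDK 10+, or 8u191+); older JVMs may report the host's core count instead.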
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076459#comment-17076459 ] Kevin Gallardo commented on CASSANDRA-15686: Confirmed that the build runs as normal using {{config-2_1.yml}} directly, renamed as {{config.yml}}; see build #42 here: https://app.circleci.com/pipelines/github/newkek/cassandra?branch=chg
[jira] [Commented] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076386#comment-17076386 ] Kevin Gallardo commented on CASSANDRA-15686: bq. I was under the impression the larger instances were not in the free tier; Correct, it is not. I was told that if you have a build that requests > medium on a free tier, CircleCI would downgrade to medium automatically, but I have not confirmed that, need to double check. I have seen [other OSS projects|https://github.com/envoyproxy/envoy/blob/master/.circleci/config.yml#L9] using {{xlarge}} resources on their default build config. The problem is that the tests do not run well at all on medium: contributors see the python dtest build available, launch it, and then get confused by the resulting 200-300+ failures. If these don't build successfully at all by default, I am wondering if it is worth keeping them in the "lowres" build at all, given the confusion it causes? bq. Sorry I don't follow. [...] So the only file we should be maintaining is config-2_1.yml. As we discussed offline, instead of using the generated version of the {{config-2_1.yml}} file, we could use {{config-2_1.yml}} directly by default. This would cause less confusion for contributors too; we could also keep a HIGHRES version, as discussed, so that it doesn't change people's workflows. 
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076374#comment-17076374 ] Kevin Gallardo commented on CASSANDRA-15642: bq. What is your definition of complete/reliable? To me it would mean that the user gets a complete view of the states of the other requests necessary to complete a request at a certain CL. Instead, in the current state of things: * say for CL=3 * first response that comes back is a failure * then later the 2 other responses are successful * the error message may say "1 failure, 0 successful response" The "0 successful response" cannot be trusted because some successful responses actually came back, but after the failure. And the "1 failure" cannot be trusted either, because there may have been more failures that would not be reported because of the current fail-fast behavior. The alternate solution you mention to save the state first doesn't provide a complete view of the situation either, as far as I understand. 
The state is saved and things are less inconsistent, but the errors returned to the user may still be misleading as explained above > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. 
We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. The only information > users should extract out of it is that at least 1 node has failed. > For a big improvement in usability, the {{ReadCallback}} and > {{AbstractWriteResponseHandler}} could instead wait for all responses to come > back before unblocking the wait, or let it timeout. This way, the users > will be able to have some trust around the information returned to them. > Additionally, an error that happens first prevents a timeout from happening > because it fails immediately, and so potentially it hides problems with other > replicas. If we were to wait for all responses, we might get a timeout; in > that case we'd also be able to tell whether failures have happened *before* > that timeout, and have a more complete diagnostic where currently you can't detect both > errors at the same time.
[jira] [Comment Edited] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074827#comment-17074827 ] Kevin Gallardo edited comment on CASSANDRA-15642 at 4/3/20, 7:10 PM: - [~benedict] thanks for having a look at the ticket! bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information I would argue that right now a user wouldn't even be aware that the information is not reliable, as it's only by digging into the server code that I realized that this is happening. The error message tells you "received X errors and X responses" and there is no indication externally to the user/client that the information returned is not complete/reliable. Additionally, given these findings I don't know how the client would be able to properly action on the information, quickly or not, if the info is not reliable? I am not sure that giving partial info quickly is better than cohesive info bq. It is possible to simply report consistent information I am not sure how that would be possible without waiting for all the responses to come back or time out, but happy to be explained if I am missing something was (Author: newkek): [~benedict] thanks for having a look at the ticket! bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information I would argue that right now a user wouldn't even be aware that the information is not reliable, as it's only by digging into the server code that I realized that this is happening. The error message tells you "received X errors and X responses" and there is no indication externally to the user/client that the information returned is not reliable. Additionally, given these findings I don't know how the client would be able to properly action on the information, quickly or not, if the info is not reliable? 
I am not sure that giving partial info quickly is better than cohesive info bq. It is possible to simply report consistent information I am not sure how that would be possible without waiting all the responses come back or timeout, but happy to be explained if I am missing something > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. 
doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. The only information > users should extract from it is that at least 1 node has failed. > For a big improvement in usability, the {{ReadCallback}} and > {{AbstractWriteResponseHandler}} could instead wait for all responses to come > back before unblocking the wait, or let it time out. This way, the users > will be able to have some trust in the information returned to them. > Additionally, an error that happens first prevents a timeout from happening > because
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074827#comment-17074827 ] Kevin Gallardo commented on CASSANDRA-15642: [~benedict] thanks for having a look at the ticket! bq. There is no real significant increase in information by waiting, and it delays how quickly a client may action this information I would argue that right now a user wouldn't even be aware that the information is not reliable, as it's only by digging into the server code that I realized this is happening. The error message tells you "received X errors and X responses" and there is no external indication to the user/client that the returned information is not reliable. Additionally, given these findings, I don't know how the client would be able to properly act on the information, quickly or not, if the info is not reliable. I am not sure that giving partial info quickly is better than cohesive info. bq. It is possible to simply report consistent information I am not sure how that would be possible without waiting for all the responses to come back or time out, but happy to be corrected if I am missing something. > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > 
ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > For example, say a request at CL = QUORUM = 3, a failed request may complete > first, then a successful one completes, and another fails. If the exception > is thrown fast enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which: > 1. doesn't make a lot of sense because the CL doesn't match the number of > results in the message, so you end up thinking "what happened with the rest > of the required CL?" > 2. the information is incorrect. We did receive a successful response, only > it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. The only information > users should extract from it is that at least 1 node has failed. > For a big improvement in usability, the {{ReadCallback}} and > {{AbstractWriteResponseHandler}} could instead wait for all responses to come > back before unblocking the wait, or let it time out. This way, the users > will be able to have some trust in the information returned to them. > Additionally, an error that happens first prevents a timeout from happening > because it fails immediately, and so potentially it hides problems with other > replicas. 
If we were to wait for all responses, we might get a timeout; in > that case we'd also be able to tell whether failures have happened *before* > that timeout, and have a more complete diagnostic, whereas currently you can't detect both > errors at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
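The race described in this ticket can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual {{ReadCallback}} code: the waiter is released either when enough responses arrive or on the very first failure, so counters read after release can miss replies still in flight.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplification of the unblock-on-first-failure logic.
class ResponseHandlerSketch {
    private final int required;                        // e.g. QUORUM = 3
    private final AtomicInteger responses = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private final Object condition = new Object();
    private boolean signaled;

    ResponseHandlerSketch(int required) { this.required = required; }

    void onResponse() {
        if (responses.incrementAndGet() >= required)
            signal();                                  // all required replies arrived
    }

    void onFailure() {
        failures.incrementAndGet();
        signal();                                      // a single failure unblocks the wait
    }

    private void signal() {
        synchronized (condition) { signaled = true; condition.notifyAll(); }
    }

    /** Blocks until signaled; returns an error string if any failure was seen. */
    String await() {
        synchronized (condition) {
            while (!signaled) {
                try { condition.wait(); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); return null; }
            }
        }
        // Counters are read at unblock time: a success arriving just after the
        // first failure is not reflected in the message.
        if (failures.get() > 0)
            return "Failure: Received " + responses.get() + " responses, and "
                   + failures.get() + " failures at CL = " + required;
        return null;
    }
}
```

With {{required = 3}}, a failure arriving first yields "Received 0 responses, and 1 failures" even if successful replies land a moment later, which is exactly the inconsistency described above.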
[jira] [Updated] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15642: --- Description: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when all required number of responses have come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate. (note: I am not entirely sure of the behavior in case of timeouts for now) For example, say a request at CL = QUORUM = 3, a failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which: 1. doesn't make a lot of sense because the CL doesn't match the number of results in the message, so you end up thinking "what happened with the rest of the required CL?" 2. the information is incorrect. We did receive a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. The only information users should extract from it is that at least 1 node has failed. 
For a big improvement in usability, the {{ReadCallback}} and {{AbstractWriteResponseHandler}} could instead wait for all responses to come back before unblocking the wait, or let it time out. This way, the users will be able to have some trust in the information returned to them. Additionally, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we'd also be able to tell whether failures have happened *before* that timeout, and have a more complete diagnostic, whereas currently you can't detect both errors at the same time. was: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when all required number of responses have come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate. (note: I am not entirely sure of the behavior in case of timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. 
doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful. I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it timeout. This is way, the users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents from a timeout to happen because it fails immediately, and so potentially it hides problems with other
[jira] [Updated] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15685: --- Resolution: Duplicate Status: Resolved (was: Open) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074648#comment-17074648 ] Kevin Gallardo commented on CASSANDRA-15685: Agreed > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074077#comment-17074077 ] Kevin Gallardo commented on CASSANDRA-15685: Hey [~dcapwell] afaict I have only seen this test failing on this run, I have not seen it fail on other more recent builds happening after CASSANDRA-15650 indeed. Opened this ticket since I couldn't find a report of this failure. I'm ok closing it if it's been fixed in CASSANDRA-15650 > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074062#comment-17074062 ] Kevin Gallardo commented on CASSANDRA-15631: Awesome, thank you! So for future reference, as mentioned on the proposal thread, the first test suite using AssertJ, to use as an example is now located at [{{org.apache.cassandra.repair.RepairJobTest}}|https://github.com/apache/cassandra/blob/6d343c59f275b1165be08e2256be878ac9b95fec/test/unit/org/apache/cassandra/repair/RepairJobTest.java#L73] > Add AssertJ test dependency > --- > > Key: CASSANDRA-15631 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15631 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest, Test/unit >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > See > [proposal|https://lists.apache.org/thread.html/rc562ec47578d0ae6f346ba9e3d7469c1cd3f8b521a72ddcb2accc47b%40%3Cdev.cassandra.apache.org%3E]. > The goal is to add [AssertJ|https://assertj.github.io/doc/] to the test > framework to allow for more comprehensible and easier to write tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15686) Improvements in circle CI default config
[ https://issues.apache.org/jira/browse/CASSANDRA-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15686: --- Description: I have been looking at and playing around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml]; a few comments/questions regarding the following topics: * Python dtests do not run successfully (200-300 failures) on {{medium}} instances; they seem to run with only small flaky failures on {{large}} instances or higher * Python Upgrade tests: ** Do not seem to run without many failures on any instance type / any parallelism setting ** Do not seem to parallelize well; it seems each container is going to download multiple C* versions ** Additionally it seems the configuration is not up to date, as currently we get errors because {{JAVA8_HOME}} is not set * Unit tests do not seem to parallelize optimally; the number of test runners does not reflect the available CPUs on the container. Ideally if # of runners == # of CPUs, build time is improved, on any type of instance. ** For instance when using the current configuration, running on medium instances, the build will use 1 junit test runner, but 2 CPUs are available. If using 2 runners, the build time is reduced from 19min (at the current main config of parallelism=4) to 12min. * There are some typos in the file: some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) So some ways to process these would be: * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully? * Does anybody ever run the python upgrade tests on CircleCI and what is the configuration that makes it work? 
* Would it make sense to either hardcode the number of test runners in the unit tests with `-Dtest.runners` in the config file to reflect the number of CPUs on the instances, or change the build so that it is able to detect the appropriate number of cores available automatically? Additionally, it seems this default config file (config.yml) is not as well maintained as the [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] (+its lowres/highres) version in the same folder (from CASSANDRA-14806). What is the reasoning for maintaining these 2 versions of the build? Could the better-maintained version be used as the default? We could generate a lowres version of the new config-2_1.yml, and rename it {{config.yml}} so that it gets picked up by CircleCI automatically instead of the current default. was: I have been looking at and played around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml], a few comments/questions regarding the following topics: * Python dtests do not run successfully (200-300 failures) on {{medium}} instances, they seem to only run with small flaky failures on {{large}} instances or higher * Python Upgrade tests: ** Do not seem to run without many failures on any instance types / any parallelism setting ** Do not seem to parallelize well, it seems each container is going to download multiple C* versions ** Additionally it seems the configuration is not up to date, as currently we get errors because {{JAVA8_HOME}} is not set * Unit tests do not seem to parallelize optimally, number of test runners do not reflect the available CPUs on the container. Ideally if # of runners == # of CPUs, build time is improved, on any type of instances. ** For instance when using the current configuration, running on medium instances, build will use 1 junit test runner, but 2 CPUs are available. 
If using 2 runners, the build time is reduced from 19min (at the current main config of parallelism=4) to 12min. * There are some typos in the file, some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) So some ways to process these would be: * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully? * Does anybody ever run the python upgrade tests on CircleCI and what is the configuration that makes it work? * Would it make sense to either hardcode the number of test runners in the unit tests with `-Dtest.runners` in the config file to reflect the number of CPUs on the instances, or change the build so that it is able to detect the appropriate number of core available automatically? Additionally, it seems this default
[jira] [Created] (CASSANDRA-15686) Improvements in circle CI default config
Kevin Gallardo created CASSANDRA-15686: -- Summary: Improvements in circle CI default config Key: CASSANDRA-15686 URL: https://issues.apache.org/jira/browse/CASSANDRA-15686 Project: Cassandra Issue Type: Bug Reporter: Kevin Gallardo I have been looking at and playing around with the [default CircleCI config|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml]; a few comments/questions regarding the following topics: * Python dtests do not run successfully (200-300 failures) on {{medium}} instances; they seem to run with only small flaky failures on {{large}} instances or higher * Python Upgrade tests: ** Do not seem to run without many failures on any instance type / any parallelism setting ** Do not seem to parallelize well; it seems each container is going to download multiple C* versions ** Additionally it seems the configuration is not up to date, as currently we get errors because {{JAVA8_HOME}} is not set * Unit tests do not seem to parallelize optimally; the number of test runners does not reflect the available CPUs on the container. Ideally if # of runners == # of CPUs, build time is improved, on any type of instance. ** For instance when using the current configuration, running on medium instances, the build will use 1 junit test runner, but 2 CPUs are available. If using 2 runners, the build time is reduced from 19min (at the current main config of parallelism=4) to 12min. * There are some typos in the file: some dtests say "Run Unit Tests" but they are JVM dtests (see [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1077], [here|https://github.com/apache/cassandra/blob/trunk/.circleci/config.yml#L1386]) So some ways to process these would be: * Do the Python dtests run successfully for anyone on {{medium}} instances? If not, would it make sense to bump them to {{large}} so that they can be run successfully? * Does anybody ever run the python upgrade tests on CircleCI and what is the configuration that makes it work? 
* Would it make sense to either hardcode the number of test runners in the unit tests with `-Dtest.runners` in the config file to reflect the number of CPUs on the instances, or change the build so that it is able to detect the appropriate number of cores available automatically? Additionally, it seems this default config file (config.yml) is not as well maintained as the [{{config-2_1.yml}}|https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml] (+its lowres/highres) version in the same folder. What is the reasoning for maintaining these 2 versions of the build? Could the better-maintained version be used as the default? We could generate a lowres version of the new config-2_1.yml, and rename it {{config.yml}} so that it gets picked up by CircleCI automatically instead of the current default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
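The second option above (detecting cores instead of hardcoding `-Dtest.runners`) can be sketched as follows. This is an illustration only, not the actual Cassandra build logic; the class name is hypothetical:

```java
// Illustration: derive the junit runner count from the CPUs visible to the
// container, falling back on an explicit -Dtest.runners when one is given.
public class RunnerCount {
    static int runners() {
        // An explicit -Dtest.runners still wins; otherwise match the CPU count.
        // Note: depending on the JVM version, availableProcessors() may report
        // the host's CPUs rather than the container's cgroup limit.
        Integer explicit = Integer.getInteger("test.runners");
        return explicit != null ? explicit
                                : Runtime.getRuntime().availableProcessors();
    }
}
```

On a CircleCI {{medium}} container this would return 2 rather than the hardcoded 1 runner mentioned above, assuming the JVM sees the container's CPU limit.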
[jira] [Updated] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15685: --- Bug Category: Parent values: Correctness(12982) Complexity: Low Hanging Fruit Discovered By: Unit Test Severity: Low Status: Open (was: Triage Needed) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15685: --- Fix Version/s: 4.0-beta > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
Kevin Gallardo created CASSANDRA-15685: -- Summary: flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest Key: CASSANDRA-15685 URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 Project: Cassandra Issue Type: Bug Components: Test/dtest Reporter: Kevin Gallardo Observed in: https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests Failure: {noformat} testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest junit.framework.AssertionFailedError at org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Test and Documentation Plan: no doc changes Status: Patch Available (was: In Progress) > Fix NPE in SimpleReadWriteTest after test framework behavior change > --- > > Key: CASSANDRA-15683 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15683 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > In JVM dtests, all exceptions thrown in a request execution used to be always > wrapped in a RuntimeException. > After CASSANDRA-15650 changes, the behavior has been changed to: If the > exception is a RuntimeException, just rethrow it, otherwise wrap in > RuntimeException, as you can see in: > [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] > This causes the tests that were always extracting the cause from the wrapping > RuntimeException before, to check the root cause of the error, to throw a NPE > when they call {{getCause()}}, tests such as > {{SimpleReadWriteTest#readWithSchemaDisagreement}} and > {{SimpleReadWriteTest#writeWithSchemaDisagreement}}. > Can be fixed by simply not unwrapping the cause in those tests, use the > thrown exception directly, if the behavior of "not always wrapping in > RuntimeException" is agreed to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073863#comment-17073863 ] Kevin Gallardo commented on CASSANDRA-15683: Changes available at: https://github.com/newkek/cassandra/pull/1 > Fix NPE in SimpleReadWriteTest after test framework behavior change > --- > > Key: CASSANDRA-15683 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15683 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > In JVM dtests, all exceptions thrown in a request execution used to be always > wrapped in a RuntimeException. > After CASSANDRA-15650 changes, the behavior has been changed to: If the > exception is a RuntimeException, just rethrow it, otherwise wrap in > RuntimeException, as you can see in: > [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] > This causes the tests that were always extracting the cause from the wrapping > RuntimeException before, to check the root cause of the error, to throw a NPE > when they call {{getCause()}}, tests such as > {{SimpleReadWriteTest#readWithSchemaDisagreement}} and > {{SimpleReadWriteTest#writeWithSchemaDisagreement}}. > Can be fixed by simply not unwrapping the cause in those tests, use the > thrown exception directly, if the behavior of "not always wrapping in > RuntimeException" is agreed to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Description: In JVM dtests, all exceptions thrown during request execution used to always be wrapped in a RuntimeException. After the CASSANDRA-15650 changes, the behavior is: if the exception is a RuntimeException, rethrow it as-is; otherwise, wrap it in a RuntimeException, as you can see in: [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] This causes tests that previously extracted the cause from the wrapping RuntimeException to check the root cause of the error, such as {{SimpleReadWriteTest#readWithSchemaDisagreement}} and {{SimpleReadWriteTest#writeWithSchemaDisagreement}}, to throw an NPE when they call {{getCause()}}. This can be fixed by simply not unwrapping the cause in those tests and using the thrown exception directly, if the behavior of "not always wrapping in a RuntimeException" is agreed to be correct. was: In JVM dtests, all exceptions thrown during request execution used to always be wrapped in a RuntimeException. After the CASSANDRA-15650 changes, the behavior is: if the exception is a RuntimeException, rethrow it as-is; otherwise, wrap it in a RuntimeException, as you can see in: [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] This causes tests that previously extracted the cause from the wrapping RuntimeException to check the root cause of the error, such as {{SimpleReadWriteTest#readWithSchemaDisagreement}} and {{SimpleReadWriteTest#writeWithSchemaDisagreement}}, to throw an NPE when they call {{getCause()}}. This can be fixed by simply not unwrapping the cause in those tests and using the thrown exception directly, if the behavior of "not always wrapping in a RuntimeException" is expected. 
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Bug Category: Parent values: Correctness(12982)Level 1 values: Test Failure(12990) Complexity: Low Hanging Fruit Discovered By: Unit Test Severity: Low Status: Open (was: Triage Needed)
[jira] [Assigned] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo reassigned CASSANDRA-15683: -- Assignee: Kevin Gallardo
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Fix Version/s: (was: 4.0)
[jira] [Updated] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15683: --- Fix Version/s: 4.0-alpha 4.0
[jira] [Commented] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
[ https://issues.apache.org/jira/browse/CASSANDRA-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073839#comment-17073839 ] Kevin Gallardo commented on CASSANDRA-15683: [~dcapwell] [~blerer] [~e.dimitrova] ping since you were author/reviewers on https://issues.apache.org/jira/browse/CASSANDRA-15650
[jira] [Created] (CASSANDRA-15683) Fix NPE in SimpleReadWriteTest after test framework behavior change
Kevin Gallardo created CASSANDRA-15683: -- Summary: Fix NPE in SimpleReadWriteTest after test framework behavior change Key: CASSANDRA-15683 URL: https://issues.apache.org/jira/browse/CASSANDRA-15683 Project: Cassandra Issue Type: Bug Components: Test/dtest Reporter: Kevin Gallardo In JVM dtests, all exceptions thrown during request execution used to always be wrapped in a RuntimeException. After the CASSANDRA-15650 changes, the behavior is: if the exception is a RuntimeException, rethrow it as-is; otherwise, wrap it in a RuntimeException, as you can see in: [https://github.com/apache/cassandra/commit/dfc279a22a5563ac7a832a586914d5410426e9b7#diff-0b019281b7e97248577c82af0e663ef4R211] This causes tests that previously extracted the cause from the wrapping RuntimeException to check the root cause of the error, such as {{SimpleReadWriteTest#readWithSchemaDisagreement}} and {{SimpleReadWriteTest#writeWithSchemaDisagreement}}, to throw an NPE when they call {{getCause()}}. This can be fixed by simply not unwrapping the cause in those tests and using the thrown exception directly, if the behavior of "not always wrapping in a RuntimeException" is expected.
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072930#comment-17072930 ] Kevin Gallardo commented on CASSANDRA-15631: Great, also FWIW the tests are run on each commit at https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15631-trunk > Add AssertJ test dependency > --- > > Key: CASSANDRA-15631 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15631 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest, Test/unit >Reporter: Kevin Gallardo >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > See > [proposal|https://lists.apache.org/thread.html/rc562ec47578d0ae6f346ba9e3d7469c1cd3f8b521a72ddcb2accc47b%40%3Cdev.cassandra.apache.org%3E]. > The goal is to add [AssertJ|https://assertj.github.io/doc/] to the test framework to allow for more comprehensible and easier-to-write tests.
[jira] [Updated] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-14520: --- Resolution: Cannot Reproduce Status: Resolved (was: Open) > ClosedChannelException handled as FSError > - > > Key: CASSANDRA-14520 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14520 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Blake Eggleston >Assignee: Kevin Gallardo >Priority: Urgent > Fix For: 4.0 > > > After the messaging service netty refactor, I’ve seen a few instances where a closed socket causes a ClosedChannelException (an IOException subclass) to be thrown. The exception is caught by ChannelProxy, interpreted as a disk error, and is then re-thrown as an FSError, causing the node to be shut down.
[jira] [Issue Comment Deleted] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-14520: --- Comment: was deleted (was: Discussed this with [~blerer] in detail; it seems the second {{ClosedByInterruptException}} appearing in the stack trace was caused by the first exception. It is hard to track exactly how that was happening because of the large refactoring around streaming and other parts of the code since the ticket was opened. It is safer to keep the current behavior in {{ChannelProxy}}, which is to throw an Error in case of an IOException. Will close the issue; if it happens again with the current state of the code, we can take a closer look and figure out whether the {{ClosedByInterruptException}} is still problematic.)
[jira] [Commented] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072904#comment-17072904 ] Kevin Gallardo commented on CASSANDRA-14520: Discussed this with [~blerer] in detail; it seems the second {{ClosedByInterruptException}} appearing in the stack trace was caused by the first exception. It is hard to track exactly how that was happening because of the large refactoring around streaming and other parts of the code since the ticket was opened. It is safer to keep the current behavior in {{ChannelProxy}}, which is to throw an Error in case of an IOException. Will close the issue; if it happens again with the current state of the code, we can take a closer look and figure out whether the {{ClosedByInterruptException}} is still problematic.
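The behavior being kept can be sketched in a self-contained way (hypothetical names; {{FSReadError}} below is only a stand-in for Cassandra's real class, not its API): any IOException surfacing from the channel, including a ClosedChannelException caused by a closed socket rather than a disk fault, is rethrown as a disk-error-style Error.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

public class ProxySketch {
    // stand-in for the real org.apache.cassandra.io.FSReadError
    static class FSReadError extends Error {
        FSReadError(Throwable cause) { super(cause); }
    }

    // ChannelProxy-style wrapper: every IOException becomes a fatal Error,
    // even when the cause (a closed channel) is network-side, not disk-side
    static void read() {
        try {
            throw new ClosedChannelException(); // simulate a closed socket/channel
        } catch (IOException e) {
            throw new FSReadError(e);
        }
    }

    static String describeFailure() {
        try {
            read();
            return "no error";
        } catch (FSReadError e) {
            return "FSReadError caused by " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure());
    }
}
```

This illustrates why the original report was concerning: ClosedChannelException extends IOException, so the disk-error path fires and shuts the node down.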
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072859#comment-17072859 ] Kevin Gallardo commented on CASSANDRA-15631: Yeah, I have been trying to figure out what was wrong with this dependency. It seems a transitive dependency is being pulled from Maven Central via http rather than https, and consequently the download fails because the repository no longer supports HTTP. I.e. I added the assertj-core dependency, and its parent, assertj-parent-pom, pulls in a further dependency, but ant tries to fetch it over http. It seems related to what's described here: https://clutcher.github.io/post/hybris/maven_central_repository_http_issue/ {noformat} Starting from January 15, 2020 maven central repository is not supporting access via http://repo1.maven.org/maven2 due to security reasons. Unfortunately under the hood hybris uses ant-maven-task, which is not maintained since 2011 and has hardcoded link to http version of maven central. 
{noformat} Without the explicit dependency we get an error like: {noformat} /Users/kevingallardo/Documents/workspace/cassandra-trunk/build.xml:905: Unable to resolve artifact: Unable to get dependency information: Unable to read the metadata file for artifact 'org.assertj:assertj-core:jar': POM 'org.junit:junit-bom' not found in repository: Unable to download the artifact from any repository org.junit:junit-bom:pom:5.6.0 from the specified remote repositories: central (http://repo1.maven.org/maven2) for project org.junit:junit-bom org.assertj:assertj-core:jar:3.15.0 from the specified remote repositories: apache (https://repo.maven.apache.org/maven2), central (https://repo1.maven.org/maven2) Path to dependency: 1) org.apache.cassandra:cassandra-test-deps:jar:4.0-alpha4-SNAPSHOT {noformat}
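For reference, one common workaround for tooling that hardcodes the http URL of Maven Central (an assumption on my side, not something verified against ant-maven-tasks here) is to declare an https mirror of {{central}} in {{~/.m2/settings.xml}}, so legacy resolvers are redirected:

```xml
<!-- Hypothetical workaround sketch: mirror the legacy http "central" repo
     to the https endpoint so tools hardcoding
     http://repo1.maven.org/maven2 still resolve. -->
<settings>
  <mirrors>
    <mirror>
      <id>central-https</id>
      <mirrorOf>central</mirrorOf>
      <url>https://repo1.maven.org/maven2</url>
    </mirror>
  </mirrors>
</settings>
```

Whether this applies depends on the resolver honoring settings.xml mirrors; declaring the explicit dependency in the build, as done on the branch, side-steps the question entirely.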
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071854#comment-17071854 ] Kevin Gallardo commented on CASSANDRA-15631: [~dcapwell] I have added a new commit on the same branch that fixes the issues you mentioned. Thanks
[jira] [Updated] (CASSANDRA-15676) flaky test testWriteUnknownResult- org.apache.cassandra.distributed.test.CasWriteTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15676: --- Fix Version/s: 4.0-beta > flaky test testWriteUnknownResult- > org.apache.cassandra.distributed.test.CasWriteTest > - > > Key: CASSANDRA-15676 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15676 > Project: Cassandra > Issue Type: Task > Components: Test/dtest >Reporter: Kevin Gallardo >Priority: Normal > Fix For: 4.0-beta > > > Failure observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/33/workflows/54007cf7-4424-4ec1-9655-665f6044e6d1/jobs/187/tests > {noformat} > testWriteUnknownResult - org.apache.cassandra.distributed.test.CasWriteTest > junit.framework.AssertionFailedError: Expecting cause to be > CasWriteUncertainException > at > org.apache.cassandra.distributed.test.CasWriteTest.testWriteUnknownResult(CasWriteTest.java:257) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {noformat}
[jira] [Created] (CASSANDRA-15676) flaky test testWriteUnknownResult- org.apache.cassandra.distributed.test.CasWriteTest
Kevin Gallardo created CASSANDRA-15676: -- Summary: flaky test testWriteUnknownResult- org.apache.cassandra.distributed.test.CasWriteTest Key: CASSANDRA-15676 URL: https://issues.apache.org/jira/browse/CASSANDRA-15676 Project: Cassandra Issue Type: Task Components: Test/dtest Reporter: Kevin Gallardo Failure observed in: https://app.circleci.com/pipelines/github/newkek/cassandra/33/workflows/54007cf7-4424-4ec1-9655-665f6044e6d1/jobs/187/tests {noformat} testWriteUnknownResult - org.apache.cassandra.distributed.test.CasWriteTest junit.framework.AssertionFailedError: Expecting cause to be CasWriteUncertainException at org.apache.cassandra.distributed.test.CasWriteTest.testWriteUnknownResult(CasWriteTest.java:257) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) {noformat}
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067816#comment-17067816 ] Kevin Gallardo commented on CASSANDRA-15631: Hey [~dcapwell], thank you very much for looking into it. For some reason I did not get a notification of your comment and have done some more work on the branch since then. Notably, I bumped the AssertJ dependency to 3.15. bq. at line is there a reason you added the round? It slightly changes the original test behavior; I don't mind, but it is a semantic difference. Right, I had rounded because {{measureDeep()}} returns a long and the signature of {{isLessThan()}} doesn't accept a double, but instead I should change the compared value to a double so I don't have to round. bq. at line you dropped this assert Not sure which assert you were referring to since the code has changed since your comment, but tracing back to the line number you linked on trunk points to the assert on {{session.getSyncingTasks().size()}}. In my branch it was changed to {{assertThat(session.getSyncingTasks()).isEmpty()}}, so the functionality should be equivalent. Please let me know if you were referring to another assert. bq. at line should we change the assert message to be more clear? This will only print out the numbers, may be nice to comment that they are object sizes. Makes sense, will change that. For CI, you can see the runs at https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15631-trunk . Latest run (#17) had a failure that didn't seem related to these changes, and the test passed when I ran it locally. 
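The rounding point can be illustrated in plain Java (hypothetical numbers, not values from the actual test): rounding the double bound so it fits a long comparison is not equivalent to widening the measured long to a double, so the two checks can disagree near the boundary.

```java
public class BoundSketch {
    // rounds the bound to fit a long comparison -- the rounded variant
    static boolean roundedCheck(long measured, double bound) {
        return measured < Math.round(bound);
    }

    // widens the measured long to a double instead -- no precision loss in the bound
    static boolean wideningCheck(long measured, double bound) {
        return (double) measured < bound;
    }

    public static void main(String[] args) {
        long measuredBytes = 1049L;   // stand-in for a measureDeep()-style result
        double bound = 1049.4;        // expected size with a fractional tolerance
        // Math.round(1049.4) == 1049, so the rounded check rejects a value
        // the double comparison accepts -- a small but real semantic difference.
        System.out.println(roundedCheck(measuredBytes, bound));  // false
        System.out.println(wideningCheck(measuredBytes, bound)); // true
    }
}
```

This is why changing the compared value to a double, rather than rounding, preserves the original test semantics.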
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067780#comment-17067780 ] Kevin Gallardo commented on CASSANDRA-15543: Thanks [~blerer]! > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code}
[jira] [Commented] (CASSANDRA-15631) Add AssertJ test dependency
[ https://issues.apache.org/jira/browse/CASSANDRA-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066164#comment-17066164 ] Kevin Gallardo commented on CASSANDRA-15631: Changes proposed in https://github.com/newkek/cassandra/tree/15631-trunk. As proposed on the ML, I added the test dependency and also went ahead and rewrote one randomly chosen test to show what a full test written with AssertJ assertions would look like. Also included a custom assertion to show how it helps make the intent of tests clearer.
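For illustration, the custom-assertion pattern looks roughly like the following (a library-free sketch with hypothetical names, not the class from the branch): a small fluent wrapper around a domain value that makes the test read as its intent rather than as raw comparisons.

```java
// Hypothetical fluent assertion in the AssertJ style, self-contained
// (the real thing would extend org.assertj.core.api.AbstractAssert).
public class ClusterAssert {
    private final int liveNodes;

    private ClusterAssert(int liveNodes) { this.liveNodes = liveNodes; }

    // entry point, mirroring AssertJ's assertThat(...) naming
    static ClusterAssert assertThatCluster(int liveNodes) {
        return new ClusterAssert(liveNodes);
    }

    // intention-revealing check with a descriptive failure message
    ClusterAssert hasLiveNodes(int expected) {
        if (liveNodes != expected)
            throw new AssertionError("expected " + expected + " live nodes but was " + liveNodes);
        return this; // return this for fluent chaining
    }

    public static void main(String[] args) {
        assertThatCluster(3).hasLiveNodes(3); // reads as the test's intent
        System.out.println("ok");
    }
}
```

The payoff is in the test body: a failing check explains itself, and chained calls document the scenario being asserted.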
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066073#comment-17066073 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/24/20, 7:34 PM: -- After review from [~blerer] I have made some changes and the latest CI run is [here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] as #11. All the unit and JVM dtests ran successfully. was (Author: newkek): After review from [~blerer] I have made some changes and the latest CI run is [here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] as #11. All the unit and JVM dtests run successfully.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066110#comment-17066110 ] Kevin Gallardo commented on CASSANDRA-15543: Just pushed a small commit that removes some unused imports I forgot to clean up: [https://github.com/newkek/cassandra/commit/55e94d94168ec62b01bdb764e1ea74d3d217ec41] Feel free to squash > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066073#comment-17066073 ] Kevin Gallardo commented on CASSANDRA-15543: After review from [~blerer] I have made some changes and the latest CI run is [here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] as #11. All the unit and JVM dtests run successfully. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063456#comment-17063456 ] Kevin Gallardo commented on CASSANDRA-14520: This exception is caused when a read of a local SSTable gets interrupted, typically by another thread. When this interruption happens, the error is caught by the {{ChannelProxy}} and rethrown as a {{FsReadError}}. My understanding then is that the {{FsReadError}} is not caught, but handled at a higher level in the Cassandra daemon by the {{DefaultFsErrorHandler}}. The {{DefaultFsErrorHandler}} has multiple ways of treating an FsError, decided by how the {{disk_failure_policy}} is configured; if configured to {{stop}}, it will indeed stop the daemon/node. It seems that by default the {{disk_failure_policy}} is {{ignore}} though, which would mean it doesn't stop the node. I am not 100% sure that, since the CASSANDRA-15066 changes, long reads can no longer be interrupted by another thread, i.e. it seems this exception could still be thrown at any time. In any case, it seems a long read of a local SSTable being interrupted shouldn't ever trigger the node to shut down? In which case I would suggest catching the {{ClosedByInterruptException}} in {{ChannelProxy}} and rethrowing it as a runtime exception, which should be caught in {{NettyStreamingMessageSender.FileStreamTask#run}} for more graceful exception handling. Does that make sense? [~bdeggleston] [~benedict] > ClosedChannelException handled as FSError > - > > Key: CASSANDRA-14520 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14520 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Blake Eggleston >Assignee: Kevin Gallardo >Priority: Urgent > Fix For: 4.0 > > > After the messaging service netty refactor, I’ve seen a few instances where a > closed socket causes a ClosedChannelException (an IOException subclass) to be > thrown.
The exception is caught by ChannelProxy, interpreted as a disk error, > and is then re-thrown as an FSError, causing the node to be shutdown. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
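The crux of the bug is the exception hierarchy: {{ClosedByInterruptException}} extends {{ClosedChannelException}}, which extends {{IOException}}, so a blanket {{IOException}} catch in {{ChannelProxy}} treats an interrupted or closed channel as a disk fault. A hedged sketch (the {{classify}} helper is hypothetical, not Cassandra code) of distinguishing the cases before wrapping anything as an FS error:

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;

// Hypothetical helper illustrating the proposed fix: separate
// interrupt/close-related I/O errors from genuine disk errors, instead
// of wrapping every IOException as an FSError.
public class ChannelErrorClassifier
{
    static String classify(IOException e)
    {
        // Order matters: ClosedByInterruptException IS-A ClosedChannelException,
        // so the most specific type must be checked first.
        if (e instanceof ClosedByInterruptException)
            return "interrupted";     // a thread interrupt closed the channel; not a disk fault
        if (e instanceof ClosedChannelException)
            return "channel-closed";  // socket/channel closed; still not a disk fault
        return "disk-error";          // anything else may be a real filesystem error
    }

    public static void main(String[] args)
    {
        System.out.println(classify(new ClosedByInterruptException())); // interrupted
        System.out.println(classify(new ClosedChannelException()));     // channel-closed
        System.out.println(classify(new IOException("read failed")));   // disk-error
    }
}
```

Only the last category would then feed into the {{disk_failure_policy}} machinery; the first two would be rethrown as runtime exceptions for the streaming code to handle.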
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061801#comment-17061801 ] Kevin Gallardo commented on CASSANDRA-15543: I have launched builds of this branch, you can see the results [on CircleCI|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk] > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061740#comment-17061740 ] Kevin Gallardo commented on CASSANDRA-15543: Great, thanks Benedict and Benjamin. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061159#comment-17061159 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/17/20, 8:07 PM: -- Hi [~benedict], I've seen you have done quite a bit of work in {{ReadCallback}}, I was wondering if you had some cycles to take a look at the patch above (and optionally the discussion going on in this ticket)? Thanks was (Author: newkek): Hi [~benedict], I've seen you have done quite a bit of work in {{ReadCallback}}, I was wondering if you had some cycles to take a look at the patch above? Thanks > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061159#comment-17061159 ] Kevin Gallardo commented on CASSANDRA-15543: Hi [~benedict], I've seen you have done quite a bit of work in {{ReadCallback}}, I was wondering if you had some cycles to take a look at the patch above? Thanks > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059049#comment-17059049 ] Kevin Gallardo commented on CASSANDRA-15642: [~e.dimitrova] fyi > Inconsistent failure messages on distributed queries > > > Key: CASSANDRA-15642 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination >Reporter: Kevin Gallardo >Priority: Normal > > As a follow up to some exploration I have done for CASSANDRA-15543, I > realized the following behavior in both {{ReadCallback}} and > {{AbstractWriteHandler}}: > - await for responses > - when all required number of responses have come back: unblock the wait > - when a single failure happens: unblock the wait > - when unblocked, look to see if the counter of failures is > 1 and if so > return an error message based on the {{failures}} map that's been filled > Error messages that can result from this behavior can be a ReadTimeout, a > ReadFailure, a WriteTimeout or a WriteFailure. > In case of a Write/ReadFailure, the user will get back an error looking like > the following: > "Failure: Received X responses, and Y failures" > (if this behavior I describe is incorrect, please correct me) > This causes a usability problem. Since the handler will fail and throw an > exception as soon as 1 failure happens, the error message that is returned to > the user may not be accurate. > (note: I am not entirely sure of the behavior in case of timeouts for now) > At, say, CL = QUORUM = 3, the failed request may complete first, then a > successful one completes, and another fails. If the exception is thrown fast > enough, the error message could say > "Failure: Received 0 response, and 1 failure at CL = 3" > Which 1. doesn't make a lot of sense because the CL doesn't match the > previous information, but 2. the information is incorrect. 
We received a > successful response, only it came after the initial failure. > From that logic, I think it is safe to assume that the information returned > in the error message cannot be trusted in case of a failure. We can only know > that at least 1 node has failed, or not if the response is successful. > I am suggesting that for a big improvement in usability, the ReadCallback and > AbstractWriteResponseHandler wait for all responses to come back before > unblocking the wait, or let it time out. This way, users will be able > to have some trust around the numbers returned to them. Also we would be able > to return more information this way. > Right now, an error that happens first prevents a timeout from happening > because it fails immediately, and so potentially it hides problems with other > replicas. If we were to wait for all responses, we might get a timeout; in > that case we'd also be able to tell whether failures have happened *before* > that timeout, and have a more complete view, instead of the current one where > you can't detect both situations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
[ https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15642: --- Description: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when the required number of responses has come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user may not be accurate. (note: I am not entirely sure of the behavior in case of timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful.
I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it time out. This way, users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we'd also be able to tell whether failures have happened *before* that timeout, and have a more complete view, instead of the current one where you can't detect both situations. was: As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when the required number of responses has come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user is not accurate. (note: I am not entirely sure of the behavior around timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails.
If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful. I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it time out. This way, users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout,
[jira] [Created] (CASSANDRA-15642) Inconsistent failure messages on distributed queries
Kevin Gallardo created CASSANDRA-15642: -- Summary: Inconsistent failure messages on distributed queries Key: CASSANDRA-15642 URL: https://issues.apache.org/jira/browse/CASSANDRA-15642 Project: Cassandra Issue Type: Improvement Components: Consistency/Coordination Reporter: Kevin Gallardo As a follow up to some exploration I have done for CASSANDRA-15543, I realized the following behavior in both {{ReadCallback}} and {{AbstractWriteHandler}}: - await for responses - when the required number of responses has come back: unblock the wait - when a single failure happens: unblock the wait - when unblocked, look to see if the counter of failures is > 1 and if so return an error message based on the {{failures}} map that's been filled Error messages that can result from this behavior can be a ReadTimeout, a ReadFailure, a WriteTimeout or a WriteFailure. In case of a Write/ReadFailure, the user will get back an error looking like the following: "Failure: Received X responses, and Y failures" (if this behavior I describe is incorrect, please correct me) This causes a usability problem. Since the handler will fail and throw an exception as soon as 1 failure happens, the error message that is returned to the user is not accurate. (note: I am not entirely sure of the behavior around timeouts for now) At, say, CL = QUORUM = 3, the failed request may complete first, then a successful one completes, and another fails. If the exception is thrown fast enough, the error message could say "Failure: Received 0 response, and 1 failure at CL = 3" Which 1. doesn't make a lot of sense because the CL doesn't match the previous information, but 2. the information is incorrect. We received a successful response, only it came after the initial failure. From that logic, I think it is safe to assume that the information returned in the error message cannot be trusted in case of a failure. We can only know that at least 1 node has failed, or not if the response is successful.
I am suggesting that for a big improvement in usability, the ReadCallback and AbstractWriteResponseHandler wait for all responses to come back before unblocking the wait, or let it time out. This way, users will be able to have some trust around the numbers returned to them. Also we would be able to return more information this way. Right now, an error that happens first prevents a timeout from happening because it fails immediately, and so potentially it hides problems with other replicas. If we were to wait for all responses, we might get a timeout; in that case we'd also be able to tell whether failures have happened *before* that timeout, and have a more complete view, instead of the current one where you can't detect both situations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
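The discrepancy described above — the failure counters mutating while the error message is being built — can be sketched with an immutable-snapshot approach (the class and method names below are hypothetical illustrations, not the actual {{ReadCallback}} code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: take a defensive, effectively-immutable snapshot of
// the shared failures map at the moment the error is built, so the reported
// numbers cannot drift as more responses arrive concurrently.
public class FailureSnapshot
{
    static String buildErrorMessage(int received, Map<String, String> failures)
    {
        // Copy once; later mutations of `failures` no longer affect the message.
        Map<String, String> snapshot = new HashMap<>(failures);
        return "Received " + received + " responses, and " + snapshot.size() + " failures";
    }

    public static void main(String[] args)
    {
        Map<String, String> failures = new ConcurrentHashMap<>();
        failures.put("node1", "READ_FAILURE");

        String msg = buildErrorMessage(0, failures);

        // A late failure arriving after the message was built does not
        // retroactively change what the user is told.
        failures.put("node2", "TIMEOUT");
        System.out.println(msg); // Received 0 responses, and 1 failures
    }
}
```

This only makes the message internally consistent at the instant it is produced; the larger proposal in the ticket (waiting for all responses or the timeout) is what would make the numbers complete.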
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058756#comment-17058756 ] Kevin Gallardo commented on CASSANDRA-15543: [~ifesdjeen] since you were the author of the test originally, would you have some time to review the fixes above? Thank you > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo updated CASSANDRA-15543: --- Test and Documentation Plan: No doc change needed Status: Patch Available (was: In Progress) > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058206#comment-17058206 ] Kevin Gallardo commented on CASSANDRA-15543: Changes available at https://github.com/newkek/cassandra/tree/15543-trunk/ > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057135#comment-17057135 ] Kevin Gallardo commented on CASSANDRA-15543: bq. Also, RequestFailureException is not used only here in the codebase... Right, I was planning on making the immutable copy changes for calls to WriteFailureException too > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/11/20, 3:23 PM: -- Thanks for the reply, Ekaterina, and for confirming the behavior described. Got some more time to think about this, and it seems the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can properly advocate/discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. was (Author: newkek): Thanks for the reply, Ekaterina. Got some more time to think about this, and I think the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. 
> flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/11/20, 3:23 PM: -- Thanks for the reply, Ekaterina, and for confirming the behavior described. Got some more time to think about this, and it seems the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can properly advocate/discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. Does that make sense? was (Author: newkek): Thanks for the reply, Ekaterina, and for confirming the behavior described. Got some more time to think about this, and it seems the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can properly advocate/discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. 
> flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091 ] Kevin Gallardo commented on CASSANDRA-15543: Thanks for the reply, Ekaterina. Got some more time to think about this, and I think the best way to approach it, since what I suggest would be a larger behavioral change, would be to create a separate ticket for it. I think in this ticket I would rather: Fix the discrepancies between the failures number and the failure messages with the immutable map copy + fix the test that expects 2 failures to only expect 1 to match current behavior. Then, in the other ticket we can discuss pros and cons of returning cohesive error messages VS fail-fast & incohesive error message. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15631) Add AssertJ test dependency
Kevin Gallardo created CASSANDRA-15631: -- Summary: Add AssertJ test dependency Key: CASSANDRA-15631 URL: https://issues.apache.org/jira/browse/CASSANDRA-15631 Project: Cassandra Issue Type: Improvement Components: Test/dtest, Test/unit Reporter: Kevin Gallardo Assignee: Kevin Gallardo See [proposal|https://lists.apache.org/thread.html/rc562ec47578d0ae6f346ba9e3d7469c1cd3f8b521a72ddcb2accc47b%40%3Cdev.cassandra.apache.org%3E]. The goal is to add [AssertJ|https://assertj.github.io/doc/] to the test framework to allow for tests that are more comprehensible and easier to write.
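As an illustration of the readability goal, here is a tiny hand-rolled fluent helper in the AssertJ style. This is not AssertJ itself, just a plain-JDK sketch of the chained {{assertThat(...)}} idiom the proposal is about:

```java
import java.util.List;
import java.util.function.Predicate;

// Not AssertJ: a minimal hand-rolled helper that mimics the
// assertThat(...).hasSize(...).contains(...) chaining style, so the
// readability gain can be shown without the dependency.
public class FluentAssertSketch {
    public static <T> ListAssert<T> assertThat(List<T> actual) {
        return new ListAssert<>(actual);
    }

    public static class ListAssert<T> {
        private final List<T> actual;
        ListAssert(List<T> actual) { this.actual = actual; }

        public ListAssert<T> hasSize(int expected) {
            if (actual.size() != expected)
                throw new AssertionError("expected size " + expected + " but was " + actual.size());
            return this; // returning `this` is what makes the chain read like prose
        }

        public ListAssert<T> contains(T element) {
            if (!actual.contains(element))
                throw new AssertionError("expected to contain " + element);
            return this;
        }

        public ListAssert<T> allMatch(Predicate<T> predicate) {
            if (!actual.stream().allMatch(predicate))
                throw new AssertionError("an element did not match the predicate");
            return this;
        }
    }

    public static void main(String[] args) {
        List<String> endpoints = List.of("127.0.0.1:7012", "127.0.0.2:7012");
        // Reads close to prose; each step reports a descriptive failure message.
        assertThat(endpoints).hasSize(2).contains("127.0.0.1:7012").allMatch(e -> e.endsWith(":7012"));
        System.out.println("ok");
    }
}
```

The real AssertJ library provides this chaining (plus rich diffs and exception assertions) out of the box; the sketch only shows why the style is easier to read and write than bare JUnit asserts.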
[jira] [Assigned] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo reassigned CASSANDRA-14520: -- Assignee: Kevin Gallardo > ClosedChannelException handled as FSError > - > > Key: CASSANDRA-14520 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14520 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Blake Eggleston >Assignee: Kevin Gallardo >Priority: Urgent > Fix For: 4.0 > > > After the messaging service netty refactor, I’ve seen a few instances where a > closed socket causes a ClosedChannelException (an IOException subclass) to be > thrown. The exception is caught by ChannelProxy, interpreted as a disk error, > and is then re-thrown as an FSError, causing the node to be shutdown.
[jira] [Commented] (CASSANDRA-14520) ClosedChannelException handled as FSError
[ https://issues.apache.org/jira/browse/CASSANDRA-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056029#comment-17056029 ] Kevin Gallardo commented on CASSANDRA-14520: Will start investigating this.
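As a sketch of the distinction this ticket hinges on (hypothetical classifier, not Cassandra's actual ChannelProxy code): a {{ClosedChannelException}} signals a socket lifecycle event, not a disk fault, so an error handler should separate it out before escalating an {{IOException}} to a fatal filesystem error:

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

// Hypothetical sketch: classify IOExceptions before escalating them as
// fatal filesystem errors, instead of treating every IOException alike.
public class ErrorClassificationSketch {
    public enum Severity { FATAL_FS_ERROR, RECOVERABLE_CHANNEL_CLOSE }

    public static Severity classify(IOException e) {
        // A closed socket/channel is expected during disconnects and should
        // not shut the node down the way a genuine disk error would.
        if (e instanceof ClosedChannelException)
            return Severity.RECOVERABLE_CHANNEL_CLOSE;
        return Severity.FATAL_FS_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(classify(new ClosedChannelException()));       // RECOVERABLE_CHANNEL_CLOSE
        System.out.println(classify(new IOException("disk read error"))); // FATAL_FS_ERROR
    }
}
```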
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/9/20, 2:08 PM: - Sounds good, thanks, hope you had a good weekend :) As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. was (Author: newkek): Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. 
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/9/20, 2:07 PM: - Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. was (Author: newkek): Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response. 
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990 ] Kevin Gallardo commented on CASSANDRA-15543: Sounds good. As a summary: in any case, I believe passing an immutable copy of the {{failureReasonByEndpoint}} map to the constructor of Read/WriteFailureException would reduce the chances for the {{number of failures}} and the failure messages to be inconsistent. In addition to that, there's the remaining question of the behavior of ReadCallback when failures happen (do we fail fast? or do we wait for all responses to come back/timeout?). Depending on the outcome of that, the test that is flaky at the moment would need to be adjusted to expect 1 *or* 2 failures in the response.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053733#comment-17053733 ] Kevin Gallardo commented on CASSANDRA-15543: {{RequestFailureException}} uses the map's size to report the number of failures: {code} protected RequestFailureException(ExceptionCode code, ConsistencyLevel consistency, int received, int blockFor, Map<InetAddressAndPort, RequestFailureReason> failureReasonByEndpoint) { super(code, String.format("Operation failed - received %d responses and %d failures: %s", received, failureReasonByEndpoint.size(), buildFailureString(failureReasonByEndpoint))); {code} It makes a copy of the map later in the constructor, but I suppose there is still a chance the map's contents differ between the call to {{size()}} and the entries iterated in {{buildFailureString()}}.
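The window described here can be sketched outside Cassandra with plain JDK types (hypothetical names): a message built from a live map reads it twice, once for {{size()}} and once for the entries, so a concurrent writer can make the two reads disagree, while a snapshot taken up front cannot:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the inconsistency: building a message from a live map reads it
// twice (size, then entries), so a concurrent writer can change it between
// the two reads. Copying first makes both reads see the same snapshot.
public class SnapshotSketch {
    public static String buildMessage(Map<String, String> failureReasonByEndpoint) {
        return String.format("%d failures: %s",
                             failureReasonByEndpoint.size(),
                             String.join(", ", failureReasonByEndpoint.values()));
    }

    public static void main(String[] args) {
        Map<String, String> live = new ConcurrentHashMap<>();
        live.put("127.0.0.2:7012", "INCOMPATIBLE_SCHEMA");

        // Defensive copy taken once, before any read: size and contents
        // can no longer diverge, even if `live` keeps being mutated.
        Map<String, String> snapshot = new HashMap<>(live);
        live.put("127.0.0.3:7012", "INCOMPATIBLE_SCHEMA"); // late concurrent update

        System.out.println(buildMessage(snapshot)); // 1 failures: INCOMPATIBLE_SCHEMA
        System.out.println(buildMessage(live));     // 2 failures: ...
    }
}
```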
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:45 PM: - Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. EDIT: I confused the number of received and the number of failures, I think what is below is incorrect. -Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with- {code:java} new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code} -Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with:- {code:java} Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} -"2" failures but only 1 error message. 
(number of {{received}} is irrelevant)- -Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario:- - -1 error triggers awaitResults() to unblock- - -call to {{new ReadFailureException(received, failureReasonByEndpoint)}}- - -value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given- - -concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map- - -{{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message- -We end up with the error message:- {code:java} Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} -"1" failure but 2 error messages. (number of {{received}} is irrelevant)- -One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.- -If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think.- was (Author: newkek): Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053724#comment-17053724 ] Kevin Gallardo commented on CASSANDRA-15543: Actually I just realized that the value passed in the {{ReadFailureException}} constructor is {{received}}; I mistakenly thought it was the number of failures. I need to re-think the comment above.
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:04 PM: - Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. 
(number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think. was (Author: newkek): Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve this second issue would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think.
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647 ] Kevin Gallardo commented on CASSANDRA-15543: Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. 
(number of {{received}} is irrelevant) One way to solve this second issue would be to pass a copy of the {{failureReasonByEndpoint}} map to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timed out, the 2 issues wouldn't be problematic, I think.
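A possible shape for that fix, sketched with plain JDK types (hypothetical class, not Cassandra's actual {{RequestFailureException}}): take the copy once, up front, and derive both the count and the message from the same snapshot:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed fix: the factory snapshots the map
// before anything reads it, so the failure count and the failure list in
// the message can never disagree, even if the live map keeps mutating.
public class FailureExceptionSketch extends RuntimeException {
    private final Map<String, String> failureReasonByEndpoint;

    private FailureExceptionSketch(int received, Map<String, String> snapshot) {
        super(String.format("Operation failed - received %d responses and %d failures: %s",
                            received, snapshot.size(), String.join(", ", snapshot.values())));
        this.failureReasonByEndpoint = snapshot;
    }

    // Copy happens exactly once, here; everything downstream sees the snapshot.
    public static FailureExceptionSketch create(int received, Map<String, String> liveFailures) {
        return new FailureExceptionSketch(received, Map.copyOf(liveFailures));
    }

    public Map<String, String> failures() { return failureReasonByEndpoint; }

    public static void main(String[] args) {
        Map<String, String> live = new ConcurrentHashMap<>();
        live.put("127.0.0.2:7012", "INCOMPATIBLE_SCHEMA");
        FailureExceptionSketch e = create(1, live);
        live.put("127.0.0.3:7012", "INCOMPATIBLE_SCHEMA"); // late failure, after snapshot
        System.out.println(e.getMessage()); // count and reason list agree: 1 failure
    }
}
```

Cassandra's real exception classes differ in types and details; the point here is only the ordering: copy first, then read the size and the entries.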
[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053578#comment-17053578 ] Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 4:36 PM: - I have been looking into this issue and trying to reproduce it locally. Did not have much luck, re-executed the test about 2500 times (let it run overnight) and only managed to reproduce something ~inconsistent in one case. The test setup is: execute a schema alteration request on 1 node, don't let the node propagate the schema changes to other nodes, then execute a select at Consistency = ALL on that same node and expect a failure, because the 2 other nodes did not get the schema alteration request. So the message we expect from the Select request would be the following: {code:java} Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} Node1's response is successful (because it has the schema changes), but all the others fail because they have a stale schema. The closest I got to reproducing something inconsistent to this was: {code:java} Operation failed - received 0 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} But also let's consider the ones Brandon had observed. I have looked into some of the places where {{ReadFailureException}} are generated, and landed in the [{{ReadCallback}}|https://github.com/apache/cassandra/blob/aed15bfb01e753f9bac8b149b382c7a7c8d33183/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L162] class. The logic of the code is: - call {{awaitResults()}} - in {{awaitResults()}}: -- block on {{await()}} -- when {{await()}} exits, check what responses we got and whether errors came back Also looked into {{onFailure()}} method, the logic is the following: {code:java} int n = waitingFor(from) ? 
failuresUpdater.incrementAndGet(this) : failures; failureReasonByEndpoint.put(from, failureReason); if (blockFor + n > replicaPlan().contacts().size()) { condition.signalAll(); // this signals as soon as 1 failure happens } {code} {{waitingFor(from)}} is supposed to filter out errors coming from non-local nodes, it seems. In this test it always returns true, so for simplicity we'll assume it does. Same for {{blockFor}} and {{replicaPlan().contacts().size()}}: these are both {{3}} in this test, though I suppose they may differ in other use cases. As we can see above, once 1 failure happens, the {{condition}} will be signaled to unblock waiting callers. When {{awaitResults()}} unblocks, it checks the value of {{failures}} to see how many errors happened and reports it as a {{ReadFailureException}}. The code is designed so that the {{condition}} is freed when either all responses are successful, or one failure happened. Now my understanding is that, in certain circumstances, we end up returning an error message as soon as 1 error happens, without waiting for all responses to come back, and the values of {{failures}} and {{received}} are not guaranteed to be at a certain value, except that in case of a failure we know {{failures}} will be > 0. I am unclear on what the expected behavior should be here. From my POV, the most reasonable logic would be that {{await()}} only exits once all responses have come back, failed or successful, or it times out. Therefore I would replace the {{SimpleCondition condition}} with a {{CountDownLatch}} initialized to the number of expected responses, {{countDown()}} on {{onReceived()}} and {{onFailure()}}, and have {{await()}} use that CountDownLatch. If the behavior of failing as soon as 1 failure happens is preferred, then the logic can stay that way, but the test needs to be modified to account for the possibility that only 1 error and no successful response may be returned in the {{ReadFailureException}}. 
[~brandon.williams] [~e.dimitrova] got any insights? was (Author: newkek): I have been looking into this issue and trying to reproduce it locally. Did not have much luck, re-executed the test about 2500 times (let it run overnight) and only managed to reproduce something ~inconsistent in one case. The test setup is: execute a schema alteration request on 1 node, don't let the node propagate the schema changes to other nodes, then execute a select at Consistency = ALL on that same node and expect a failure, because the 2 other nodes did not get the schema alteration request. So the message we expect from the Select request would be the following: {code:java} Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} Node1's response is successful (because it has the schema changes), but all the others fail because
[jira] [Commented] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053578#comment-17053578 ] Kevin Gallardo commented on CASSANDRA-15543:

I have been looking into this issue and trying to reproduce it locally. I did not have much luck: I re-executed the test about 2500 times (letting it run overnight) and only managed to reproduce something inconsistent in one case.

The test setup is: execute a schema alteration request on one node, prevent that node from propagating the schema change to the other nodes, then execute a SELECT at consistency level ALL on that same node and expect a failure, because the two other nodes did not receive the schema alteration. The message we expect from the SELECT request is the following:

{code:java}
Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}

Node1's response is successful (because it has the schema changes), but the two others fail because they have a stale schema. The closest I got to reproducing something inconsistent with this was:

{code:java}
Operation failed - received 0 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}

But let's also consider the variants Brandon had observed. I have looked into some of the places where {{ReadFailureException}}s are generated, and landed in the [{{ReadCallback}}|https://github.com/apache/cassandra/blob/aed15bfb01e753f9bac8b149b382c7a7c8d33183/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L162] class. The logic of the code is:
- call {{awaitResults()}}
- in {{awaitResults()}}:
-- block on {{await()}}
-- when {{await()}} exits, check what responses we got and whether errors came back

I also looked into the {{onFailure()}} method; its logic is the following:

{code:java}
int n = waitingFor(from) ? failuresUpdater.incrementAndGet(this) : failures;
failureReasonByEndpoint.put(from, failureReason);
if (blockFor + n > replicaPlan().contacts().size())
{
    condition.signalAll(); // this signals as soon as 1 failure happens
}
{code}

{{waitingFor(from)}} appears to filter out errors coming from non-local nodes; in this test it always returns true, so for simplicity we can assume it does. Likewise, {{blockFor}} and {{replicaPlan().contacts().size()}} are both {{3}} in this test, although I suppose they may differ in other use cases.

As we can see above, as soon as one failure happens the {{condition}} is signaled to unblock waiting callers. When {{awaitResults()}} unblocks, it checks the value of {{failures}} to see how many errors happened and reports them as a {{ReadFailureException}}. The code is designed so that the {{condition}} is freed either when all responses are successful, or when one failure has happened.

My understanding is that, under certain circumstances, we end up returning an error message as soon as one error happens rather than waiting for all responses to come back, so the values of {{failures}} and {{received}} are not guaranteed to be at a certain value; all we know in case of a failure is that {{failures}} will be > 0.

I am unclear on what the expected behavior should be here. From my point of view, the most reasonable logic would be for {{await()}} to exit only once all responses have come back, failed or successful, or once it times out. I would therefore replace the {{SimpleCondition condition}} with a {{CountDownLatch}} initialized to the number of expected responses, call {{countDown()}} in both {{onReceived}} and {{onFailure}}, and have {{await()}} use that latch.

If the behavior of failing as soon as one failure happens is preferred, then the logic can stay as it is, but the test needs to be modified to account for the possibility that only one error and no successful response may be reported in the {{ReadFailureException}}.
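To make the proposal concrete, here is a minimal, standalone sketch of the latch-based approach — wait for every replica response (success or failure) before unblocking, instead of signaling on the first failure. The class and method names ({{AllResponsesCallback}}, etc.) are illustrative only, not the actual {{ReadCallback}} API:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed change: a CountDownLatch initialized
// to the number of expected responses, counted down on both success and
// failure, so await() only unblocks once every replica has answered
// (or the wait times out).
class AllResponsesCallback
{
    private final CountDownLatch latch;
    private final AtomicInteger received = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();

    AllResponsesCallback(int expectedResponses)
    {
        this.latch = new CountDownLatch(expectedResponses);
    }

    void onReceived()
    {
        received.incrementAndGet();
        latch.countDown(); // successful response counts toward completion
    }

    void onFailure()
    {
        failures.incrementAndGet();
        latch.countDown(); // failed response also counts toward completion
    }

    // Returns true if all expected responses arrived before the timeout.
    boolean await(long timeout, TimeUnit unit) throws InterruptedException
    {
        return latch.await(timeout, unit);
    }

    int received() { return received.get(); }
    int failures() { return failures.get(); }
}
```

With this shape, whenever {{await()}} returns true the caller is guaranteed that {{received + failures}} equals the expected response count (3 in this test), so the failure message is deterministic — unlike the current signal-on-first-failure logic, where {{received}} may be 0, 1, or 2 depending on timing.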
[~brandon.williams] [~e.dimitrova] got any insights? > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > --- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: David Capwell >Assignee: Kevin Gallardo >Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gallardo reassigned CASSANDRA-15543: -- Assignee: Kevin Gallardo - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org