[ 
https://issues.apache.org/jira/browse/CASSANDRA-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714428#comment-14714428
 ] 

Paulo Motta commented on CASSANDRA-10171:
-----------------------------------------

{{simple_repair_test}} and {{interrupt_build_process_test}} seem to have been 
fixed on 
[#29|http://cassci.datastax.com/view/win32/job/cassandra-3.0_dtest_win32/29/testReport/]
 after 
[e6a9afbb8a759fefc83334e470f5b8965f12a467|https://github.com/apache/cassandra/commit/e6a9afbb8a759fefc83334e470f5b8965f12a467].
 Since these tests do not need hint functionality, I 
[disabled|https://github.com/riptano/cassandra-dtest/commit/f73a2d25141c156ae45bbe8f0f31e95787b8e657]
 hinted handoff for those tests, similar to what is done in other tests of the 
same class.

Both {{complex_repair_test}} and {{really_complex_repair_test}} were flaky on 
[Linux|http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/lastCompletedBuild/testReport/junit/materialized_views_test/TestMaterializedViews/really_complex_repair_test/history/]
 and consistently failing on 
[Windows|http://cassci.datastax.com/view/win32/job/cassandra-3.0_dtest_win32/24/testReport/junit/materialized_views_test/history/]
 due to a timing problem, explained in more detail below. These tests had 
the following setup:
* 3 nodes
* RF=3
* node2 and node3 were stopped and the base table of the MV was updated on node1
* since materialized views require batch writes, which in turn require at least 
one additional live node to store batchlogs, node4 was created in dc2 with RF=0 
to fulfill that requirement

However, batchlog endpoints [must be in the same 
datacenter|https://github.com/apache/cassandra/blob/5b4393694760d530648a818b2b1d10429b95a0e4/src/java/org/apache/cassandra/service/StorageProxy.java#L1054],
 otherwise the batchlog request cannot succeed. So why were the tests passing, 
since the only other alive node (node4) was in another data center?

Well, there is a 
[60-second|https://github.com/apache/cassandra/blob/03f556ffa8718754fe4eb329af2002d83ffc7147/src/java/org/apache/cassandra/locator/PropertyFileSnitch.java#L75]
 window before the topology file is reloaded, during which node4 was considered 
to be in the default datacenter (dc1), so inserts succeeded and the tests passed 
on fast enough nodes. However, on slower nodes (such as slower Linux nodes or 
win32 nodes), the topology file was reloaded after 60s while the test was still 
running, and node4 was then considered to be in dc2, so the batchlog write 
failed with:

{noformat}code=1000 [Unavailable exception] message="Cannot achieve consistency 
level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 
'ONE'}{noformat}
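
The timing dependence can be illustrated with a small, self-contained model. This is only a sketch, not Cassandra's implementation: {{ToySnitch}}, {{batchlog_candidates}} and the {{elapsed}} parameter are hypothetical names, and the only facts taken from the analysis above are the 60-second reload window, the fallback to the default datacenter, and the same-datacenter requirement for batchlog endpoints.

```python
RELOAD_SECONDS = 60  # reload delay described above


class Unavailable(Exception):
    """Toy stand-in for the Unavailable error seen in the test."""


class ToySnitch:
    """Hypothetical snitch: nodes missing from the loaded topology fall
    back to the default DC until the topology file is reloaded."""

    def __init__(self, loaded, pending, default_dc="dc1"):
        self.loaded = dict(loaded)    # topology known at startup
        self.pending = dict(pending)  # entries visible only after a reload
        self.default_dc = default_dc

    def datacenter(self, node, elapsed):
        if elapsed >= RELOAD_SECONDS and node in self.pending:
            return self.pending[node]
        return self.loaded.get(node, self.default_dc)


def batchlog_candidates(snitch, coordinator, live_nodes, elapsed):
    """Return live nodes, other than the coordinator, in the coordinator's DC."""
    local_dc = snitch.datacenter(coordinator, elapsed)
    candidates = [n for n in live_nodes
                  if n != coordinator
                  and snitch.datacenter(n, elapsed) == local_dc]
    if not candidates:
        raise Unavailable("no live batchlog endpoint in " + local_dc)
    return candidates


snitch = ToySnitch(loaded={"node1": "dc1", "node2": "dc1", "node3": "dc1"},
                   pending={"node4": "dc2"})
live = ["node1", "node4"]  # node2 and node3 are stopped

# Fast machine: the write happens before the reload, node4 looks local.
print(batchlog_candidates(snitch, "node1", live, elapsed=10))

# Slow machine: the write happens after the reload, node4 is in dc2.
try:
    batchlog_candidates(snitch, "node1", live, elapsed=75)
except Unavailable as exc:
    print("Unavailable:", exc)
```

In this model the same write succeeds or fails depending only on how much time has elapsed, which matches the flaky behaviour on Linux and the consistent failures on Windows.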

In addition, {{complex_repair_test}} was passing even after the 
{{repair()}} statements were removed, because the {{ALL}} consistency level was 
being used, which always retrieves the most recent updates regardless of 
whether all nodes are consistent.
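
A toy model shows why a read at {{ALL}} cannot detect a missing repair. This is a sketch under simplifying assumptions: the {{read}} helper and the timestamp/value pairs are hypothetical, and read reconciliation is reduced to "newest timestamp wins".

```python
def read(replicas, contacted):
    """Return the value with the newest timestamp among contacted replicas.

    `replicas` maps node name -> (timestamp, value). This is a toy model
    of read reconciliation, not Cassandra's read path.
    """
    return max(replicas[n] for n in contacted)[1]


# node1 received the update; node2 and node3 were down and never repaired.
replicas = {"node1": (2, "new"), "node2": (1, "old"), "node3": (1, "old")}

# CL ALL contacts every replica, so the newest value always wins,
# whether or not repair ran.
print(read(replicas, ["node1", "node2", "node3"]))  # new

# A read that reaches only the stale replicas exposes the inconsistency.
print(read(replicas, ["node2", "node3"]))  # old
```

Only a read served by a subset of replicas can return stale data, so only such a read can show that repair actually did something.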

To address these issues I refactored both {{complex_repair_test}} and 
{{really_complex_repair_test}} while maintaining the essence of the tests. The 
most significant changes were:
* Used 5 nodes and RF=5, to have a quorum of 3 and a subquorum of 2. This 
makes it possible to reach the minimum of 2 replicas for batchlogs while 
maintaining 2 separate partitions to test inconsistencies.
* Set the gc_grace_seconds of the base table to 1 second (it is not possible to 
set it to zero), to guarantee batchlogs would expire and there would be a 
mismatch between partitions before repair.
* Used CL {{QUORUM}} instead of {{ALL}} to verify inconsistencies.
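
The node counts above follow from the usual quorum formula. The snippet below is a minimal sketch of that arithmetic; the node names and the assignment of particular nodes to each group are hypothetical and are not taken from the PR.

```python
def quorum(rf):
    """Quorum size for a replication factor: floor(rf / 2) + 1."""
    return rf // 2 + 1


rf = 5
nodes = ["node%d" % i for i in range(1, rf + 1)]
majority = nodes[:quorum(rf)]  # 3 nodes, enough to serve QUORUM on their own
minority = nodes[quorum(rf):]  # 2 nodes, the "subquorum"

print(quorum(rf), len(majority), len(minority))  # 3 3 2
```

With RF=3 the same formula gives a quorum of 2 and leaves a single remaining node, which is why the original 3-node setup could not keep two live replicas on the minority side.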

The refactoring is available for review on this [cassandra-dtest 
PR|https://github.com/riptano/cassandra-dtest/pull/507]. Adding [~aboudreault] 
as reviewer.

> Windows dtest 3.0: materialized_views_test.py:TestMaterializedViews
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-10171
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10171
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>             Fix For: 3.0.x
>
>
> The following 3.0 dtests have been failing 
> [consistently|http://cassci.datastax.com/view/win32/job/cassandra-3.0_dtest_win32/24/testReport/junit/materialized_views_test/history/]
>  on Windows:
> * materialized_views_test.TestMaterializedViews.complex_repair_test
> * materialized_views_test.TestMaterializedViews.interrupt_build_process_test
> * materialized_views_test.TestMaterializedViews.really_complex_repair_test
> * materialized_views_test.TestMaterializedViews.simple_repair_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
