[
https://issues.apache.org/jira/browse/CASSANDRA-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111916#comment-15111916
]
Stefania commented on CASSANDRA-10938:
--------------------------------------
I've looked more closely at the failures on Jenkins since CASSANDRA-9303 was
committed and noted that:
* They happen more rarely, and mostly on 2.1
* The problem is only with COPY TO, not COPY FROM, so reducing the ingest rate
is not an option.
I've set up an AWS box with the same specs as the ones used by Jenkins
(m3.2xlarge) and run {{test_bulk_round_trip_blogposts}} 50 times with no
failures. Something else on the Jenkins boxes must be causing connections to be
rejected, but I could not work out what.
So I decided to simulate a failed connection by setting
{{native_transport_max_concurrent_connections}} to limit the number of
connections accepted by hosts. It doesn't tell us what's happening on Jenkins
but at least it allows us to test COPY TO in the face of failed connections,
which is a good thing anyway and should hopefully ensure that the Jenkins
failures disappear. Note that simply stopping replicas would not have exercised
this path, because the code only selects replicas that are up. I've also
increased the replication factor from 1 to 3 and the number of nodes from 3 to 5
for {{test_bulk_round_trip_blogposts}} to give it more resilience.
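As a rough illustration of why a connection cap makes rejected connections reproducible, here is a minimal Python sketch. {{LimitedHost}}, {{ConnectionRejected}} and {{count_rejections}} are hypothetical names invented for this example; this is a toy model of the idea, not Cassandra, driver, or dtest code.

```python
class ConnectionRejected(Exception):
    """Raised when a host is already at its connection limit."""


class LimitedHost:
    """Hypothetical stand-in for a node that caps concurrent connections."""

    def __init__(self, name, max_connections):
        self.name = name
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self):
        # Reject deterministically once the cap is reached.
        if self.open_connections >= self.max_connections:
            raise ConnectionRejected(self.name)
        self.open_connections += 1

    def disconnect(self):
        if self.open_connections > 0:
            self.open_connections -= 1


def count_rejections(host, attempts):
    """Open `attempts` connections without closing any; count the refusals."""
    rejected = 0
    for _ in range(attempts):
        try:
            host.connect()
        except ConnectionRejected:
            rejected += 1
    return rejected
```

With a cap lower than the number of worker connections, some attempts are always refused, whereas a stopped replica would simply never be selected.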
I've changed the COPY TO connection logic to try multiple replicas one by one
in case of failure. Previously we were giving multiple replicas to the load
balancing policy, but the only contact point was the chosen replica. More
importantly, if all replicas fail, we no longer kill the worker process, which
would halt the entire export. Instead we return an error for that token, which
means that the token is tried again later, up to MAXATTEMPTS times.
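The behaviour described above can be sketched roughly as follows. This is a simplified model, not the code in the patch: {{export_range}}, {{export_all}}, {{ConnectionFailed}} and the {{connect_and_read}} callback are hypothetical names, and the {{MAX_ATTEMPTS}} value of 5 is an assumption.

```python
from collections import deque

MAX_ATTEMPTS = 5  # assumed value; the comment above only names MAXATTEMPTS


class ConnectionFailed(Exception):
    """Hypothetical error raised when a replica refuses a connection."""


def export_range(token_range, replicas, connect_and_read):
    """Try each replica in turn; return (rows, None) or (None, error)."""
    last_error = None
    for replica in replicas:
        try:
            return connect_and_read(replica, token_range), None
        except ConnectionFailed as exc:
            last_error = exc  # remember the failure, move on to the next replica
    if last_error is None:
        last_error = ConnectionFailed("no replicas for %r" % (token_range,))
    # All replicas failed: report an error instead of killing the worker.
    return None, last_error


def export_all(ranges, connect_and_read, max_attempts=MAX_ATTEMPTS):
    """`ranges` maps a token range to its list of replicas."""
    pending = deque((token_range, 1) for token_range in ranges)
    rows, failed = [], {}
    while pending:
        token_range, attempt = pending.popleft()
        result, error = export_range(
            token_range, ranges[token_range], connect_and_read)
        if error is None:
            rows.extend(result)
        elif attempt < max_attempts:
            # Requeue at the back so the range is tried again later.
            pending.append((token_range, attempt + 1))
        else:
            failed[token_range] = str(error)
    return rows, failed
```

The point of the design is that a range whose replicas are all refusing connections only costs that range's retries; the remaining ranges keep being exported.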
New test code is
[here|https://github.com/stef1927/cassandra-dtest/commits/10938].
The [2.1 patch|https://github.com/stef1927/cassandra/commits/10938-2.1] stands
on its own. The
[2.2 patch|https://github.com/stef1927/cassandra/commits/10938-2.2] is
identical to the 2.1 patch except for a conflict in the imports, and it applies
cleanly upwards.
CI is still pending:
||2.1||2.2||3.0||3.3||trunk||
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.2-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.0-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.3-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.2-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.0-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.3-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-dtest/]|
[~pauloricardomg] could you review the Python changes? Sylvain has already noted
above that the change from NBHM to CHM is fine.
> test_bulk_round_trip_blogposts is failing occasionally
> ------------------------------------------------------
>
> Key: CASSANDRA-10938
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10938
> Project: Cassandra
> Issue Type: Sub-task
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 6452.nps, 6452.png, 7300.nps, 7300a.png, 7300b.png,
> node1_debug.log, node2_debug.log, node3_debug.log, recording_127.0.0.1.jfr
>
>
> We get timeouts occasionally that cause the number of records to be incorrect:
> http://cassci.datastax.com/job/trunk_dtest/858/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts/
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)