[
https://issues.apache.org/jira/browse/CASSANDRA-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080872#comment-15080872
]
Stefania edited comment on CASSANDRA-10938 at 1/4/16 12:19 PM:
---------------------------------------------------------------
The attached flight recorder file, _recording_127.0.0.1.jfr_, provides the best
information for understanding the problem: about 15 shared pool worker threads are
busy copying the {{NonBlockingHashMap}} that we use to store the query states
in {{ServerConnection}}. This consumes 99% of the CPU on the machine (note that
I lowered the priority of the process when I recorded that file).
We store one entry per stream id and we never clean this map, but this is not
the issue. When inserting data with cassandra-stress, we use up to 33k stream
ids, whilst when inserting data with COPY FROM the python driver is careful to
reuse stream ids and we only use around 300 of them. So the map should not be
resized as much, and yet the problem occurs with COPY FROM (approximately once
in every twenty runs) and never with cassandra-stress. The difference between
the two is probably that COPY FROM issues more concurrent requests, hence a
higher concurrency level on the map.
Of all the hot threads in the flight recorder file, only one is doing a
{{putIfAbsent}} whilst the other ones are simply accessing a value via a
{{get}}. However, the map is designed so that all threads help with the copy,
and this is what's happening here. I suspect a bug that prevents threads from
making progress and keeps them spinning.
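For illustration, this is roughly the access pattern in question (a minimal
sketch only, with a hypothetical placeholder for the query state class, not
the actual {{ServerConnection}} code):
{code:java}
import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class QueryStates
{
    // Hypothetical placeholder for the per-stream query state.
    static final class QueryState {}

    // One entry per stream id, never cleaned, as described above.
    private final ConcurrentMap<Integer, QueryState> queryStates = new NonBlockingHashMap<>();

    // Hot path: a plain get(), with putIfAbsent() only the first time a
    // stream id is seen. When a resize is in progress, NonBlockingHashMap
    // asks every caller, readers included, to help copy the table, which is
    // what the flight recording shows.
    QueryState getQueryState(int streamId)
    {
        QueryState state = queryStates.get(streamId);
        if (state == null)
        {
            state = new QueryState();
            QueryState previous = queryStates.putIfAbsent(streamId, state);
            if (previous != null)
                state = previous;
        }
        return state;
    }
}
{code}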
We are currently using the latest available version of {{NonBlockingHashMap}},
version 1.0.6, from [this
repository|https://github.com/boundary/high-scale-lib].
We have a number of options:
- Fix {{NonBlockingHashMap}}
- Replace it
- Instantiate it with an initial size large enough to prevent resizing (4K
fixes this specific case; see the sketch below).
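A minimal sketch of the third option, assuming the {{NonBlockingHashMap}}
constructor that takes an initial size and reusing the hypothetical
{{QueryState}} placeholder from the sketch above:
{code:java}
import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class PreSizedQueryStates
{
    // Hypothetical placeholder for the per-stream query state.
    static final class QueryState {}

    // Pre-sizing the table keeps the cooperative resize/copy off the hot
    // path for the stream ids we actually use; 4096 is the value that
    // fixed this specific case.
    private final ConcurrentMap<Integer, QueryState> queryStates =
            new NonBlockingHashMap<>(4096);
}
{code}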
> test_bulk_round_trip_blogposts is failing occasionally
> ------------------------------------------------------
>
> Key: CASSANDRA-10938
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10938
> Project: Cassandra
> Issue Type: Sub-task
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x
>
> Attachments: 6452.nps, 6452.png, 7300.nps, 7300a.png, 7300b.png,
> node1_debug.log, node2_debug.log, node3_debug.log, recording_127.0.0.1.jfr
>
>
> We get timeouts occasionally that cause the number of records to be incorrect:
> http://cassci.datastax.com/job/trunk_dtest/858/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts/
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)