[
https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421200#comment-15421200
]
Benjamin Roth commented on CASSANDRA-12280:
-------------------------------------------
Some traces of hanging repairs (or rather, hanging streams):
A repair that hung and ended in a broken pipe:
- Trace, netstats, compactionstats: http://pastebin.com/sFhe1NpZ
Trace of another run of the same range repair, parallel; it hung for about 23
minutes but finished successfully:
- Trace: https://cl.ly/1s3b2F3o3900
Trace of another run of the same range, sequential, executed while the network
was completely saturated (artificially, using iperf):
- Network graphs: https://cl.ly/0A030X2m463z / https://cl.ly/2F2E412i2Q07
- Trace: https://cl.ly/2b3y1C1O243k
It completed much faster even though it was run sequentially AND the network
was fully saturated - it just had shorter streaming lags.
These are only a few examples.
Is it possible that there exist some blocking / deadlock scenarios in
streaming?
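For the next hang I would capture thread dumps roughly like this to look for
blocked streaming threads (the STREAM-IN / STREAM-OUT thread name prefixes and
the pgrep pattern are assumptions on my side):
# dump all JVM threads and keep only the streaming ones; two dumps a few
# minutes apart showing the same threads stuck in WAITING / BLOCKED with no
# progress would point to a lock problem
jstack $(pgrep -f CassandraDaemon) > /tmp/threads-$(date +%s).txt
grep -A 25 -E 'STREAM-(IN|OUT)' /tmp/threads-*.txt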
I don't claim that our network stack is 100% perfectly tuned, but it is very
unlikely that these pauses are caused by the network layer or by overloaded
disks / CPUs. I applied most of the suggested sysctl parameters from Al's
tuning guide
(https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html). I am also
able to easily push 700-900 Mbit/s between the affected nodes in addition to C*
running in normal operation.
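For reference, this is the kind of TCP buffer tuning I applied (values written
down from memory, so they may differ slightly from the guide):
# /etc/sysctl.d/99-cassandra-net.conf - larger socket buffers for fast links
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216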
To be sure there is no filesystem issue, I copied all SSTables for that CF
(around 13 GB) over the network to the host that is also part of the repair
job - this worked as expected, with a throughput of 90-100 MB/s.
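The copy test was essentially this (paths and host name are placeholders, the
data directory suffix differs per table):
# copy the CF's SSTables to the repair peer to rule out plain disk/network limits
rsync -a --progress /var/lib/cassandra/data/likes/like_out-*/ other-node:/tmp/streamtest/
# sustained 90-100 MB/s for the whole ~13 GB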
I am aware that streaming is much more than transferring some files. As far as
I know, C* uses the normal dataflow during a stream (memtable > sstable >
compaction ...), but a stream that hangs around for many minutes without an
obvious reason is really obscure.
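During a hang I keep an eye on the stream progress with something like this
(nothing fancy, just to see whether the byte counters move at all):
# refresh streaming status every 10s; files already at 100% are hidden.
# a session whose counters do not change for many minutes is what I call hanging
watch -n 10 "nodetool netstats | grep -v '100%'"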
I also checked the CPU / alloc stats of the affected nodes with sjk-plus. Here
too, there was no obvious activity like StreamReceiverTask, Compaction, ...
only normal operational activity. It behaves as if a stale lock were lingering
around somewhere.
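For completeness, that check looked roughly like this (sjk options from memory,
jar path is a placeholder):
# top-like view of per-thread CPU / allocation in the Cassandra JVM
java -jar sjk-plus.jar ttop -p $(pgrep -f CassandraDaemon) -o CPU -n 30
# no stream-receive or compaction threads near the top, only the usual
# read/write stage threads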
Anything more I can do?
> nodetool repair hangs
> ---------------------
>
> Key: CASSANDRA-12280
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
> Project: Cassandra
> Issue Type: Bug
> Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a keyspace; it does not hang when
> repairing table/mv by table/mv.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv
> match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv
> like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097