[ 
https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth reopened CASSANDRA-12280:
---------------------------------------
    Reproduced In: 3.7, 3.0.8, 3.9  (was: 3.0.8)

I encountered this issue again (and again and again).
Last time it even happened with the keyspace mentioned in the issue 
description, which contains exactly 6 records in table "dislike" and nothing 
else.
There are currently no reads or writes on that keyspace. Other keyspaces in the 
cluster are already in production, so the cluster itself is a bit busy but far 
from being overloaded.
We use Reaper for queuing the repairs. The repair that hung was a parallel 
repair with a token range on the whole keyspace.
The repair cannot be cancelled via JMX (or by Reaper, which also uses JMX); the 
JMX call hangs as well. Only restarting all the nodes with a hanging repair helps.
I don't see any logs indicating a hard error like broken pipe, timeouts, ...
tpstats shows the hanging repairs, no compactions are ongoing or pending. 
netstats shows 1 or 2 pending messages all the time but it is hard to tell if 
they belong to the hanging repair.

To me it smells like a deadlock caused by a race condition. Could this be 
related to MVs? Maybe it happens if a base table and a related MV are repaired 
at the same time?

Sometimes I saw something like "Could not create snapshot" in the logs. But it 
is not so easy to tell if that was a cause or an effect.

Are there any tools to dig deeper? More detailed logging? A way to get a trace 
of the repair thread? I mean, there are not many ways to "hang": either the 
thread is waiting for IO or it is blocked on a lock. It should be quite easy to 
find out what's going on once we see the backtrace.

> nodetool repair hangs
> ---------------------
>
>                 Key: CASSANDRA-12280
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a keyspace, does not hang when 
> repairing table/mv by table/mv.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv 
> match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv 
> like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
