[
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922721#comment-13922721
]
Ben Chan commented on CASSANDRA-5483:
-------------------------------------
It was more involved than I thought, partly because of heisenbugs and the trace
state mysteriously not propagating (see {{v06-05}}).
Note: changing the JMX interface can cause mysterious errors unless you rebuild
with {{ant clean && ant}}. I ran into the same kinds of stack traces as you did.
The failure isn't consistent: sometimes I can make a JMX change and run a plain
{{ant}} with no problem.
To make applying the patches simpler, I'm posting full repro code. I also tried
to simplify the naming. Unfortunately, all the previous patches are in jumbled
order due to a naming convention that doesn't sort. Fortunately, JIRA seems to
have an easter egg where you can choose the attachment name by changing the URL.
{noformat}
# Uncomment to exactly reproduce state.
#git checkout -b 5483-e30d6dc e30d6dc
# Download all needed patches with consistent names, apply patches, build.
W=https://issues.apache.org/jira/secure/attachment
for url in \
$W/12630490/5483-v02-01-Trace-filtering-and-tracestate-propagation.patch \
$W/12630491/5483-v02-02-Put-a-few-traces-parallel-to-the-repair-logging.patch \
$W/12631967/5483-v03-03-Make-repair-tracing-controllable-via-nodetool.patch \
$W/12633153/5483-v06-04-Allow-tracing-ttl-to-be-configured.patch \
$W/12633154/5483-v06-05-Add-a-command-column-to-system_traces.events.patch \
$W/12633155/5483-v06-06-Fix-interruption-in-tracestate-propagation.patch \
$W/12633156/ccm-repair-test
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v0[236]-*.patch &&
ant clean && ant
# put on a separate line because you should at least minimally inspect
# arbitrary code before running.
chmod +x ./ccm-repair-test && ./ccm-repair-test
{noformat}
{{ccm-repair-test}} has some options for convenience:
{noformat}
-k keep (don't delete) the created cluster after successful exit.
-r repair only
-R don't repair
-t do traced repair only
-T don't do traced repair (if neither, then do both traced and untraced repair)
{noformat}
The output of a test run:
{noformat}
Current cluster is now: test-5483-QiR
[2014-03-06 10:46:13,617] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:13,646] Starting repair command #1, repairing 2 ranges for
keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:16,999] Repair session 72648190-a546-11e3-a5f4-f94811c7b860
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:17,465] Repair session 73ee2ed0-a546-11e3-a5f4-f94811c7b860
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:17,465] Repair command #1 finished
[2014-03-06 10:46:17,485] Starting repair command #2, repairing 2 ranges for
keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:18,782] Repair session 74aaef20-a546-11e3-a5f4-f94811c7b860
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:18,816] Repair session 74ff0290-a546-11e3-a5f4-f94811c7b860
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:18,816] Repair command #2 finished
0 rows exported in 0.015 seconds.
test-5483-QiR-system_traces-events.txt
ok
[2014-03-06 10:46:24,128] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:24,166] Starting repair command #3, repairing 2 ranges for
keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:25,366] Repair session 78a6d4e0-a546-11e3-a5f4-f94811c7b860
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:25,415] Repair session 79263e10-a546-11e3-a5f4-f94811c7b860
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:25,415] Repair command #3 finished
[2014-03-06 10:46:25,485] Starting repair command #4, repairing 2 ranges for
keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:27,077] Repair session 796f7c10-a546-11e3-a5f4-f94811c7b860
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:27,120] Repair session 79f240a0-a546-11e3-a5f4-f94811c7b860
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:27,120] Repair command #4 finished
48 rows exported in 0.104 seconds.
test-5483-QiR-system_traces-events-tr.txt
found source: 127.0.0.1
found thread: Thread-15
found thread: AntiEntropySessions:1
found thread: RepairJobTask:1
found source: 127.0.0.2
found thread: AntiEntropyStage:1
found source: 127.0.0.3
found thread: AntiEntropySessions:2
found thread: Thread-16
found thread: AntiEntropySessions:3
found thread: AntiEntropySessions:4
unique sources traced: 3
unique threads traced: 8
All thread categories accounted for
ok
{noformat}
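For readers who don't want to open {{ccm-repair-test}}, here is a rough sketch of how the "unique threads" and "thread categories" figures above could be derived from the thread names in the trace export. This is not the script's actual logic. The class name {{TraceCoverageCheck}} and the rule "strip a trailing {{:N}} or {{-N}} to get the category" are assumptions made for illustration; the thread names are copied from the output above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of ccm-repair-test or Cassandra.
final class TraceCoverageCheck {
    // Assumed rule: drop a trailing ":N" or "-N" to get the thread category,
    // e.g. "AntiEntropySessions:3" -> "AntiEntropySessions".
    public static String category(String threadName) {
        return threadName.replaceFirst("[:-]\\d+$", "");
    }

    // Distinct categories, in first-seen order.
    public static Set<String> categories(List<String> threadNames) {
        Set<String> result = new LinkedHashSet<>();
        for (String name : threadNames) {
            result.add(category(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Thread names as they appear in the test output above.
        List<String> threads = Arrays.asList(
            "Thread-15", "AntiEntropySessions:1", "RepairJobTask:1",
            "AntiEntropyStage:1", "AntiEntropySessions:2", "Thread-16",
            "AntiEntropySessions:3", "AntiEntropySessions:4");

        System.out.println("unique threads traced: " + new LinkedHashSet<>(threads).size());
        System.out.println("thread categories: " + categories(threads));
    }
}
```

A check along these lines only needs the set of expected categories to compare against, so it keeps working when the numeric suffixes change between runs.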
---
Patch comments:
- {{v06-04}} I did something similar to {{v03-03}}, with (almost) no refactoring.
The implementation is a little messy architecturally.
- {{v06-05}} This is your suggestion to add a "command" column. I don't
know how to make it the last column. At least on my box, it's column 5 of 7
despite my putting it last in the CQL. Note that {{ccm-repair-test}}'s checking
code will break if the column order changes.
- {{v06-06}} You need to submit {{Runnable}}s, etc. using
{{DebuggableThreadPoolExecutor}} if you want them to inherit tracestate.
Tracestate propagation is very easy to break under concurrency, so this is
probably the first thing to check if propagation ever breaks again.
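
To illustrate the kind of failure {{v06-06}} addresses, here is a minimal sketch of per-thread state getting lost when a task is handed to a pool thread, and of one way to carry it across. It is not Cassandra's code and does not show how {{DebuggableThreadPoolExecutor}} is implemented. It uses only {{java.util.concurrent}}; {{TRACE_STATE}}, {{withTraceState}} and the class name are hypothetical names chosen for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not part of Cassandra.
final class TraceStatePropagation {
    // Stand-in for a per-thread trace state (assumption: a plain ThreadLocal).
    public static final ThreadLocal<String> TRACE_STATE = new ThreadLocal<>();

    // Capture the submitting thread's state now, install it on the worker
    // thread while the task runs, then restore whatever the worker had.
    public static <T> Callable<T> withTraceState(Callable<T> task) {
        final String captured = TRACE_STATE.get();
        return () -> {
            String previous = TRACE_STATE.get();
            TRACE_STATE.set(captured);
            try {
                return task.call();
            } finally {
                // Pool threads are reused, so don't leak state into later tasks.
                TRACE_STATE.set(previous);
            }
        };
    }

    // Returns { state seen by a plain submit, state seen by a wrapped submit }.
    public static String[] demo() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_STATE.set("repair-session-1");
            Callable<String> readState = () -> TRACE_STATE.get();
            Future<String> plain = pool.submit(readState);
            Future<String> wrapped = pool.submit(withTraceState(readState));
            return new String[] { plain.get(), wrapped.get() };
        } finally {
            TRACE_STATE.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] seen = demo();
        System.out.println("plain submit saw:   " + seen[0]);
        System.out.println("wrapped submit saw: " + seen[1]);
    }
}
```

The plain submit sees no state because the worker thread has its own thread-local slot; only the wrapped task sees the submitter's value. Any code path that submits work without going through such a wrapper silently drops the state, which is why this is easy to break.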
> Repair tracing
> --------------
>
> Key: CASSANDRA-5483
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Yuki Morishita
> Assignee: Ben Chan
> Priority: Minor
> Labels: repair
> Attachments: 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
> ccm-repair-test, test-5483-system_traces-events.txt,
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
> tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
> [email protected],
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results the way query tracing
> stores traces in the system keyspace. With it, you don't have to look up each log
> file to see what the status was and how the repair you invoked performed.
> Instead, you can query the repair log with the session ID to see the state and
> stats of all nodes involved in that repair session.