[
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ben Chan updated CASSANDRA-5483:
--------------------------------
Attachment: 5483-v13-608fb03-May-14-trace-formatting-changes.patch
May 14 formatting changes are in
[^5483-v13-608fb03-May-14-trace-formatting-changes.patch] (based on commit
608fb03).
{quote}
I think the session log messages are still confusing, especially since we use
the same term for "repairing a subrange" and "streaming data."
{quote}
Currently the "session" terminology is baked into the source code, in
{{StreamSession.java}} and {{RepairSession.java}}. If the messages are changed
to reflect different terminology, hopefully the source code can eventually be
changed to match (fewer special cases to remember). Perhaps the best thing is
to always qualify them, e.g. "stream session" and "repair session"?
{quote}
I don't actually see the session uuid being used in the logs except at
start/finish.
{quote}
Sorry, that was another inadvertent mixing of nodetool messages and trace
output. {{\[2014-05-13 23:49:52,283] Repair session
cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range
(3074457345618258602,-9223372036854775808] finished}} is not a trace, but a
separate (pre-patch) sendNotification in {{StorageService.java}}. This message
(and some of the error messages, I think) is redundant when combined with trace
output. It should have been either one or the other, not both. In the trace
proper, the session UUID only shows up at the start.
But note: not all nodetool messages are rendered redundant by trace output.
Since we can't just suppress all non-trace sendNotification messages, we need
a way to unambiguously tell nodetool trace output apart from normal
sendNotification messages. I'm currently leaning towards marking all
sendNotification trace output with a {{TRACE:}} tag.
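Roughly the kind of thing I mean (just a sketch; {{notifyTrace}} and the
surrounding details are illustrative, not code from the attached patch):
{code:java}
// Sketch only: funnel every trace event that is replicated back to
// nodetool through a single helper, so it always carries the TRACE: tag
// and can't be confused with a normal progress notification.
private void notifyTrace(String traceMessage)
{
    Notification notification = new Notification("repair",
                                                 jmxObjectName,
                                                 notificationSerialNumber.incrementAndGet(),
                                                 "TRACE: " + traceMessage);
    sendNotification(notification);
}
{code}
nodetool would then only need a prefix check to decide whether a notification
is trace output or an ordinary status message.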
The repair session UUIDs used to be prepended to everything, but were removed
in [^5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch]. Without
them, things are less verbose, but it's sometimes hard to unambiguously follow
traces for concurrent repair sessions. To make the point clearer, I've marked
each sub-task graphically in the nodetool trace output below (I've
cross-checked this with the logs, which do retain the UUIDs). If you cover up
the left side, it's harder to figure out which trace goes with which sub-task.
Real-world repair traces will probably be even more confusing.
Note: indentation here does not denote nesting; the column roughly indicates
task identity, though I reuse columns when it's not ambiguous.
{noformat}
1     [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 ranges for s1.users (seq=true, full=true)
  x   [2014-05-15 11:31:37,922] Syncing range (-3074457345618258603,3074457345618258602]
  x   [2014-05-15 11:31:38,108] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:39,953] Received merkle tree for users from /127.0.0.2
  x   [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:41,279] Received merkle tree for users from /127.0.0.3
  x   [2014-05-15 11:31:42,632] Received merkle tree for users from /127.0.0.1
    x [2014-05-15 11:31:42,671] Syncing range (-9223372036854775808,-3074457345618258603]
    x [2014-05-15 11:31:42,766] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
  x   [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
  x   [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
  x   [2014-05-15 11:31:43,084] Completed sync of range (-3074457345618258603,3074457345618258602]
    x [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
    x [2014-05-15 11:31:43,422] Received merkle tree for users from /127.0.0.2
    x [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
    x [2014-05-15 11:31:44,637] Received merkle tree for users from /127.0.0.3
    x [2014-05-15 11:31:45,474] Received merkle tree for users from /127.0.0.1
  x   [2014-05-15 11:31:45,494] Syncing range (3074457345618258602,-9223372036854775808]
    x [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
    x [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
    x [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
    x [2014-05-15 11:31:45,564] Completed sync of range (-9223372036854775808,-3074457345618258603]
  x   [2014-05-15 11:31:45,581] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:46,214] Received merkle tree for users from /127.0.0.2
  x   [2014-05-15 11:31:46,233] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:46,459] Received merkle tree for users from /127.0.0.3
  x   [2014-05-15 11:31:46,472] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:46,545] Received merkle tree for users from /127.0.0.1
  x   [2014-05-15 11:31:46,574] Endpoint /127.0.0.3 has 1 range(s) out of sync with /127.0.0.1 for users
  x   [2014-05-15 11:31:46,575] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
  x   [2014-05-15 11:31:46,581] Endpoint /127.0.0.2 has 1 range(s) out of sync with /127.0.0.1 for users
  x   [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges with /127.0.0.2
  x   [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges with /127.0.0.3
  x   [2014-05-15 11:31:46,910] Streaming session with /127.0.0.2 prepared.
  x   [2014-05-15 11:31:46,949] Streaming session with /127.0.0.3 prepared.
  x   [2014-05-15 11:31:46,991] 87/87 bytes (100%) received from idx:0/127.0.0.3
  x   [2014-05-15 11:31:47,004] 87/87 bytes (100%) received from idx:0/127.0.0.2
  x   [2014-05-15 11:31:47,053] Streaming session with /127.0.0.3 completed successfully.
  x   [2014-05-15 11:31:47,112] Streaming session with /127.0.0.2 completed successfully.
  x   [2014-05-15 11:31:47,131] Completed sync of range (3074457345618258602,-9223372036854775808]
1     [2014-05-15 11:31:47,133] Repair command #1 finished
2     [2014-05-15 11:31:47,180] Starting repair command #2, repairing 2 ranges for system_traces.sessions, system_traces.events (seq=true, full=true)
  x   [2014-05-15 11:31:47,189] Syncing range (-3074457345618258603,3074457345618258602]
  x   [2014-05-15 11:31:47,260] Requesting merkle trees for sessions from [/127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:47,344] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces.sessions
  x   [2014-05-15 11:31:47,400] Received merkle tree for sessions from /127.0.0.3
  x   [2014-05-15 11:31:47,577] Received merkle tree for sessions from /127.0.0.1
  x   [2014-05-15 11:31:47,617] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for sessions
  x   [2014-05-15 11:31:47,743] Requesting merkle trees for events from [/127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:48,010] Received merkle tree for events from /127.0.0.3
  x   [2014-05-15 11:31:48,027] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces.events
  x   [2014-05-15 11:31:48,245] Received merkle tree for events from /127.0.0.1
    x [2014-05-15 11:31:48,253] Syncing range (3074457345618258602,-9223372036854775808]
    x [2014-05-15 11:31:48,294] Requesting merkle trees for sessions from [/127.0.0.2, /127.0.0.1]
  x   [2014-05-15 11:31:48,313] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for events
  x   [2014-05-15 11:31:48,322] Completed sync of range (-3074457345618258603,3074457345618258602]
    x [2014-05-15 11:31:48,372] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces.sessions
    x [2014-05-15 11:31:48,424] Received merkle tree for sessions from /127.0.0.2
    x [2014-05-15 11:31:48,582] Received merkle tree for sessions from /127.0.0.1
    x [2014-05-15 11:31:48,613] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for sessions
    x [2014-05-15 11:31:48,641] Requesting merkle trees for events from [/127.0.0.2, /127.0.0.1]
    x [2014-05-15 11:31:48,765] Received merkle tree for events from /127.0.0.2
    x [2014-05-15 11:31:48,792] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces.events
    x [2014-05-15 11:31:48,942] Received merkle tree for events from /127.0.0.1
    x [2014-05-15 11:31:48,973] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for events
    x [2014-05-15 11:31:48,985] Completed sync of range (3074457345618258602,-9223372036854775808]
2     [2014-05-15 11:31:48,990] Repair command #2 finished
{noformat}
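If the full UUIDs are too noisy to bring back, one possible middle ground
(sketch only; {{withSessionTag}} and its call sites are hypothetical, not in
any attached patch) is a shortened session-id prefix on each trace message:
{code:java}
// Sketch: the first 8 hex digits of the repair session UUID are plenty
// to tell apart the handful of sessions that run concurrently, without
// repeating the full UUID on every line.
private static String withSessionTag(UUID sessionId, String message)
{
    return "[repair #" + sessionId.toString().substring(0, 8) + "] " + message;
}
{code}
Every trace call would then go through something like
{{Tracing.trace(withSessionTag(sessionId, message))}}, which would restore the
ability to follow concurrent sessions in nodetool output.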
> Repair tracing
> --------------
>
> Key: CASSANDRA-5483
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Yuki Morishita
> Assignee: Ben Chan
> Priority: Minor
> Labels: repair
> Attachments: 5483-full-trunk.txt,
> 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
> 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
> 5483-v07-08-Fix-brace-style.patch,
> 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
> 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch,
> 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch,
> 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch,
> 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch,
> 5483-v08-14-Poll-system_traces.events.patch,
> 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch,
> 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch,
> 5483-v10-17-minor-bugfixes-and-changes.patch,
> 5483-v10-rebased-and-squashed-471f5cc.patch, 5483-v11-01-squashed.patch,
> 5483-v11-squashed-nits.patch, 5483-v12-02-cassandra-yaml-ttl-doc.patch,
> 5483-v13-608fb03-May-14-trace-formatting-changes.patch, ccm-repair-test,
> cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff,
> test-5483-system_traces-events.txt,
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
> tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
> [email protected],
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results the same way query
> tracing stores traces in the system keyspace. With it, you wouldn't have to
> look up each log file to see the status of the repair you invoked and how it
> performed. Instead, you could query the repair log with the session ID to see
> the state and stats of all nodes involved in that repair session.