[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-12-07 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237144#comment-14237144
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Thanks for helping get this through the final (?) set of speed bumps. As fast 
as this codebase moves, I'm surprised that last rebase went so smoothly. (NRN)


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Fix For: 3.0

 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, 
 5483-v10-17-minor-bugfixes-and-changes.patch, 
 5483-v10-rebased-and-squashed-471f5cc.patch, 5483-v11-01-squashed.patch, 
 5483-v11-squashed-nits.patch, 5483-v12-02-cassandra-yaml-ttl-doc.patch, 
 5483-v13-608fb03-May-14-trace-formatting-changes.patch, 
 5483-v14-01-squashed.patch, 
 5483-v15-02-Hook-up-exponential-backoff-functionality.patch, 
 5483-v15-03-Exact-doubling-for-exponential-backoff.patch, 
 5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch, 
 5483-v15-05-Move-command-column-to-system_traces.sessions.patch, 
 5483-v15.patch, 5483-v17-00.patch, 5483-v17-01.patch, 5483-v17.patch, 
 ccm-repair-test, cqlsh-left-justify-text-columns.patch, 
 prerepair-vs-postbuggedrepair.diff, test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results like query tracing 
 stores traces to system keyspace. With it, you don't have to lookup each log 
 file to see what was the status and how it performed the repair you invoked. 
 Instead, you can query the repair log with session ID to see the state and 
 stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-12-02 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231840#comment-14231840
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Sorry; limited computer access last week, zero access to dev box (network setup 
issues), turkey.



Branch 5483_squashed (currently at commit 78686c61e38e, merges cleanly with 
trunk at 06f626acd27b) looks fine to me; looks like there were only mechanical 
changes needed to resolve the merge conflicts.

Ran my standard simplistic test, and no obvious problems (for both 78686c61e38e 
and again when merged with 06f626acd27b).

Any problem is likely to be something I did, or some unforeseen interaction with 
code in the updated trunk. Hopefully neither of those possibilities is true.  
I'll keep watching this thread/issue just in case.



[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-11-14 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212771#comment-14212771
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Updated https://github.com/usrbincc/cassandra/tree/5483-review (currently at 
commit 85d974753dc7f).

So a few changes to tracing got merged in on Nov 12, making a cut-and-dried 
manual rebase impossible.

- Use a default ttl on the system_traces.* CFs. Use that to clean up the code 
in Tracing, since you no longer have to specify ttls.
- Move a lot of row-insertion code out of Tracing into a new TraceKeyspace 
class.
- Since there is no more need to specify ttl, use CFRowAdder (which doesn't do 
ttl) for convenience.

Since the repair tracing patch requires a user-configurable ttl (or at the very 
least, a different ttl for repair tracing and query tracing), I needed to 
re-add ttl specification. Since it wasn't much more code, I decided to only use 
explicitly-expiring cells if the ttl didn't match the default ttl.  
  
From a quick skim of the source code, this appears to save two native ints per 
column. I'm not sure that's really enough to care about (especially since repair 
tracing, which is likely to insert more rows, has a ttl different from the 
default), but it was a simple enough change.
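
As a rough illustration of the "only use explicitly-expiring cells if the ttl 
differs from the default" rule described above: this is a sketch only. The 
names TraceTtlChoice and explicitTtl, and the ttl values in the usage note, are 
made up for this example and are not taken from the patch.

```java
import java.util.OptionalInt;

// Hypothetical illustration; not a class from the patch.
final class TraceTtlChoice
{
    private TraceTtlChoice() {}

    /**
     * Decide whether a cell needs its own ttl.
     *
     * @param requestedTtlSeconds ttl wanted for this trace type
     * @param tableDefaultTtlSeconds default ttl already set on the table
     * @return empty if the table default covers it (write a plain cell),
     *         otherwise the ttl to attach to an explicitly expiring cell
     */
    static OptionalInt explicitTtl(int requestedTtlSeconds, int tableDefaultTtlSeconds)
    {
        return requestedTtlSeconds == tableDefaultTtlSeconds
             ? OptionalInt.empty()
             : OptionalInt.of(requestedTtlSeconds);
    }
}
```

With an assumed table default of 86400 seconds, explicitTtl(86400, 86400) is 
empty (a plain cell, the table default applies), while explicitTtl(604800, 
86400) carries 604800 (an explicitly expiring cell).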



Merge conflicts are a real pain point with git. I attempted to make reviewing 
the merge conflict resolution changes easier by inserting an intermediate 
commit that includes the conflict markers from git, unmodified. Feel free to 
hide all the messiness with a {{git merge --squash}}.

- Fixed a race condition with repair trace sessions sometimes not being 
properly stopped (see commit 801b6fbf56771).
- Possible code no-no to note: I made two private fields (SESSIONS_TABLE and 
EVENTS_TABLE) in TraceKeyspace public. I only use EVENTS_TABLE, but made them 
both public for symmetry (good? bad? don't know). They're final fields, if 
that's any consolation.



[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-11-10 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205451#comment-14205451
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Sorry; SMART and reallocated sectors. Which only really accounts for about 2-3 
days of the delay. But just to say that it wasn't merely a case of hangover 
from too much Halloween candy.



Updated https://github.com/usrbincc/cassandra/tree/5483-review (currently at 
commit 202a2e2e5e602).

Merges cleanly with trunk (at commit d286ac7d072fe), building and testing 
cleanly.

I decided to be a little opinionated and did some refactoring along the lines 
of my Oct 23 message.
- Used a TraceState#waitActivity function instead of TraceState#isDone 
(waitActivity gets closer to doing only what it says on the tin; makes it less 
hairy to comment).
- Moved all exponential backoff timeout code to StorageService#createQueryThread

In addition, I renamed TraceState#enableNotifications to 
TraceState#enableActivityNotification to attempt (naming is hard) to avoid 
confusion with TraceState#setNotificationHandle, which is entirely unrelated.
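
For reference, the backoff bookkeeping that goes with the polling thread can be 
sketched roughly as below. This is a minimal standalone sketch, not the code in 
the branch: RepairTracePollBackoff and its method names are hypothetical, and 
it assumes the rule from the Oct 23 message (double the timeout the second time 
a wait with that timeout comes back idle, reset to the minimum on activity).

```java
// Hypothetical helper (not a class from the patch): tracks the polling timeout.
final class RepairTracePollBackoff
{
    private final long minWaitMillis;
    private final long maxWaitMillis;
    private long timeoutMillis;
    private boolean timedOutOnce; // true once the current timeout has expired idle one time

    RepairTracePollBackoff(long minWaitMillis, long maxWaitMillis)
    {
        if (minWaitMillis <= 0 || maxWaitMillis < minWaitMillis)
            throw new IllegalArgumentException("need 0 < minWait <= maxWait");
        this.minWaitMillis = minWaitMillis;
        this.maxWaitMillis = maxWaitMillis;
        this.timeoutMillis = minWaitMillis;
    }

    /** Timeout to use for the next wait, in milliseconds. */
    long currentTimeoutMillis()
    {
        return timeoutMillis;
    }

    /** The last wait timed out with no activity: double on every second idle timeout. */
    void onIdle()
    {
        if (!timedOutOnce)
        {
            timedOutOnce = true;
            return;
        }
        // Exact doubling, capped at maxWaitMillis; the comparison avoids overflow.
        timeoutMillis = timeoutMillis > maxWaitMillis / 2 ? maxWaitMillis : timeoutMillis * 2;
        timedOutOnce = false;
    }

    /** New trace activity was seen: go back to the shortest wait. */
    void onActivity()
    {
        timeoutMillis = minWaitMillis;
        timedOutOnce = false;
    }
}
```

A polling loop would call currentTimeoutMillis() before each wait, then 
onIdle() or onActivity() depending on what the wait returned. Keeping this 
state in one place is the point of the sketch; the actual min and max values 
are left to the caller.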

Note: beyond having made this opinionated edit, I'm not planning to be 
particularly opinionated about advocating for it. All of that code should 
eventually go away once there is some way to get notified about table updates 
instead of having to do all that messy polling.

Extra note: Cassandra triggers seem to be very close to what is needed, if only 
they could be specified to run on a given node (i.e. the node that is being 
repaired). The last time I checked on this, this wasn't possible.



Unfiltered traces:

- The extra traces are generic message send-receive traces that existed prior 
to this patch. They were originally there for query tracing, which benefits 
from more detailed tracing.
- These extra traces were filtered out for repair up until v16 of this patch. 
This means that any discussions of trace messages prior to that point are 
referring to the filtered traces.

But I can't say that they're doing any real harm. I mean, it's only 3x the 
traces (estimated), and not an order of magnitude or more.

It's probably fine as it is. I certainly can't unequivocally state that there's 
no use for those extra traces. Besides, extra information can always be 
filtered out at a higher level (assuming it's tagged appropriately).


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-28 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187618#comment-14187618
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Communication seems to be off by a phase. My Oct 23 message may not have made 
it across. Either that or I wasn't writing clearly enough. (In my defense, 
natural language has a fiddly REPL and almost-useless error messages. And 
there's no source code.)

Summary:
- Forget about my earlier removal of the addCallback + sameThreadExecutor code. 
It turns out I didn't need to do that. The problem (I think) was some 
corruption in my build.
- Commenting isDone results in a convoluted comment. I was considering whether 
or not to refactor so that the comment could be less complicated.



For now I've gone ahead and commented isDone as best I can and fixed the naming 
that resulted in {{wait(wait);}}, along with various other fixes.

See https://github.com/usrbincc/cassandra/tree/5483-review (currently at commit 
15d8ec0a9fbbf).



Closing comments:
- For future reference: If you don't care about handling multiple consumers (as 
the current isDone does not), a large chunk of isDone (and its supporting code) 
could easily be replaced with a BlockingQueue and its {{offer}} and {{poll}} 
methods.
- Can I get a confirmation that the unfiltered repair traces (see my Oct 22 
message) in the current patch are acceptable?
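
To make the BlockingQueue remark in the closing comments concrete, here is a 
minimal single-consumer sketch. It is not code from the patch: 
TraceActivityQueue, its Status enum, and the method names are hypothetical, and 
the capacity-1 queue plus stopped flag is just one way to arrange it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for isDone and its supporting code; single consumer only.
final class TraceActivityQueue
{
    enum Status { IDLE, ACTIVE, STOPPED }

    // Capacity 1: repeated activity collapses into one pending signal.
    private final BlockingQueue<Status> signals = new ArrayBlockingQueue<>(1);
    private volatile boolean stopped;

    /** Producer side: note that new trace activity happened (never blocks). */
    void notifyActivity()
    {
        signals.offer(Status.ACTIVE); // returns false and drops the signal if one is already pending
    }

    /** Producer side: mark the session as finished and wake the consumer. */
    void stop()
    {
        stopped = true;
        signals.offer(Status.STOPPED); // if the slot is taken, the flag still reports the stop
    }

    /** Consumer side: wait up to timeoutMillis for a signal. */
    Status waitActivity(long timeoutMillis) throws InterruptedException
    {
        Status s = signals.poll(); // take anything already pending without blocking
        if (s == null && !stopped)
            s = signals.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (s != null)
            return s;
        return stopped ? Status.STOPPED : Status.IDLE;
    }
}
```

The consumer loops on waitActivity until it sees STOPPED, treating IDLE as a 
timeout. All of the wait/notify handling is delegated to the queue, which is 
where the simplification would come from.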



[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-23 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181643#comment-14181643
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Review branch up at:

https://github.com/usrbincc/cassandra/tree/5483-review



StorageService#createRepairTask:

So it turns out that my mistaken assumption was this: 
Tracing.instance.stopSession() needs to run in the main repair thread. (This 
was not one of the assumptions that I listed explicitly above).

Why did I believe this? Because repair kept on hanging at the end until I made 
this change as a sanity check.

Why does it not hang currently? I don't know, but I suspect there was something 
odd in the build that got cleaned out when I did an ant clean. Or when I 
manually cleaned out the build/lib/jars directory (there were multiple versions 
of the same library in there; ant clean was not taking care of this).

So I put back the original addCallback code structure.



Commenting isDone:

I started trying to do this, but it quickly devolved into just badly parroting 
the code. It would simplify commenting if I separated this function into 
multiple parts, maybe something like this:
{code:java}
// TraceState
public enum Status { IDLE, ACTIVE, STOPPED; }
private Status status;

/* tentative doc comment
 * Wait (with timeout) until status is ACTIVE or STOPPED. Reading an ACTIVE
 * status resets it to IDLE. Due to the status reset logic, this function only
 * works properly with a single consumer.
 * @param timeout timeout in milliseconds
 * @return activity status
 */
public synchronized Status getActivity(long timeout) throws InterruptedException
{
    // Note: This is just a placeholder for the general gist.
    while (status == Status.IDLE)
        wait(timeout); // TODO: we need to update timeout if this actually loops.
    Status retStatus = status;
    status = status == Status.ACTIVE ? Status.IDLE : status;
    return retStatus;
}

// createQueryThread's returned Runnable#run
while ((r = getActivity(timeout)) != TraceState.Status.STOPPED)
{
    if (r == TraceState.Status.IDLE) // double timeout if second time through with this timeout
    { ... }
    else // reset timeout to minWait
    { ... }
    // ...
}
{code}

Note that this would end up being structured like a cleaner version of 
[^5483-v14-01-squashed.patch], with further simplifications due to not trying 
to handle multiple consumers.
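
One possible way to handle the "update timeout if this actually loops" TODO in 
the placeholder above is to wait against a fixed deadline, so that an early or 
spurious wakeup only waits for the time that is left, and a timeout while still 
idle returns IDLE. The following is a standalone sketch under that assumption; 
ActivityMonitor, signalActivity, and stop are hypothetical names, not the 
TraceState code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical standalone class modelled on the getActivity sketch; single consumer only.
final class ActivityMonitor
{
    enum Status { IDLE, ACTIVE, STOPPED }

    private Status status = Status.IDLE;

    /** Producer side: record activity unless already active or stopped. */
    synchronized void signalActivity()
    {
        if (status == Status.IDLE)
        {
            status = Status.ACTIVE;
            notifyAll();
        }
    }

    /** Producer side: STOPPED is final and overrides a pending ACTIVE. */
    synchronized void stop()
    {
        status = Status.STOPPED;
        notifyAll();
    }

    /** Wait until ACTIVE or STOPPED, or until timeoutMillis has elapsed (returns IDLE). */
    synchronized Status getActivity(long timeoutMillis) throws InterruptedException
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (status == Status.IDLE)
        {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0)
                break; // timed out while still idle
            // Early or spurious wakeups loop back here with a smaller remaining time.
            TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
        }
        Status result = status;
        if (status == Status.ACTIVE)
            status = Status.IDLE; // reading ACTIVE resets it, hence single consumer
        return result;
    }
}
```

Measuring against System.nanoTime() rather than wall-clock time keeps the 
remaining-time arithmetic unaffected by clock adjustments.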



[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:

Attachment: 5483-v17-00.patch


[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:

Attachment: 5483-v17-01.patch


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180004#comment-14180004
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Thanks; node repair -full (in conjunction with ccm clear) does indeed force 
streaming repair.

[^5483-v17-00.patch] fixes a few trace calls that got lost/misplaced in the 
rebase.
[^5483-v17-01.patch] reorganizes some code; functionally identical.

{noformat}
# This bash code should work out-of-the-box with a stock setup.
JMXGET='/jmx_port/{p=$2;} /binary/{split($2,a,/\047/);h=a[2];}
  END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}'
ccm_nodetool() { local N="$1"; shift; $(ccm "$N" show | awk -F= "$JMXGET") "$@"; }
dl_apply_maybe() { for url; do { [ -e "$(basename "$url")" ] || curl -sO "$url"; } &&
  ! [ "${url%.patch}" = "$url" ] && git apply "$(basename "$url")"; done; }

NEW_BRANCH=$(date +5483-17--%Y%m%d-%H%M%S)
W=https://issues.apache.org/jira/secure/attachment

git checkout -b "$NEW_BRANCH" 49833b9 &&
dl_apply_maybe \
  "$W/12633156/ccm-repair-test" \
  "$W/12675963/5483-v17.patch" \
  "$W/12675963/5483-v17-00.patch" \
  "$W/12676340/5483-v17-01.patch" &&
ant clean && ant &&
chmod +x ./ccm-repair-test && ./ccm-repair-test -kR &&
ccm node1 stop && ccm node1 clear && ccm node1 start &&
ccm_nodetool node1 repair -tr -full &&
ccm node1 showlog | grep "Performing streaming repair"
{noformat}



Note that I switched some code to a different thread in order to facilitate 
trace handling -- see the diff hunk near the end of 
StorageService#createRepairTask. I needed certain calls to happen in the parent 
repair thread, and this seemed to be the simplest way.

As far as I can tell, there shouldn't be any differences in functionality or 
concurrency level (i.e. number of tasks that are executing concurrently), but 
someone should examine that section just to make sure.



The issue with unfiltered traces (see my previous message) still remains, but 
if push comes to shove, you can consider this as just a special case of having 
a trace call where it's not needed, or missing where it *is* needed.

In other words, a cosmetic problem. Fixing this should not require any changes 
or additions to the over-wire protocol or the JMX interface.


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180236#comment-14180236
 ] 

Ben Chan commented on CASSANDRA-5483:
-

I'm not planning on making any more changes pre-review. Feel free to go ahead.

(Though it would have been fine even if you had started prior to 5483-v17-00 
and 5483-v17-01; the differences are not substantial)


[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180236#comment-14180236
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 10/22/14 5:53 PM:
---

I'm not planning on making any more changes pre-review. Feel free to go ahead.

(Though it would have been fine even if you had started prior to 5483-v17-00 
and 5483-v17-01; the differences are not substantial. Edit: even if they had 
been substantial, you were well within your rights to start reviewing as soon 
as I posted up the initial patch.)



was (Author: usrbincc):
I'm not planning on making any more changes pre-review. Feel free to go ahead.

(Though it would have been fine even if you had started prior to 5483-v17-00 
and 5483-v17-01; the differences are not substantial)


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180257#comment-14180257
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Hm. Does that mean you've already got some feedback ready?

I'll be here on and off for a little while. Maybe we can get in one full 
iteration (or n?) today.


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180613#comment-14180613
 ] 

Ben Chan commented on CASSANDRA-5483:
-

I'm going to disappoint you and say that many of the reasons for the current 
state of the code are historical, or stem from code transformations intended to 
precisely maintain some aspect of the old behavior, without questioning the 
logic behind that particular behavior.

In other words, I've just been trying to add this one particular behavior 
(repair tracing) without otherwise changing program behavior. This is probably 
most obvious around the trace calls, where (hopefully) the log messages are 
unchanged, even when they're different from the trace messages. I really need 
to go back through that with a fine-tooth comb, because I think there were some 
places where I may have changed capitalization. Or worse.



Most of your comments were straightforward. Just need some clarification on 
these:

{quote}
DebuggableThreadPoolExecutor:

Do we intend to create w/Integer.MAX_VALUE as 'size' on this method?
{quote}

Yes, in order to preserve the behavior of the previous code (see my 2014-03-08 
comment and the discussion surrounding it).

{quote}
If so, name should reflect it like other methods in file.
{quote}

Just to confirm (though I'm mostly sure this is what you mean): do you mean 
that the function name should be changed to something like 
{{createCachedThreadpoolWithMaximumPoolSize}}?
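
For reference, a minimal JDK-only sketch of the kind of factory being discussed. The class name, method name, and 60-second keep-alive are assumptions for illustration, not the actual DebuggableThreadPoolExecutor code:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class CachedPoolSketch
{
    // Hypothetical factory name; it only illustrates the naming question above.
    static ThreadPoolExecutor createCachedThreadpoolWithMaxSize(int maxPoolSize)
    {
        // Zero core threads plus a SynchronousQueue means every task is handed
        // straight to a thread: an idle one if available, otherwise a new one,
        // up to maxPoolSize. Idle threads are reclaimed after 60 seconds.
        return new ThreadPoolExecutor(0, maxPoolSize,
                                      60L, TimeUnit.SECONDS,
                                      new SynchronousQueue<Runnable>());
    }

    public static void main(String[] args) throws Exception
    {
        ThreadPoolExecutor pool = createCachedThreadpoolWithMaxSize(Integer.MAX_VALUE);
        pool.submit(() -> System.out.println("task ran")).get();
        pool.shutdown();
    }
}
```

Passing Integer.MAX_VALUE makes the pool effectively unbounded, which is why a name that mentions the maximum size may be the clearer choice.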

{quote}
OutboundTcpConnection:

Why do we assume that if TraceType is null the type is QUERY?
{quote}

If an older server version ever initiates a trace, then traceType could be 
null. And if it *is* null, that means that it's a QUERY traceType (since that's 
the only kind of trace available in previous versions). If a newer server 
version ever sends a null traceType, that should probably be an error, though 
nothing like that is being done yet.
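
A minimal sketch of that fallback, assuming a hypothetical two-value enum and helper name (neither is taken from the patch):

```java
final class TraceTypeDefaultSketch
{
    // Hypothetical stand-in for the trace type enum discussed in this ticket.
    enum TraceType { QUERY, REPAIR }

    // A missing trace type can only have come from an older node, and older
    // nodes only ever trace queries, so null is read as QUERY.
    static TraceType effectiveType(TraceType fromWire)
    {
        return fromWire == null ? TraceType.QUERY : fromWire;
    }

    public static void main(String[] args)
    {
        System.out.println(effectiveType(null));             // prints QUERY
        System.out.println(effectiveType(TraceType.REPAIR)); // prints REPAIR
    }
}
```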

{quote}
StorageService:

Swallowing Throwable on line 2828 w/out any logging or comment as to why is 
unclear. Using exception handling as control flow is step back from the 
sameThreadExecutor error handling approach taken prior as we lose naming and 
intention information. Does this gain us something that I'm not aware of?
{quote}

Before saying anything else: yes, I probably should have commented the 
exception-swallowing.

I touched on this code change earlier, in the paragraph that begins "Note that 
I switched some code". In a nutshell: preserving the behavior of the previous 
code.

Whether I succeeded or not is a different issue, but at least let me run 
through my thought process so that you can help me find the holes in it. Here 
is a redacted version of the code:

{noformat}
// Previous version:
Futures.addCallback(allSessions, new FutureCallback<Object>()
{
    public void onSuccess(@Nullable Object result)
    {
        // onSuccess stuff ...
        repairComplete();
    }

    public void onFailure(Throwable t)
    {
        // onFailure stuff (empty, except for repairComplete) ...
        repairComplete();
    }

    private void repairComplete()
    {
        // repairComplete stuff ...
    }
}, MoreExecutors.sameThreadExecutor());
{noformat}

Here are my assumptions:
- {{// onSuccess stuff ...}} does not throw any exceptions out of its code 
block.
- {{// repairComplete stuff ...}} works the same no matter which thread it runs 
in (assuming it's called at the proper time, of course).
- The guava library used corresponds to the source code found at 
{{build/lib/sources/guava-16.0.1-sources.jar}}.

{noformat}
// Review version:
try
{
    allSessions.get();
    // onSuccess stuff ...
}
catch (Throwable t)
{
    // onFailure stuff (empty, except for repairComplete) ...
}
// repairComplete stuff ...
{noformat}

Since {{allSessions.get()}} is the only thing that can throw to the catch 
block, its success or failure should (almost, see below) be the only way to 
trigger the onSuccess and onFailure branches that correspond to the original 
code.

Intended difference:
- {{// repairComplete stuff ...}} should now run in the main repair thread, 
instead of in the thread of the last exiting subtask in allSessions.

Unintended differences:
- Futures#addCallback uses Uninterruptibles#getUninterruptibly instead of a 
plain Future#get, the main difference being that it will keep retrying 
Future#get even in the face of InterruptedException. This may be important, so 
I could do this also.
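
To make the unintended difference concrete, here is a JDK-only sketch of the review version with an uninterruptible wait swapped in. The retry loop approximates what Guava's Uninterruptibles#getUninterruptibly does; the class and method names, and the string outcome, are assumptions for illustration only:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class BlockingCompletionSketch
{
    // Retry Future#get on interruption, then restore the interrupt flag.
    static <T> T getUninterruptibly(Future<T> future) throws ExecutionException
    {
        boolean interrupted = false;
        try
        {
            while (true)
            {
                try
                {
                    return future.get();
                }
                catch (InterruptedException e)
                {
                    interrupted = true; // remember it and keep waiting
                }
            }
        }
        finally
        {
            if (interrupted)
                Thread.currentThread().interrupt();
        }
    }

    // Mirrors the review version: wait in the calling thread, branch on the
    // outcome, then run the completion step exactly once.
    static String awaitAndComplete(Future<?> allSessions)
    {
        String outcome;
        try
        {
            getUninterruptibly(allSessions);
            outcome = "success"; // onSuccess stuff would go here
        }
        catch (Throwable t)
        {
            outcome = "failure"; // onFailure stuff would go here
        }
        // repairComplete stuff runs here, in the caller's thread
        return outcome + ", repairComplete";
    }

    public static void main(String[] args)
    {
        CompletableFuture<Object> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("session failed"));
        System.out.println(awaitAndComplete(CompletableFuture.completedFuture("ok")));
        System.out.println(awaitAndComplete(failed));
    }
}
```

With a plain Future#get, an interrupt during the wait would land in the catch block and be treated like a failed repair; the loop above avoids that.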

{quote}
TraceState

Right now isDone is only called by the query thread but there's nothing in the 
method signature or comments of that method documenting the 
doubling-every-second-run algorithm it's using.
{quote}

The doubling behavior used to be contained in the query thread; isDone used 
to be a (more or less) simple wait with timeout.

Agree on needing comments. In the interests of not gratuitously rearranging 
code, I'll merely see about

[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-10-20 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:

Attachment: 5483-v17.patch

[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-20 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177629#comment-14177629
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Some nice changes to the JMX interface in the last 3 months; I no longer have 
to add lots of boilerplate function overloads.

{noformat}
read -r JMXGET <<E
/jmx_port/{p=\$2;} /binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}
E

ccm_nodetool() { local N="$1"; shift; $(ccm "$N" show | awk -F= "$JMXGET") "$@"; }

# git checkout 49833b9
W=https://issues.apache.org/jira/secure/attachment
for url in "$W/12675963/5483-v17.patch"; do
  { [ -e "$(basename "$url")" ] || curl -sO "$url"; } && git apply "$(basename "$url")"
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
ccm_nodetool node1 repair -tr
{noformat}



Issues encountered during testing:
- My old method for forcing a streaming repair (yukim's ccm clear suggestion) 
no longer seems to work, which means I can no longer test tracing of streaming. 
Is there something else I have to do now to force a streaming repair?
- 5483-16 and 5483-17 no longer filter traces based on traceType. Using my 
standard simple test, I count 182 traces, versus 61 from a test run from 
2014-06: about 200% more traces (from the query-tracing code paths), though I 
don't know how this percentage will change with longer repairs. Some 
representative extraneous traces:

{noformat}
Sending message to /127.0.0.2
Sending message to /127.0.0.3
Message received from /127.0.0.3
Processing response from /127.0.0.3
Message received from /127.0.0.2
Processing response from /127.0.0.2

Parsing SELECT * FROM system.sstable_activity WHERE keyspace_name=? and 
columnfamily_name=? and generation=?
Preparing statement
Executing single-partition query on sstable_activity
Acquiring sstable references
Merging memtable tombstones
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones
{noformat}


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-10-12 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168742#comment-14168742
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Okay, that covers all the questions that I had.

TODO list:
- Rebase to trunk (hopefully the last 3 months haven't made this too painful).
- Miscellaneous remaining cruft from the 5483-16 cleanup. Possibly some code 
simplification (low-hanging fruit only).
- Guard out-of-bounds-array access in TraceState#deserialize. The simplest 
thing I can think of is to have a NONE tracetype and return that for any 
unknown tracetype.
- Fix format string "Sending completed merkle tree to %s for %s.%s" being 
traced verbatim.
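
The guard in the third item could look roughly like the sketch below. The enum, its values, and the assumption that the trace type travels over the wire as an integer ordinal are all illustrative, not lifted from the patch:

```java
final class TraceTypeWireSketch
{
    // Hypothetical enum; NONE is the fallback proposed in the TODO list.
    enum TraceType { NONE, QUERY, REPAIR }

    private static final TraceType[] ALL = TraceType.values();

    // Bounds-check the ordinal read off the wire instead of indexing blindly.
    static TraceType deserialize(int wireOrdinal)
    {
        if (wireOrdinal < 0 || wireOrdinal >= ALL.length)
            return TraceType.NONE;
        return ALL[wireOrdinal];
    }

    public static void main(String[] args)
    {
        System.out.println(deserialize(2));  // prints REPAIR
        System.out.println(deserialize(99)); // prints NONE
    }
}
```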

Barring anything unforeseen, is that about it before it's good enough for a 
merge?

It may be a few days before I actually get around to this. I had some computer 
trouble recently, so I'm going to have to set up my build environment from 
scratch.


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-07-10 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057548#comment-14057548
 ] 

Ben Chan commented on CASSANDRA-5483:
-

A few things to note about 5483-16:

Just to emphasize: these are corner-cases and deoptimizations that may not ever 
matter in practice. Even if the decision is to not fix them, I wanted to note 
their existence.

*(1)*
If remote nodes have different config values for {{tracetype_repair_ttl}}, then 
remote traces will end up with a different ttl. Having the trace for a given 
session partially disappear may be surprising to the user.

The approach 5483-15 took was to use the ttl configured on the node being 
repaired.

*(2)*
I don't know whether this will ever arise in practice, but I thought I'd note: 
If server versions are ever mixed, then an unknown trace type could be sent 
over the wire, causing an out-of-bounds array access during deserialization.

The approach 5483-15 took was almost a no-op on the remote end (the tracing 
session was created, but nothing was traced). This is still not quite right. 
Better would be to log an error and not create the session at all.

It might still possibly make sense to create the session anyway in the case of 
A-B-C tracestate propagation, but that might be stretching things.
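
A minimal sketch of the "log an error and don't create the session" behaviour described above, assuming the trace type travels over the wire as an enum ordinal. The names {{TraceTypeCodec}}, {{TraceType}} and {{deserialize}} are hypothetical and are not taken from the patch.

```java
import java.util.Optional;

// Hypothetical names throughout; this is not the patch's actual code.
final class TraceTypeCodec
{
    enum TraceType { NONE, QUERY, REPAIR }

    private static final TraceType[] ALL = TraceType.values();

    /** Encode a trace type as its ordinal for the wire. */
    static int serialize(TraceType type)
    {
        return type.ordinal();
    }

    /** Decode a wire value, returning empty instead of indexing out of bounds. */
    static Optional<TraceType> deserialize(int wireValue)
    {
        if (wireValue < 0 || wireValue >= ALL.length)
        {
            // A newer (or older) peer sent a type this node does not know about.
            System.err.println("Unknown trace type " + wireValue + "; not creating a tracing session");
            return Optional.empty();
        }
        return Optional.of(ALL[wireValue]);
    }

    public static void main(String[] args)
    {
        System.out.println(deserialize(serialize(TraceType.REPAIR)));
        System.out.println(deserialize(7));
    }
}
```

The caller would only start a tracing session when the returned {{Optional}} is present, so a mixed-version cluster degrades to "no remote trace" rather than an exception during deserialization.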

*(3)*
Activity notification for every local repair trace starts a new exponential 
backoff polling cycle more often than necessary.

Of course, I can't say whether this extra resource usage will actually matter 
in practice (only practice can tell).



Okay, so what you meant was to change tracetype to be an actual Enum type (I 
can be a little dense sometimes, and code leaves far less room for ambiguity).

One thing I was thinking forward to with bitmasks was flexible subsetting of 
traces and notifications. Tracing may not ever need trace levels or flexible 
subsetting, but I thought I'd note the idea for future reference.

{noformat}
// pseudocode; please excuse the bad naming sense.
REPAIR_0 = 0b001 << MIN_SHIFT;
REPAIR_1 = 0b010 << MIN_SHIFT;
REPAIR_2 = 0b100 << MIN_SHIFT;

// read "zero and up", etc
REPAIR_0U = REPAIR_0 | REPAIR_1 | REPAIR_2;
REPAIR_1U = REPAIR_1 | REPAIR_2;

NOTIFY_0 = 0b001;
NOTIFY_1 = 0b010;
NOTIFY_2 = 0b100;
NOTIFY_0U = ...;
NOTIFY_1U = ...;

val = config.get_val();

trace(REPAIR_0, "only traced for REPAIR_0; superseded in the more verbose traces");
trace(REPAIR_0U, "always traced for repair");
trace(REPAIR_1U, "traced at REPAIR_1 and up");
trace(REPAIR_2 | val, "only at REPAIR_2, or during the tracetypes specified in val");
trace(REPAIR_1U & ~val, "only at REPAIR_1 and up, excluding the tracetypes specified in val");
trace(REPAIR_2 | REPAIR_0, "for some reason, we don't want to trace this at REPAIR_1");

// you could even have semi-orthogonal subsetting and levels
trace(REPAIR_0U | NOTIFY_1, "notify at NOTIFY_1 only");
trace(REPAIR_1U | NOTIFY_0U, "notify at all NOTIFY levels, but only when tracing REPAIR_1 and up");

awaitNotification(NOTIFY_1U); // notify me for NOTIFY_1 and up.
{noformat}

I believe log4j does something similar with what they call markers.

Of course, you can do this with Enum and EnumSet too (though more verbosely).
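
As a rough illustration of the Enum/EnumSet variant, here is a minimal sketch. The {{Level}} enum and the helper names {{andUp}} and {{shouldTrace}} are assumptions made for the example, not existing Cassandra API.

```java
import java.util.EnumSet;

// Hypothetical names; an EnumSet stands in for the bitmask of levels.
final class TraceLevels
{
    enum Level { REPAIR_0, REPAIR_1, REPAIR_2 }

    /** Equivalent of the REPAIR_nU masks: the given level and everything above it. */
    static EnumSet<Level> andUp(Level from)
    {
        return EnumSet.range(from, Level.REPAIR_2);
    }

    /** A message is traced when the configured level is one of the levels it was tagged with. */
    static boolean shouldTrace(EnumSet<Level> tags, Level configured)
    {
        return tags.contains(configured);
    }

    public static void main(String[] args)
    {
        // Same idea as trace(REPAIR_2 | REPAIR_0, ...): skip the middle level.
        EnumSet<Level> skipMiddle = EnumSet.of(Level.REPAIR_2, Level.REPAIR_0);
        System.out.println(shouldTrace(andUp(Level.REPAIR_1), Level.REPAIR_0));
        System.out.println(shouldTrace(skipMiddle, Level.REPAIR_1));
    }
}
```

Exclusion (the {{& ~val}} case) would map to {{removeAll}} on a copy of the set, which is where the extra verbosity comes from.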



Aside from lyubent's nit (and possibly the issues in section 1), are there any 
other issues that prevent this patch from being minimally useful?

Or even better, is there at least an alpha period where {{trunk}} is still 
allowed to make breaking changes to the api, protocol, etc? It's hard to foresee 
everything.



[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-06-12 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 5483-v15-02-Hook-up-exponential-backoff-functionality.patch
5483-v15-03-Exact-doubling-for-exponential-backoff.patch
5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch
5483-v15-05-Move-command-column-to-system_traces.sessions.patch


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-06-12 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030076#comment-14030076
 ] 

Ben Chan commented on CASSANDRA-5483:
-

{quote}
the exponential backoff is maintained, it's only defined more simply (as far as 
I could read you doubled the wait time every second timeout, which is what 
happens now also).
{quote}

Sorry, I was speed-reading the code; I saw {{wait();}} (i.e. with no timeout), 
so I immediately knew there couldn't be a timeout, exponential or otherwise.

I mostly ignored what wouldn't affect execution (stupid mental code optimizer).

If you have {{wait(wait);}} and reset {{wait}} (the {{long}}) to 
{{minWaitMillis}} in {{if (hasNotifications)}} then that should be about right. 
I've attached a patch that does this 
([^5483-v15-02-Hook-up-exponential-backoff-functionality.patch]).
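
A minimal sketch of the loop shape described here, assuming a single consumer per waiter: a timed {{wait}}, a reset to the minimum when a notification is seen, and exact doubling otherwise. The class and method names are hypothetical and the patch's actual code may differ.

```java
// Hypothetical single-consumer waiter; not the TraceState code from the patch.
final class ActivityWaiter
{
    private final long minWaitMillis;
    private final long maxWaitMillis;
    private long waitMillis;
    private boolean hasNotifications;

    ActivityWaiter(long minWaitMillis, long maxWaitMillis)
    {
        if (minWaitMillis <= 0 || maxWaitMillis < minWaitMillis)
            throw new IllegalArgumentException("need 0 < min <= max");
        this.minWaitMillis = minWaitMillis;
        this.maxWaitMillis = maxWaitMillis;
        this.waitMillis = minWaitMillis;
    }

    synchronized void notifyActivity()
    {
        hasNotifications = true; // sticky, so a consumer that is not waiting yet still sees it
        notifyAll();
    }

    /** Returns true if activity was signalled, false after a timeout. */
    synchronized boolean waitActivity()
    {
        try
        {
            if (!hasNotifications)
                wait(waitMillis); // timed wait; a bare wait() would never time out
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        if (hasNotifications)
        {
            hasNotifications = false;
            waitMillis = minWaitMillis; // activity seen: poll eagerly again
            return true;
        }
        // Timeout (or a spurious wakeup): back off by exact doubling, capped at the maximum.
        waitMillis = Math.min(waitMillis * 2, maxWaitMillis);
        return false;
    }

    synchronized long currentWaitMillis()
    {
        return waitMillis;
    }
}
```

Because the flag is sticky, a notification that arrives while the consumer is outside {{waitActivity}} is picked up on the next call rather than lost.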

I apologize for reading-code-while-distracted, thus causing me to go on about 
the exponential backoff. I'll try to learn from this experience.

---

I liked your exponential backoff calculation code; I took the liberty of 
tweaking it to double exactly in 
[^5483-v15-03-Exact-doubling-for-exponential-backoff.patch].

---

{quote}
As the code is defined now we explicitly create a repair session and spawn a 
specific thread to process it, so we have a guarantee there's only one thread. 
It doesn't matter if the consumer is in its method when notifyAll is called; if 
it isn't it will receive the notification as soon as it next enters the method.
{quote}

I agree with you.

When I said "only ever" before, I was talking about future code, not the code 
in its current state. Had I considered the alternate interpretation (an edge 
case in the current code), I would have clarified better, but there we are.

Just to reiterate, I'm reasonably cognizant that for the code in its current 
state, only a single thread is accessing the notification for a given 
TraceState, and how this allows for a simpler implementation.

All I was trying to say in my previous comment was that notification upon 
activity is a general capability, and in future code, should there ever be a 
use case for multiple threads accessing the same activity notification, then 
the implementation will have to be changed. (This also holds, should there ever 
be a use case that requires something other than exponential backoff.)
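
For reference, one possible shape for a multi-consumer variant is a monotonically increasing counter, where each consumer remembers the last value it saw. This is only a sketch under that assumption; {{ActivityCounter}} and its methods are hypothetical names and nothing like this is in the attached patches.

```java
// Hypothetical multi-consumer notification: consumers compare against a shared counter
// instead of clearing a shared flag, so one consumer cannot "eat" another's notification.
final class ActivityCounter
{
    private long sequence; // number of signals so far

    synchronized void signal()
    {
        sequence++;
        notifyAll();
    }

    /** Waits until the counter passes lastSeen or the timeout elapses; returns the current count. */
    synchronized long awaitNewerThan(long lastSeen, long timeoutMillis)
    {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try
        {
            while (sequence <= lastSeen)
            {
                long remaining = (deadline - System.nanoTime()) / 1_000_000L;
                if (remaining <= 0)
                    break; // wait(0) would block forever, so stop here
                wait(remaining);
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        return sequence;
    }
}
```

Each consumer passes the value returned by its previous call as {{lastSeen}}; a return value equal to {{lastSeen}} means the call timed out without new activity.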

Our differences turn out to be philosophical only, and I'm not all that 
attached to mine.

I've already accepted [^5483-v15.patch] and have built my fixes for (some of) 
the Jun 10 comments on top of it. I need some more clarification before I can 
be sure what is required in order to resolve the remaining issues.

---

Test code, modified and simplified because of the updated system_traces.* 
schemas (I may upload an updated {{ccm-repair-test}} if we have many more patch 
iterations).

{noformat}
read -r JMXGET <<E
/jmx_port/{p=\$2;} \
/binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}
E

ccm_nodetool() {
  local N=$1
  shift
  $(ccm $N show | awk -F= "$JMXGET") "$@"
}

# git checkout 85956ae
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12649785/5483-v15.patch \
  $W/12650175/5483-v15-02-Hook-up-exponential-backoff-functionality.patch \
  $W/12650174/5483-v15-03-Exact-doubling-for-exponential-backoff.patch \
  $W/12650173/5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch \
  $W/12650172/5483-v15-05-Move-command-column-to-system_traces.sessions.patch
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
ccm_nodetool node1 repair -tr
{noformat}



[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-06-11 Thread Ben Chan (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027905#comment-14027905
]

Ben Chan commented on CASSANDRA-5483:
-

I'm sorry I can't think this through very deeply at the moment, so please allow
a little slack if I say something incorrect. I'm writing this during a small
window of time (where I can't do anything else) in a crisis I'm dealing with.

bq. Why is command part of events instead of sessions?

It was a somewhat arbitrary design decision. I can move it over to sessions.
The only real reason was historical (see the Mar 3 comment); it was a
proof-of-concept that never got followed up upon until just now.

bq. Also: should use an enum internally. Logging as string representation is
fine.

Just to be clear, you mean Java code should work with an enum, and the actual
cql table column is fine as a string?

The code actually does use an enum (of sorts; not an Enum proper), the
traceType. The traceType is passed to Tracing#getCommandName to look up the
String for command name.

bq. It makes people grumpy when we break JMX signatures. Can we add a new
overload instead, preserving the old? This should cut down on some of the code
churn in StorageService as well.

I will admit that I didn't really consider the entire ecosystem of tools that
use JMX to control Cassandra (though note that I did mention the JMX api change
in a Mar 8 comment ... the server isn't going to work with old versions of
nodetool. And a newer nodetool still won't work with older server versions.).
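
A minimal sketch of the "add a new overload, preserve the old" approach, assuming the new capability is a single extra {{trace}} flag. {{RepairControl}} and {{startRepair}} are hypothetical stand-ins and are not the real StorageServiceMBean signatures.

```java
// Hypothetical stand-in for a JMX-exposed interface; not StorageServiceMBean.
final class RepairOverloads
{
    interface RepairControl
    {
        /** Old signature, kept so existing JMX clients continue to work. */
        int startRepair(String keyspace, boolean sequential);

        /** New overload carrying the extra option. */
        int startRepair(String keyspace, boolean sequential, boolean trace);
    }

    static final class Impl implements RepairControl
    {
        private int nextCommand = 1;
        boolean lastTraceFlag; // exposed only so the example can be checked

        @Override
        public int startRepair(String keyspace, boolean sequential)
        {
            // The old entry point delegates with tracing disabled.
            return startRepair(keyspace, sequential, false);
        }

        @Override
        public int startRepair(String keyspace, boolean sequential, boolean trace)
        {
            lastTraceFlag = trace;
            return nextCommand++; // pretend command number
        }
    }

    public static void main(String[] args)
    {
        Impl impl = new Impl();
        System.out.println(impl.startRepair("s1", true));
        System.out.println(impl.startRepair("s1", true, true));
    }
}
```

With this shape an older client calling the two-argument form still gets a repair, just without tracing.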

bq. It's a minor thing to get hung up on, but I'm not wild about all the work
needed to propagate TTLs around. Is it really super important to persist repair
traces much longer than queries? If so, what if we used a separate table and
just allowed users to modify the default ttl? (The original trace code predates
default TTLs, I think, or we would have made use of it there.)

I guess the question is how many different types of things (bootstrap, repair,
query, decommission, ... anything else?) might eventually end up being traced.
If n is small, then having n tables may be fine.

The logic was this (see Mar 1-3 discussion): Repair can take a long time, so 24
hours may be too short of a ttl.

I recall reading about problematic repairs taking several days, which wouldn't
mix well with a 24 hour ttl.

bq. Also have a nagging feeling that the notify/wait logic in StorageService +
TraceState is more complex than necessary.

If there is guaranteed to only be one consumer of notifications at a time, then
the updated v15 logic seems fine. But if there are ever two traces going on
(either of different or the same type; are you allowed to have two simultaneous
repairs of different keyspaces?) which require update notifications, then there
could be dropped notifications. The problem (I believe) is that all consumers
may not be in a wait state at the moment when notifyAll is signalled. This
means a notification could be missed, right? I'm not experienced in Java
concurrency, and this isn't the best time for me to slowly think things
through, so it's quite possible I'm wrong here.

However, if you can be completely sure there will never be concurrent repair
traces happening on the same node, or any other trace types (whatever types are
added in the future) that require update notifications in order to implement
on-the-fly reporting, then that issue is moot, and v15 should work fine, as far
as my cursory inspection goes.

bq. I should note I've made no attempt to corroborate this behaviour is
sensible; I've only simplified it.

Any feedback would be welcome. As I've said before, heuristics are messy. I
talked about the reasoning behind my design decisions, and a possibility for an
alternate implementation (with attendant tradeoffs) in a Mar 17 comment. I
honestly thought I'd get more comments on it at the time, but it's possible the
patch had already gotten into TL;DR territory even then.

---

Okay, my short break from reality is over. Time to panic.


[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-06-11 Thread Ben Chan (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027905#comment-14027905
]

Ben Chan edited comment on CASSANDRA-5483 at 6/11/14 5:37 PM:
--

bq. Why is command part of events instead of sessions?

bq. Also: should use an enum internally. Logging as string representation is
fine.

Just to be clear, you mean Java code should work with an enum, and the actual
cql table column is fine as a string?

The code actually does use an enum (of sorts; not an Enum proper), the
traceType. The traceType is passed to Tracing#getCommandName to look up the
String for command name.

bq. It makes people grumpy when we break JMX signatures. Can we add a new
overload instead, preserving the old? This should cut down on some of the code
churn in StorageService as well.

The logic was this (see Mar 1-3 discussion): Repair can take a long time, so 24
hours may be too short of a ttl.

I recall reading about problematic repairs taking several days, which wouldn't
mix well with a 24 hour ttl.

bq. Also have a nagging feeling that the notify/wait logic in StorageService +
TraceState is more complex than necessary.

If there is guaranteed to only ever be one consumer of notifications at a time,
then the updated v15 logic seems fine. But if there are ever two threads
polling the same TraceState, then there could be dropped notifications. The
problem (I believe) is that all consumers may not be in a wait state at the
moment when notifyAll is signalled. This means a notification could be missed,
right? I'm not experienced in Java concurrency, and this isn't the best time
for me to slowly think things through, so it's quite possible I'm wrong here.

But it does seem reasonable there will only ever be one polling thread for any
given tracing session, so the v15 code should work fine in that respect, as far
as my cursory inspection goes.

Note, however, that the polling in this case is a heuristic only. A
notification means it's *likely* that an external trace happened somewhere
around this time, plus or minus (as far as I know, there is no way in Cassandra
to be notified of cf updates). So the actual trace may only arrive *after* the
notification, and for two notifications ~maxWait seconds apart, your logging of
events might be maxWait seconds late:

{noformat}
time  action                result
----  --------------------  -------------------------
   0  receive notification  no events found
  10  event A
1000  receive notification  sendNotification(event A)
{noformat}

This is why I had an exponential backoff: I wanted to poll with high
frequency for a likely event, polling less and less often as the notification
recedes into the past. There are, of course, endless tweaks possible to this
basic algorithm. It depends upon what you can assume about the likely time
distribution of the events hinted at by the notification.

{noformat}
time  action                result
----  --------------------  -------------------------
   0  receive notification  no events found
   2  poll                  no events found
   4  poll                  no events found
   8  poll                  no events found
  10  event A
  16  poll                  sendNotification(event A)
  32  poll                  no events found
1000  receive notification  no events found
{noformat}
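
The schedule in the table can be reproduced with a small calculation. This is a sketch that assumes poll times double exactly from the first poll; {{BackoffSchedule}} and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that computes when polls happen after a notification at time 0.
final class BackoffSchedule
{
    /** Poll times firstPoll, 2*firstPoll, 4*firstPoll, ... up to and including lastPoll. */
    static List<Long> pollTimes(long firstPoll, long lastPoll)
    {
        if (firstPoll <= 0)
            throw new IllegalArgumentException("firstPoll must be positive");
        List<Long> times = new ArrayList<>();
        // The t > 0 check stops the loop if doubling ever overflows.
        for (long t = firstPoll; t > 0 && t <= lastPoll; t *= 2)
            times.add(t);
        return times;
    }

    /** The first poll that can observe an event written at eventTime, or -1 if none does. */
    static long firstPollAtOrAfter(List<Long> polls, long eventTime)
    {
        for (long t : polls)
            if (t >= eventTime)
                return t;
        return -1;
    }

    public static void main(String[] args)
    {
        List<Long> polls = pollTimes(2, 32);
        System.out.println(polls);
        System.out.println(firstPollAtOrAfter(polls, 10));
    }
}
```

With polls at 2, 4, 8, 16 and 32, an event written at time 10 is reported at the poll at 16, so the delay is bounded by the current backoff interval rather than by the gap to the next notification.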

bq. I should note I've made no attempt to corroborate this behaviour is
sensible; I've only simplified it.

Any feedback would be welcome. As I've said before, heuristics are messy.

[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-06-06 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 5483-v14-01-squashed.patch


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-06-06 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14020510#comment-14020510
 ] 

Ben Chan commented on CASSANDRA-5483:
-

{noformat}
# git checkout a61ef51
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12648746/5483-v14-01-squashed.patch
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}

No changes, just a (mostly) straightforward rebase onto a61ef51 (recent trunk). 
Some issues only show up under actual use, so I fully expect this will need 
some followup.

An idea for the timely tracing of exceptions: you might be able to kludge most 
of it by modifying WrappedRunnable to trace exceptions. I haven't looked too 
closely at it, though.
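
A rough sketch of that kludge, assuming a wrapper with a {{runMayThrow}}-style hook. {{TracingRunnable}} and its in-memory {{TRACES}} list are hypothetical stand-ins; this is not Cassandra's WrappedRunnable, and the real trace sink would be the tracing subsystem.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical wrapper: records an exception as a trace event before rethrowing it.
abstract class TracingRunnable implements Runnable
{
    /** Stand-in for the trace sink, so the example is self-contained. */
    static final List<String> TRACES = new CopyOnWriteArrayList<>();

    protected abstract void runMayThrow() throws Exception;

    @Override
    public final void run()
    {
        try
        {
            runMayThrow();
        }
        catch (Exception e)
        {
            // Trace at the point of failure, so the event is timely.
            TRACES.add("Exception in " + Thread.currentThread().getName() + ": " + e);
            throw new RuntimeException(e);
        }
    }
}
```

Any task that already extends such a wrapper would get exception traces without further changes, which is what makes it a kludge rather than a design.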



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-05-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998924#comment-13998924
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 5/22/14 12:01 PM:
---

May 14 formatting changes in 
[^5483-v13-608fb03-May-14-trace-formatting-changes.patch] (based off of commit 
608fb03).

{quote}
I think the session log messages are still confusing, especially since we use 
the same term for repairing a subrange and streaming data.
{quote}

Currently the "session" terminology is baked into the source code, in 
{{StreamSession.java}} and {{RepairSession.java}}. If the messages are changed 
to reflect different terminology, hopefully the source code can eventually be 
changed to match (fewer special cases to remember). Perhaps the best thing is 
to always qualify them, e.g. "stream session" and "repair session"?

{quote}
I don't actually see the session uuid being used in the logs except at 
start/finish.
{quote}

Sorry, that was another inadvertent mixing of nodetool messages and trace 
output. {{\[2014-05-13 23:49:52,283] Repair session 
cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range 
(3074457345618258602,-9223372036854775808] finished}} is not a trace, but a 
separate (pre-patch) sendNotification in {{StorageService.java}}. This message 
(and some of the error messages, I think) is redundant when combined with trace 
output. It should have been either one or the other, not both. In the trace 
proper, the session UUID only shows up at the start.

But note: not all nodetool messages are rendered redundant by trace output. 
Since we can't just suppress all non-trace sendNotification, how can we 
unambiguously tell nodetool trace output from normal sendNotification messages? 
I'm currently leaning towards just marking all sendNotification trace output 
with a {{TRACE:}} tag.

The repair session UUIDs used to be prepended to everything, but were removed 
in [^5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch]. Without 
them, things are less verbose, but it's sometimes hard to unambiguously follow 
traces for concurrent repair sessions. To make the point clearer, I've marked 
each sub-task graphically in the nodetool trace output below (I've 
cross-checked this with the logs, which do retain the UUIDs). If you cover up 
the left side, it's harder to figure out which trace goes with which sub-task. 
Real-world repair traces will probably be even more confusing.

Note: indentation here does not denote nesting; the column roughly indicates 
task identity, though I reuse columns when it's not ambiguous.

{noformat}
1 [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 ranges for s1.users (seq=true, full=true)
  x   [2014-05-15 11:31:37,922] Syncing range (-3074457345618258603,3074457345618258602]
  x   [2014-05-15 11:31:38,108] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:39,953] Received merkle tree for users from /127.0.0.2
  x   [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:41,279] Received merkle tree for users from /127.0.0.3
  x   [2014-05-15 11:31:42,632] Received merkle tree for users from /127.0.0.1
x [2014-05-15 11:31:42,671] Syncing range (-9223372036854775808,-3074457345618258603]
x [2014-05-15 11:31:42,766] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
  x   [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
  x   [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
  x   [2014-05-15 11:31:43,084] Completed sync of range (-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:43,422] Received merkle tree for users from /127.0.0.2
x [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:44,637] Received merkle tree for users from /127.0.0.3
x [2014-05-15 11:31:45,474] Received merkle tree for users from /127.0.0.1
  x   [2014-05-15 11:31:45,494] Syncing range (3074457345618258602,-9223372036854775808]
x [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
x [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
x [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
x [2014-05-15 11:31:45,564] Completed sync of range

[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-05-22 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005853#comment-14005853
 ] 

Ben Chan commented on CASSANDRA-5483:
-

{quote}
Is that because the traces are asynchronous? Because I think session 2 only 
starts after session 1 finishes.
{quote}

Here's my high level understanding.

* Repair commands (#1, #2, etc.) run serially.
* Each repair session ("Syncing range ...") is technically concurrent, since 
each is submitted to a ThreadPoolExecutor.
** However, differencing is serialized, so if there is no streaming going on, 
you won't see very much overlap between the sessions, except at the beginning 
and end (which is exactly what we see with these simple tests).
** Conversely, this means you will see much more interleaving when heavy 
streaming is going on.

So at the very least, it might be good to eventually disambiguate the streaming 
portion.
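
To make the concurrency point concrete, here is a minimal toy model (not Cassandra code). It assumes each session signals the end of its differencing phase through a latch, and that the submitting loop waits on that latch before submitting the next session. The class name {{SerializedDifferencingSketch}}, the method {{run}}, and the event strings are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: class, method, and event names are hypothetical.
public class SerializedDifferencingSketch
{
    /** Runs the given number of toy sessions and returns events in recorded order. */
    public static List<String> run(int sessions)
    {
        List<String> events = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(sessions);
        List<Future<?>> futures = new ArrayList<>();
        try
        {
            for (int i = 1; i <= sessions; i++)
            {
                final int id = i;
                final CountDownLatch differenced = new CountDownLatch(1);
                futures.add(pool.submit(() -> {
                    events.add("session " + id + ": differencing done");
                    differenced.countDown();
                    // Anything after this point (e.g. streaming) can overlap later sessions.
                    events.add("session " + id + ": completed");
                }));
                // This wait is what serializes the differencing part.
                differenced.await();
            }
            for (Future<?> f : futures)
                f.get();
        }
        catch (InterruptedException | ExecutionException e)
        {
            throw new RuntimeException(e);
        }
        finally
        {
            pool.shutdown();
        }
        return new ArrayList<>(events);
    }
}
```

In this model the "differencing done" events always appear in session order, while a "completed" event may land after the next session has already started, which is the kind of overlap at session boundaries described above.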

{quote}
The easiest thing would be to make them non-redundant. Can we make the tracing 
extra detail on top of the normal ones instead of competing with them?
{quote}

I think it may be a conceptual block on my part. I tend to think of traces as a 
kind of profiling mechanism.

* Most of the sendNotification calls in StorageService#createRepairTask consist 
of reporting any errors from the results of RepairFuture objects. So the timing 
on those is not really that useful for profiling. They're not really what I'd 
usually think of as a trace.
* Some are request validation reporting before the repair proper even starts.
* The rest are informational sendNotification messages which are redundant when 
tracing is active (this is the easy case).

In pseudocode:

{noformat}
if (some error #1 in repair request)
  sendNotification("NO #1!");
if (some error #2 in repair request)
  sendNotification("NO #2!");
for (r : ranges)
{
  f = something.submitRepairSession(new RepairSession(r));
  futures.add(f);
  try
  {
    // this serializes the differencing part.
    f.waitForDifferencing();
  }
  catch (SomeException)
  {
    // handle, sendNotification
  }
}

try
{
  for (f : futures)
  {
    r = f.get();
    sendNotification("done: %s", r);
  }
}
catch (ExecutionException ee)
{
  // handle, sendNotification
}
catch (Exception e)
{
  // handle, sendNotification
}
{noformat}

The main point being that I can't be sure that every single interesting 
exception is caught and traced in the thread where it's thrown, then rethrown. 
Most likely, this is not the case, and some exceptions are only reported at the 
StorageService#createRepairTask level. I believe most \(?) cases are already 
caught and traced, though.

So after going through all that, I'm thinking that the easiest thing is to just 
accept the possibility of redundancy and delayed reporting, and just trace all 
sendNotification in StorageService#createRepairTask (unless it's demonstrably 
redundant, or already being traced through some other mechanism).



[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-05-16 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 5483-v13-608fb03-May-14-trace-formatting-changes.patch

May 14 formatting changes in 
[^5483-v13-608fb03-May-14-trace-formatting-changes.patch] (based on 
commit 608fb03).

{quote}
I think the session log messages are still confusing, especially since we use 
the same term for repairing a subrange and streaming data.
{quote}

Currently the session terminology is baked into the source code, in 
{{StreamSession.java}} and {{RepairSession.java}}. If the messages are changed 
to reflect different terminology, hopefully the source code can eventually be 
changed to match (fewer special cases to remember). Perhaps the best thing is 
to always qualify them, e.g. "stream session" and "repair session"?

{quote}
I don't actually see the session uuid being used in the logs except at 
start/finish.
{quote}

Sorry, that was another inadvertent mixing of nodetool messages and trace 
output. {{\[2014-05-13 23:49:52,283] Repair session 
cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range 
(3074457345618258602,-9223372036854775808] finished}} is not a trace, but a 
separate (pre-patch) sendNotification in {{StorageService.java}}. This message 
(and some of the error messages, I think) is redundant when combined with trace 
output. It should have been either one or the other, not both. In the trace 
proper, the session UUID only shows up at the start.

But note: not all nodetool messages are rendered redundant by trace output. 
Since we can't just suppress all non-trace sendNotification, how can we 
unambiguously tell nodetool trace output from normal sendNotification messages? 
I'm currently leaning towards just marking all sendNotification trace output 
with a {{TRACE:}} tag.
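
A rough sketch of what the tagging idea could look like, purely for illustration. The class {{TraceTagSketch}} and its methods are hypothetical and are not part of the patch or of nodetool; only the {{TRACE:}} prefix itself comes from the paragraph above.

```java
// Hypothetical helper; only the "TRACE:" prefix comes from the discussion.
public class TraceTagSketch
{
    static final String TRACE_TAG = "TRACE: ";

    /** Marks a message as trace output before it is sent as a notification. */
    public static String tagTrace(String message)
    {
        return TRACE_TAG + message;
    }

    /** Lets the receiving side tell trace output from ordinary notifications. */
    public static boolean isTrace(String notification)
    {
        return notification.startsWith(TRACE_TAG);
    }

    /** Returns the message without the tag, or the input unchanged if untagged. */
    public static String stripTag(String notification)
    {
        return isTrace(notification) ? notification.substring(TRACE_TAG.length()) : notification;
    }
}
```

With a prefix like this the receiving side can filter or reformat trace lines without suppressing the normal notifications that are still needed.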

The repair session UUIDs used to be prepended to everything, but were removed 
in [^5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch]. Without 
them, things are less verbose, but it's sometimes hard to unambiguously follow 
traces for concurrent repair sessions. To make the point clearer, I've marked 
each sub-task graphically in the nodetool trace output below (I've 
cross-checked this with the logs, which do retain the UUIDs). If you cover up 
the left side, it's harder to figure out which trace goes with which sub-task. 
Real-world repair traces will probably be even more confusing.

Note: indentation here does not denote nesting; the column roughly indicates 
task identity, though I reuse columns when it's not ambiguous.

{noformat}
1 [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 
ranges for s1.users (seq=true, full=true)
  x   [2014-05-15 11:31:37,922] Syncing range 
(-3074457345618258603,3074457345618258602]
  x   [2014-05-15 11:31:38,108] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:39,953] Received merkle tree for users from 
/127.0.0.2
  x   [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:41,279] Received merkle tree for users from 
/127.0.0.3
  x   [2014-05-15 11:31:42,632] Received merkle tree for users from 
/127.0.0.1
x [2014-05-15 11:31:42,671] Syncing range 
(-9223372036854775808,-3074457345618258603]
x [2014-05-15 11:31:42,766] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
  x   [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for users
  x   [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for users
  x   [2014-05-15 11:31:43,084] Completed sync of range 
(-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
x [2014-05-15 11:31:43,422] Received merkle tree for users from 
/127.0.0.2
x [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
x [2014-05-15 11:31:44,637] Received merkle tree for users from 
/127.0.0.3
x [2014-05-15 11:31:45,474] Received merkle tree for users from 
/127.0.0.1
  x   [2014-05-15 11:31:45,494] Syncing range 
(3074457345618258602,-9223372036854775808]
x [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for users
x [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for users
x [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
x [2014-05-15 11:31:45,564] Completed sync of range

[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-05-15 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997265#comment-13997265
 ] 

Ben Chan commented on CASSANDRA-5483:
-

I seem to be having problems with JIRA email notifications. May 12 arrived 
fine, May 7 never arrived, May 8 arrived on May 13, but was dated May 10. 
Moving on...

{quote}
Just skip system keyspace entirely and save the logspam (use Keyspace.nonSystem 
instead of Keyspace.all)
{quote}

This patch has grown enough parts to be a little unwieldy. Just to be clear, 
the output from [https://gist.github.com/lyubent/bfc133fe92ef1afb9dd4] is the 
verbose output from {{nodetool}}, which means there is some extra output aside 
from the traces themselves. (TODO to self: I need to make the nodetool verbose 
output optional.) That particular message comes from 
{{src/java/org/apache/cassandra/tools/NodeProbe.java}}, in a part of the code 
untouched by this patch. I can go ahead and nuke that particular message for 
the system keyspace.

{quote}
How does {{Endpoints /127.0.0.2 and /127.0.0.1 are consistent for events}} 
scale up to more replicas? Should we switch to using {{\[..]}} notation instead?
{quote}

{{n * (n - 1) / 2}} pairwise differences are calculated for {{n}} replicas, so 
there are up to {{n * (n - 1) / 2}} "are consistent" messages. I haven't dug 
deep enough into the code to be certain, but on the face of it, it seems like 
there should be some (possibly not-simple) way to reduce this to 
{{O(n * log\(n))}}. Enough speculation, though.
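
As a quick sanity check on the message count, here is a small sketch that enumerates unordered endpoint pairs, assuming one comparison (and so at most one "consistent" message) per pair. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: enumerates each unordered pair of endpoints once.
public class PairwiseDifferencesSketch
{
    public static List<String> pairs(List<String> endpoints)
    {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < endpoints.size() - 1; i++)
            for (int j = i + 1; j < endpoints.size(); j++)
                result.add(endpoints.get(i) + " vs " + endpoints.get(j));
        return result;
    }
}
```

For three replicas this yields three pairs, and for six replicas fifteen, i.e. {{n * (n - 1) / 2}} when each unordered pair is compared once.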

One edge case for the proposed notation would be a consistency partition:

{noformat}
A == B == C
A != D
D == E == F

=

# We need a separate message for each partition.
Endpoints [A, B, C] are consistent for events
Endpoints [D, E, F] are consistent for events
{noformat}

Even with the edge case, it seems messy, but doable. You do lose trace timing 
information on the calculation of individual differences (the consistent ones, 
at least). On the other hand, comparing matching merkle trees should be a 
consistently fast operation, so you're probably not missing out on too much 
information.
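
One possible way to build the per-partition messages is to merge endpoints into groups as consistent pairs are found. The sketch below uses a simple union-find over endpoint names; it assumes "consistent" is transitive (equal merkle trees), and every name in it ({{ConsistencyGroupsSketch}}, {{groups}}, {{message}}) is hypothetical rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: groups endpoints into consistency partitions.
public class ConsistencyGroupsSketch
{
    public static List<List<String>> groups(List<String> endpoints, List<String[]> consistentPairs)
    {
        // Each endpoint starts as the root of its own group.
        Map<String, String> parent = new HashMap<>();
        for (String e : endpoints)
            parent.put(e, e);
        // Merge the groups of every pair found to be consistent.
        for (String[] p : consistentPairs)
            parent.put(find(parent, p[0]), find(parent, p[1]));
        // Collect members by root, keeping first-seen order.
        Map<String, List<String>> byRoot = new LinkedHashMap<>();
        for (String e : endpoints)
            byRoot.computeIfAbsent(find(parent, e), k -> new ArrayList<>()).add(e);
        return new ArrayList<>(byRoot.values());
    }

    private static String find(Map<String, String> parent, String x)
    {
        while (!parent.get(x).equals(x))
            x = parent.get(x);
        return x;
    }

    /** Formats one message per partition, in the [..] notation. */
    public static String message(List<String> group, String table)
    {
        return "Endpoints " + group + " are consistent for " + table;
    }
}
```

Feeding it the pairs A-B, B-C, D-E, E-F for endpoints A through F gives two groups, which would be reported as one message each.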

{quote}
I'm a little lost in the commands and sessions, e.g. does {{\[2014-05-08 
23:27:45,368] Session completed successfully}} refer to session 
3617e3f0-d6ef-11e3-a493-7d438369d7fc or 36a49390-d6ef-11e3-a493-7d438369d7fc? 
Is there exactly one session per command? If so let's merge the starting 
repair command + new session output, and the completed + finished.
{quote}

Each repair command seems to consist of multiple repair sessions (one per 
range). The sessions go semi-sequentially; there's a semi-random overlap 
between the end of one session and the start of another, like so (using small 
integers instead of UUIDs, and some labels on the left for readability):

{noformat}
[command #1   ] Starting repair command #1
[command #1, session 1] New session 1 will sync range ...
[command #1, session 1] Requesting merkle tree for ...
[command #1, session 1] Received merkle tree for ...
[command #1, session 2] New session 2 will sync range ...
[command #1, session 2] Requesting merkle tree for ...
[command #1, session 1] Endpoints ... consistent.
[command #1, session 1] Session 1 completed successfully
[command #1, session 2] Received merkle tree for ...
[command #1, session 2] Endpoints ... consistent.
[command #1, session 3] New session 3 will sync range ...
[command #1, session 2] Session 2 completed successfully
[command #1, session 3] Requesting merkle tree for ...
[command #1, session 3] Received merkle tree for ...
[command #1, session 3] Endpoints ... consistent.
[command #1, session 3] Session 3 completed successfully
[command #1   ] Repair command #1 finished
{noformat}

Most of the time it's obvious from context, but during that overlap, having the 
repair session UUID helps to disambiguate. I suspect the overlap is even 
greater (and more confusing) when you have heavy streaming.
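
To illustrate why a per-line session identifier helps, here is a small sketch that splits interleaved lines back into per-session lists. It assumes the bracketed label format from the example above, which I made up for readability and which is not the real trace format. The class and method names are hypothetical too.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: the label format is the one used in the example above.
public class TraceDemuxSketch
{
    private static final Pattern LABEL =
        Pattern.compile("^\\[command #(\\d+)(?:, session (\\d+))?\\s*\\] (.*)$");

    /** Groups messages by session label; command-level lines go under "command". */
    public static Map<String, List<String>> bySession(List<String> lines)
    {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String line : lines)
        {
            Matcher m = LABEL.matcher(line);
            if (!m.matches())
                continue; // Unlabeled lines cannot be attributed to a session.
            String key = m.group(2) == null ? "command" : "session " + m.group(2);
            out.computeIfAbsent(key, k -> new ArrayList<>()).add(m.group(3));
        }
        return out;
    }
}
```

Without the label (or a UUID playing the same role), the lines in the overlap region cannot be attributed this way.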

{quote}
Why do we log Repair command #1 finished with no merkle trees requested for 
db.tweet? Is it because all sstables are already repaired? If so we should log 
that.
{quote}

I've never encountered a trace like that in my testing. I always seem to get 
merkle trees exchanged (see the log below), even if no streaming is needed. I'm 
hoping lyubent can provide enough information for me to be able to recreate 
this situation locally.

{quote}
Does this actually show any streaming? If so I'm missing it.
{quote}

lyubent's sample run didn't need streaming, so there was no streaming to trace. 
Here's how I usually test streaming (using [^ccm-repair-test] and yukim's method):

{noformat}
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}

Note that this sample run uses the codebase as of 
[^5483-v12-02-cassandra-yaml-ttl-doc.patch]; I haven't got around to doing the 
May 12 changes yet.

I should also warn you (if recent

[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-05-02 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 5483-v11-01-squashed.patch

 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, 
 5483-v10-17-minor-bugfixes-and-changes.patch, 
 5483-v10-rebased-and-squashed-471f5cc.patch, 5483-v11-01-squashed.patch, 
 ccm-repair-test, cqlsh-left-justify-text-columns.patch, 
 prerepair-vs-postbuggedrepair.diff, test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results like query tracing 
 stores traces to system keyspace. With it, you don't have to lookup each log 
 file to see what was the status and how it performed the repair you invoked. 
 Instead, you can query the repair log with session ID to see the state and 
 stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-05-02 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987979#comment-13987979
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Patch updated to work with current trunk (08d5f26).

{noformat}
# git checkout 08d5f26
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12643077/5483-v11-01-squashed.patch
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}



[jira] [Commented] (CASSANDRA-7000) Assertion in SSTableReader during repair.

2014-04-11 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966551#comment-13966551
 ] 

Ben Chan commented on CASSANDRA-7000:
-

Just to confirm, repair works fine with
{noformat}
# current trunk
git checkout 471f5cc34c99
git apply 7000-2.1-v2.txt 7000.supplement.txt
{noformat}

I think SSTableReader#close as it currently stands still doesn't quite make 
sense. But since it isn't actually used anywhere (post-patch), it may be easier 
to just slap a TODO on it.

{noformat}
// Or how about this? Works if the last reference was released, and fails in
// tidy() otherwise.
public void close()
{
    references.decrementAndGet();
    tidy(false);
}
{noformat}
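
To spell out the reference-count behavior I have in mind, here is a self-contained toy class. It is not SSTableReader and does not use its API: {{RefCountedSketch}} and its methods are made up, and the exception stands in for the assertion in tidy(). The one thing it takes from the ticket is that the count starts at 1.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy class; it is not Cassandra's SSTableReader.
public class RefCountedSketch
{
    // Starts at 1: the creator holds the initial reference.
    private final AtomicInteger references = new AtomicInteger(1);
    private volatile boolean tidied = false;

    public void acquire()
    {
        references.incrementAndGet();
    }

    public void release()
    {
        if (references.decrementAndGet() == 0)
            tidy();
    }

    // Drops the creator's reference; succeeds only if that was the last one.
    public void close()
    {
        references.decrementAndGet();
        tidy();
    }

    private void tidy()
    {
        // Stands in for the reference-count assertion discussed in the ticket.
        if (references.get() != 0)
            throw new IllegalStateException("still referenced: " + references.get());
        tidied = true;
    }

    public boolean isTidied()
    {
        return tidied;
    }
}
```

In this toy version, close() on a freshly created object works, and close() while someone else still holds a reference fails in tidy(), which matches the "works if the last reference was released, fails otherwise" behavior suggested above.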


 Assertion in SSTableReader during repair.
 -

 Key: CASSANDRA-7000
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7000
 Project: Cassandra
  Issue Type: Bug
Reporter: Ben Chan
Assignee: Ben Chan
 Attachments: 7000-2.1-v2.txt, 7000.supplement.txt, 
 sstablereader-assertion-bisect-helper, 
 sstablereader-assertion-bisect-helper-v2, sstablereader-assertion.patch


 I ran a {{git bisect run}} using the attached bisect script. Repro code:
 {noformat}
 # 5dfe241: trunk as of my git bisect run
 # 345772d: empirically determined good commit.
 git bisect start 5dfe241 345772d
 git bisect run ./sstablereader-assertion-bisect-helper-v2
 {noformat}
 The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).
 Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
 5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
 the reference count is initialized to 1, a SSTableReader#close() was always 
 guaranteed to either throw an AssertionError or to be a second call to 
 SSTableReader#tidy() on the same SSTableReader.
 The attached patch chooses an in-between behavior. It requires the reference 
 count to match the initialization value of 1 for SSTableReader#close(), and 
 the same behavior as 5ebadc1 otherwise.
 This allows repair to finish successfully, but I'm not 100% certain what the 
 desired behavior is for SSTableReader#close(). Should it close without regard 
 to reference count, as it did pre-5ebadc1?
 Edit: accidentally uploaded a flawed version of 
 {{sstablereader-assertion-bisect-helper}} (doesn't work out-of-the-box with 
 {{git bisect}}).




[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-04-11 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 5483-v10-rebased-and-squashed-471f5cc.patch
5483-v10-17-minor-bugfixes-and-changes.patch


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-04-11 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966778#comment-13966778
 ] 

Ben Chan commented on CASSANDRA-5483:
-

I made some additional changes. Everything is included in 
[^5483-v10-rebased-and-squashed-471f5cc.patch], but I attached 
[^5483-v10-17-minor-bugfixes-and-changes.patch] to make it more convenient 
to review.

Repair fails without 
[7000-2.1-v2.txt|https://issues.apache.org/jira/secure/attachment/12639633/7000-2.1-v2.txt]
 so I've included that patch in the test code.

Overview:
 * Limit exponential backoff.
 * Handle the case where traceType is negative.
 * Reimplement log2 in BitUtil style.
 * Forgot to add a trace parameter.
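
For readers who don't want to open the patch, the first and third items are roughly of the following shape. This is a simplified sketch under my own naming, not the code in the patch: {{BackoffSketch}}, {{log2Floor}}, and {{nextDelay}} are hypothetical, and the leading-zeros approach is just one common bit-twiddling way to get an integer log2.

```java
// Simplified sketch; names and exact behavior are assumptions, not the patch.
public class BackoffSketch
{
    /** Floor of log2(x) for positive x, via the position of the highest set bit. */
    public static int log2Floor(long x)
    {
        if (x <= 0)
            throw new IllegalArgumentException("x must be positive");
        return 63 - Long.numberOfLeadingZeros(x);
    }

    /** Doubles the delay, but never returns more than max (and avoids overflow). */
    public static long nextDelay(long current, long max)
    {
        return current > max / 2 ? max : current * 2;
    }
}
```

Comparing against {{max / 2}} before doubling keeps the multiplication from overflowing when the cap is large.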

{noformat}
# rebased against trunk @ 471f5cc
# git checkout 471f5cc
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12639821/5483-v10-rebased-and-squashed-471f5cc.patch \
  $W/12639633/7000-2.1-v2.txt
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-04-11 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966778#comment-13966778
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 4/11/14 5:42 PM:
--

I made some additional changes. Everything is included in 
[^5483-v10-rebased-and-squashed-471f5cc.patch], but I attached 
[^5483-v10-17-minor-bugfixes-and-changes.patch] to make it more convenient 
to review.

-Repair fails without 
[7000-2.1-v2.txt|https://issues.apache.org/jira/secure/attachment/12639633/7000-2.1-v2.txt]
 so I've included that patch in the test code.-

Overview:
 * Limit exponential backoff.
 * Handle the case where traceType is negative.
 * Reimplement log2 in BitUtil style.
 * Forgot to add a trace parameter.

{noformat}
# rebased against trunk @ 471f5cc, tested against cbb3c8f
# git checkout cbb3c8f
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12639821/5483-v10-rebased-and-squashed-471f5cc.patch
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}

Edit: 7000 landed.





 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, 
 5483-v10-17-minor-bugfixes-and-changes.patch, 
 5483-v10-rebased-and-squashed-471f5cc.patch, ccm-repair-test, 
 cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, 
 test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results the way query tracing 
 stores traces in the system keyspace. With it, you don't have to look up each log 
 file to see what the status was and how the repair you invoked performed. 
 Instead, you can query the repair log with the session ID to see the state and 
 stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (CASSANDRA-7000) Assertion in SSTableReader during repair.

2014-04-08 Thread Ben Chan (JIRA)

Ben Chan created CASSANDRA-7000:
---

 Summary: Assertion in SSTableReader during repair.
 Key: CASSANDRA-7000
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7000
 Project: Cassandra
  Issue Type: Bug
Reporter: Ben Chan
 Attachments: sstablereader-assertion-bisect-helper, 
sstablereader-assertion.patch

I ran a {{git bisect run}} using the attached bisect script. Repro code:

{noformat}
# 5dfe241: trunk as of my git bisect run
# 345772d: empirically determined good commit.
git bisect start 5dfe241 345772d
git bisect run ./sstablereader-assertion-bisect-helper
{noformat}

The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).

Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
the reference count is initialized to 1, a SSTableReader#close() was always 
guaranteed to either throw an AssertionError or to be a second call to 
SSTableReader#tidy() on the same SSTableReader.

The attached patch chooses an in-between behavior. It requires the reference 
count to match the initialization value of 1 for SSTableReader#close(), and the 
same behavior as 5ebadc1 otherwise.

This allows repair to finish successfully, but I'm not 100% certain what the 
desired behavior is for SSTableReader#close(). Should it close without regard 
to reference count, as it did pre-5ebadc1?
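
To make the reference-count argument concrete, the following is a simplified, hypothetical model in Java, not the actual SSTableReader code. The class and method names (RefCountedReaderSketch, closeStrict, closePatched) are invented for illustration, and the patched check is one reading of the description; only the initial count of 1 and the two compared values come from the text above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the behavior described above; not the real SSTableReader. */
final class RefCountedReaderSketch
{
    // As in the description, the count starts at 1.
    private final AtomicInteger references = new AtomicInteger(1);
    private int tidyCalls = 0;

    void acquire()
    {
        references.incrementAndGet();
    }

    /** Dropping the last reference tidies the reader. */
    void release()
    {
        if (references.decrementAndGet() == 0)
            tidy();
    }

    /** 5ebadc1-style check: passes only after release() has already tidied. */
    void closeStrict()
    {
        if (references.get() != 0)
            throw new AssertionError("references = " + references.get());
        tidy(); // this is a second tidy() on the same reader
    }

    /** Patched check (assumed reading): the count must still be the initial 1. */
    void closePatched()
    {
        if (references.get() != 1)
            throw new AssertionError("references = " + references.get());
        tidy();
    }

    private void tidy()
    {
        tidyCalls++;
    }

    int tidyCalls()
    {
        return tidyCalls;
    }
}
```

In this model a freshly created reader fails closeStrict(), a reader that has been released passes it but runs tidy() twice, and closePatched() on a fresh reader succeeds with a single tidy().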





[jira] [Updated] (CASSANDRA-7000) Assertion in SSTableReader during repair.

2014-04-08 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-7000:


Attachment: sstablereader-assertion-bisect-helper-v2

 Assertion in SSTableReader during repair.
 -

 Key: CASSANDRA-7000
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7000
 Project: Cassandra
  Issue Type: Bug
Reporter: Ben Chan
Assignee: Ben Chan
 Attachments: sstablereader-assertion-bisect-helper, 
 sstablereader-assertion-bisect-helper-v2, sstablereader-assertion.patch


 I ran a {{git bisect run}} using the attached bisect script. Repro code:
 {noformat}
 # 5dfe241: trunk as of my git bisect run
 # 345772d: empirically determined good commit.
 git bisect start 5dfe241 345772d
 git bisect run ./sstablereader-assertion-bisect-helper
 {noformat}
 The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).
 Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
 5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
 the reference count is initialized to 1, a SSTableReader#close() was always 
 guaranteed to either throw an AssertionError or to be a second call to 
 SSTableReader#tidy() on the same SSTableReader.
 The attached patch chooses an in-between behavior. It requires the reference 
 count to match the initialization value of 1 for SSTableReader#close(), and 
 the same behavior as 5ebadc1 otherwise.
 This allows repair to finish successfully, but I'm not 100% certain what the 
 desired behavior is for SSTableReader#close(). Should it close without regard 
 to reference count, as it did pre-5ebadc1?




[jira] [Updated] (CASSANDRA-7000) Assertion in SSTableReader during repair.

2014-04-08 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-7000:


Description: 
I ran a {{git bisect run}} using the attached bisect script. Repro code:

{noformat}
# 5dfe241: trunk as of my git bisect run
# 345772d: empirically determined good commit.
git bisect start 5dfe241 345772d
git bisect run ./sstablereader-assertion-bisect-helper-v2
{noformat}

The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).

Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
the reference count is initialized to 1, a SSTableReader#close() was always 
guaranteed to either throw an AssertionError or to be a second call to 
SSTableReader#tidy() on the same SSTableReader.

The attached patch chooses an in-between behavior. It requires the reference 
count to match the initialization value of 1 for SSTableReader#close(), and the 
same behavior as 5ebadc1 otherwise.

This allows repair to finish successfully, but I'm not 100% certain what the 
desired behavior is for SSTableReader#close(). Should it close without regard 
to reference count, as it did pre-5ebadc1?

Edit: accidentally uploaded a flawed version of 
{{sstablereader-assertion-bisect-helper}} (doesn't work out-of-the-box with 
{{git bisect}}).


  was:
I ran a {{git bisect run}} using the attached bisect script. Repro code:

{noformat}
# 5dfe241: trunk as of my git bisect run
# 345772d: empirically determined good commit.
git bisect start 5dfe241 345772d
git bisect run ./sstablereader-assertion-bisect-helper
{noformat}

The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).

Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
the reference count is initialized to 1, a SSTableReader#close() was always 
guaranteed to either throw an AssertionError or to be a second call to 
SSTableReader#tidy() on the same SSTableReader.

The attached patch chooses an in-between behavior. It requires the reference 
count to match the initialization value of 1 for SSTableReader#close(), and the 
same behavior as 5ebadc1 otherwise.

This allows repair to finish successfully, but I'm not 100% certain what the 
desired behavior is for SSTableReader#close(). Should it close without regard 
to reference count, as it did pre-5ebadc1?



 Assertion in SSTableReader during repair.
 -

 Key: CASSANDRA-7000
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7000
 Project: Cassandra
  Issue Type: Bug
Reporter: Ben Chan
Assignee: Ben Chan
 Attachments: sstablereader-assertion-bisect-helper, 
 sstablereader-assertion-bisect-helper-v2, sstablereader-assertion.patch


 I ran a {{git bisect run}} using the attached bisect script. Repro code:
 {noformat}
 # 5dfe241: trunk as of my git bisect run
 # 345772d: empirically determined good commit.
 git bisect start 5dfe241 345772d
 git bisect run ./sstablereader-assertion-bisect-helper-v2
 {noformat}
 The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).
 Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
 5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
 the reference count is initialized to 1, a SSTableReader#close() was always 
 guaranteed to either throw an AssertionError or to be a second call to 
 SSTableReader#tidy() on the same SSTableReader.
 The attached patch chooses an in-between behavior. It requires the reference 
 count to match the initialization value of 1 for SSTableReader#close(), and 
 the same behavior as 5ebadc1 otherwise.
 This allows repair to finish successfully, but I'm not 100% certain what the 
 desired behavior is for SSTableReader#close(). Should it close without regard 
 to reference count, as it did pre-5ebadc1?
 Edit: accidentally uploaded a flawed version of 
 {{sstablereader-assertion-bisect-helper}} (doesn't work out-of-the-box with 
 {{git bisect}}).




[jira] [Commented] (CASSANDRA-7000) Assertion in SSTableReader during repair.

2014-04-08 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963638#comment-13963638
 ] 

Ben Chan commented on CASSANDRA-7000:
-

It was against 5ebadc1, but applies cleanly against the current trunk (ccef061 
as of this moment). I'm getting a problem with ccef061 with a failure at {{ccm 
start}}, but here is an (admittedly simple) before/after test for e1e91be:

{noformat}
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12639219/sstablereader-assertion.patch \
  $W/12639294/sstablereader-assertion-bisect-helper-v2
do [ -e $(basename $url) ] || curl -sO $url; done &&
chmod +x sstablereader-assertion-bisect-helper-v2

git checkout e1e91be

# sanity check downloaded scripts as always
./sstablereader-assertion-bisect-helper-v2 && echo ok

# should fail before and work after
git apply sstablereader-assertion.patch &&
./sstablereader-assertion-bisect-helper-v2 && echo ok
{noformat}

(You can ignore all the errors due to patches that don't apply; I needed to fix 
a build problem in order to successfully run the bisect.)


 Assertion in SSTableReader during repair.
 -

 Key: CASSANDRA-7000
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7000
 Project: Cassandra
  Issue Type: Bug
Reporter: Ben Chan
Assignee: Ben Chan
 Attachments: sstablereader-assertion-bisect-helper, 
 sstablereader-assertion-bisect-helper-v2, sstablereader-assertion.patch


 I ran a {{git bisect run}} using the attached bisect script. Repro code:
 {noformat}
 # 5dfe241: trunk as of my git bisect run
 # 345772d: empirically determined good commit.
 git bisect start 5dfe241 345772d
 git bisect run ./sstablereader-assertion-bisect-helper-v2
 {noformat}
 The first failing commit is 5ebadc1 (first parent of {{refs/bisect/bad}}).
 Prior to 5ebadc1, SSTableReader#close() never checked reference count. After 
 5ebadc1, there was an assertion for {{references.get() == 0}}. However, since 
 the reference count is initialized to 1, a SSTableReader#close() was always 
 guaranteed to either throw an AssertionError or to be a second call to 
 SSTableReader#tidy() on the same SSTableReader.
 The attached patch chooses an in-between behavior. It requires the reference 
 count to match the initialization value of 1 for SSTableReader#close(), and 
 the same behavior as 5ebadc1 otherwise.
 This allows repair to finish successfully, but I'm not 100% certain what the 
 desired behavior is for SSTableReader#close(). Should it close without regard 
 to reference count, as it did pre-5ebadc1?
 Edit: accidentally uploaded a flawed version of 
 {{sstablereader-assertion-bisect-helper}} (doesn't work out-of-the-box with 
 {{git bisect}}).




[jira] [Created] (CASSANDRA-6991) cql3 grammar generation can fail due to ANTLR timeout.

2014-04-07 Thread Ben Chan (JIRA)

Ben Chan created CASSANDRA-6991:
---

 Summary: cql3 grammar generation can fail due to ANTLR timeout.
 Key: CASSANDRA-6991
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6991
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ben Chan
Assignee: Ben Chan
Priority: Trivial
 Attachments: xconversiontimeout.patch

Because of the technique used in {{Cql.g}} to tokenize both case-insensitive 
keywords and case-sensitive identifiers, builds can fail intermittently on 
machines that fall below some speed threshold. ANTLR aborts DFA generation if 
it takes longer than a preset amount of time, so a machine that is slow enough 
can hit the timeout. An easy workaround is to raise the limit with the 
{{-Xconversiontimeout}} option (patch attached).




[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-04-05 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13961081#comment-13961081
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Rebased, but can't test properly, because the trunk as of 64bc45849fd2 (my 
{{git rebase}} base) is throwing an error:

{noformat}
java.lang.ExceptionInInitializerError: null
    at org.apache.cassandra.config.KSMetaData.systemKeyspace(KSMetaData.java:92) ~[main/:na]
    at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:545) ~[main/:na]
    at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:128) ~[main/:na]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:109) [main/:na]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:454) [main/:na]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:543) [main/:na]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.SyntaxException: line 1:127 mismatched character 'T' expecting set null
    at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:518) ~[main/:na]
    at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:501) ~[main/:na]
    at org.apache.cassandra.config.CFMetaData.<clinit>(CFMetaData.java:99) ~[main/:na]
    ... 6 common frames omitted
Caused by: org.apache.cassandra.exceptions.SyntaxException: line 1:127 mismatched character 'T' expecting set null
    at org.apache.cassandra.cql3.CqlLexer.throwLastRecognitionError(CqlLexer.java:201) ~[main/:na]
    at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:348) ~[main/:na]
    at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:509) ~[main/:na]
    ... 8 common frames omitted
{noformat}

Things are currently hectic, so I don't know when I'll have the chance to have 
a proper look at it. I'm currently doing a {{git bisect run}} so maybe 
something obvious will turn up.


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, ccm-repair-test, 
 cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, 
 test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results the way query tracing 
 stores traces in the system keyspace. With it, you don't have to look up each log 
 file to see what the status was and how the repair you invoked performed. 
 Instead, you can query the repair log with the session ID to see the state and 
 stats of all nodes involved in that repair session.




[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-30 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch

 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, ccm-repair-test, 
 cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, 
 test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results the way query tracing 
 stores traces in the system keyspace. With it, you don't have to look up each log 
 file to see what the status was and how the repair you invoked performed. 
 Instead, you can query the repair log with the session ID to see the state and 
 stats of all nodes involved in that repair session.




[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-30 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: (was: 
5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch)

 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 ccm-repair-test, cqlsh-left-justify-text-columns.patch, 
 prerepair-vs-postbuggedrepair.diff, test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results the way query tracing 
 stores traces in the system keyspace. With it, you don't have to look up each log 
 file to see what the status was and how the repair you invoked performed. 
 Instead, you can query the repair log with the session ID to see the state and 
 stats of all nodes involved in that repair session.




[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-03-30 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13954746#comment-13954746
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Ouch. After running a before and after test, I'm 99% sure this was the problem. 
There was some obviously wrong code in {{waitActivity}} (an older version used 
0 instead of -1 to signify done; I apparently forgot to update everything 
when I changed this).
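
The kind of mismatch described above can be sketched as follows. This is a hypothetical illustration in Java, not the code from the patch: the DONE constant, the waitActivity signature, and the queue-based event source are all assumptions. It only shows why every caller has to agree on the value that signifies done.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

/** Hypothetical sketch of a polling loop with a "done" sentinel; not the patch's code. */
final class WaitActivitySketch
{
    /** Single definition of the sentinel, so callers cannot drift back to 0. */
    static final int DONE = -1;

    private final Queue<Integer> pending;

    WaitActivitySketch(Integer... eventCounts)
    {
        pending = new ArrayDeque<>(Arrays.asList(eventCounts));
    }

    /** Returns the number of new events (possibly 0), or DONE when nothing is left. */
    int waitActivity()
    {
        Integer next = pending.poll();
        return next == null ? DONE : next;
    }

    /** Polls until DONE and returns the total number of events seen. */
    int drain()
    {
        int total = 0;
        int n;
        // A stale check such as "!= 0" would treat an idle poll as the end
        // and would keep looping forever once DONE is returned.
        while ((n = waitActivity()) != DONE)
            total += n;
        return total;
    }
}
```

For example, `new WaitActivitySketch(3, 0, 2).drain()` keeps polling through the idle 0 and stops only at DONE.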

Sorry about removing the previous patch. It didn't have the correct {{git diff 
-p}} parameters.

For convenience:

{noformat}
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12637720/5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v09-*.patch &&
ant clean && ant
{noformat}

Here's what I used to test with; I get slower and slower repairs, with a hang 
on the 5th repair with the before code, and consistent 10-second repairs with 
the after code.

{noformat}
cat > ccm-nodetool <<"EE"
#!/bin/sh

# ccm doesn't let us call nodetool with options, but we still need to get the
# host and port config from it.
read -r JMXGET <<E
/jmx_port/{p=\$2;} \
/binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p);}
E

NODETOOL=$(ccm $1 show | awk -F= "$JMXGET")
shift
$NODETOOL "$@"
EE
chmod +x ccm-nodetool
for x in $(seq 3); do
  for y in $(seq 2); do
    echo repair node$x \#$y
    ./ccm-nodetool node$x repair -tr
  done
done
{noformat}


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, ccm-repair-test, 
 cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, 
 test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results the way query tracing 
 stores traces in the system keyspace. With it, you don't have to look up each log 
 file to see what the status was and how the repair you invoked performed. 
 Instead, you can query the repair log with the session ID to see the state and 
 stats of all nodes involved in that repair session.




[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-03-30 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13954746#comment-13954746
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 3/30/14 5:24 PM:
--

Ouch. After running a before and after test, I'm 99% sure this was the problem. 
There was some obviously wrong code in {{waitActivity}} (an older version used 
0 instead of -1 to signify done; I apparently forgot to update everything 
when I changed this).

Sorry about removing the previous patch. It didn't have the correct {{git diff 
-p}} parameters.

For convenience:

{noformat}
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12637720/5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v09-*.patch &&
ant clean && ant
{noformat}

Here's what I used to test with; I get slower and slower repairs, with a hang 
on the 5th repair with the before code, and consistent 10-second repairs with 
the after code.

{noformat}
cat > ccm-nodetool <<"EE"
#!/bin/sh

# ccm doesn't let us call nodetool with options, but we still need to get the
# host and port config from it.
read -r JMXGET <<E
/jmx_port/{p=\$2;} \
/binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p);}
E

NODETOOL=$(ccm $1 show | awk -F= "$JMXGET")
shift
$NODETOOL "$@"
EE

chmod +x ccm-nodetool

for x in $(seq 3); do
  for y in $(seq 2); do
    echo repair node$x \#$y
    ./ccm-nodetool node$x repair -tr
  done
done
{noformat}

edit: minor awk code cleanup, properly nest heredocs.



was (Author: usrbincc):
Ouch. After running a before and after test, I'm 99% sure this was the problem. 
There was some obviously wrong code in {{waitActivity}} (an older version used 
0 instead of -1 to signify done; I apparently forgot to update everything 
when I changed this).

Sorry about removing the previous patch. It didn't have the correct {{git diff 
-p}} parameters.

For convenience:

{noformat}
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12637720/5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v09-*.patch &&
ant clean && ant
{noformat}

Here's what I used to test with; I get slower and slower repairs, with a hang 
on the 5th repair with the before code, and consistent 10-second repairs with 
the after code.

{noformat}
cat > ccm-nodetool <<"E"
#!/bin/sh

# ccm doesn't let us call nodetool with options, but we still need to get the
# host and port config from it.
read -r JMXGET <<E
/jmx_port/{p=\$2;} \
/binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}
E

NODETOOL=$(ccm $1 show | awk -F= "$JMXGET")
shift
$NODETOOL "$@"
E
chmod +x ccm-nodetool
for x in $(seq 3); do
  for y in $(seq 2); do
    echo repair node$x \#$y
    ./ccm-nodetool node$x repair -tr
  done
done
{noformat}


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, ccm-repair-test, 
 cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, 
 test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results like query tracing

[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-17 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 
5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch
5483-v08-14-Poll-system_traces.events.patch

5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch

5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch
5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch

 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 ccm-repair-test, cqlsh-left-justify-text-columns.patch, 
 test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results like query tracing 
 stores traces to system keyspace. With it, you don't have to lookup each log 
 file to see what was the status and how it performed the repair you invoked. 
 Instead, you can query the repair log with session ID to see the state and 
 stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-03-17 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937976#comment-13937976
 ] 

Ben Chan commented on CASSANDRA-5483:
-

{noformat}
# tested with branch 5483 @ bce0c2c555a3; should also work following successful
#git apply 5483-full-trunk.txt
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12635094/5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch \
  $W/12635095/5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch \
  $W/12635096/5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch \
  $W/12635097/5483-v08-14-Poll-system_traces.events.patch \
  $W/12635098/5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch
do [ -e $(basename $url) ] || curl -sO $url; done
git apply 5483-v08-*.patch
ant clean && ant

./ccm-repair-test -kR 
ccm node1 stop 
ccm node1 clear 
ccm node1 start 
./ccm-repair-test -rt
{noformat}

* {{v08-11}} There was an error in one of the log formats in Differencer, which 
made my grep for "out of sync" in the logs fruitless.
* {{v08-12}} I ended up using the handleStreamEvent of StreamingRepairTask 
instead of implementing and registering my own StreamEventHandler. The new 
trace messages may need adjusting, especially for ProgressEvent, which is 
essentially just a toString currently.
* {{v08-13}} This works by adding a guarded sendNotification to 
TraceState#trace.
* {{v08-14}} This works by starting a thread to poll {{system_traces.events}}, 
and by adding notify functionality to TraceState. There is some jitter in the 
ordering between local and remote traces. An easy fix would be to have the 
query thread handle all sendNotification of traces. You have to accept latency 
in sendNotification of local traces in order to get better ordering. It might 
be necessary to delay all trace sendNotification by a few seconds to make it 
more likely that remote traces have arrived.
* {{v08-15}} Even more added TraceState functionality. All to try to reduce the 
amount of polling without hurting latency too much. There are only a few local 
traces that you would expect to be followed by a remote trace, so only wake up 
for those. Poll with an exponential backoff after each notification.
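
The polling schedule described for {{v08-15}} can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name {{PollBackoff}}, its methods, and the delay bounds are all assumptions made for the example. The delay doubles after every poll up to a cap, and goes back to the minimum when a local trace that is expected to be followed by remote traces is seen.

```java
/** Hypothetical helper for illustration; not taken from the v08-15 patch. */
final class PollBackoff {
    private final long minDelayMs;
    private final long maxDelayMs;
    private long nextDelayMs;

    PollBackoff(long minDelayMs, long maxDelayMs) {
        if (minDelayMs <= 0 || maxDelayMs < minDelayMs)
            throw new IllegalArgumentException("need 0 < min <= max");
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.nextDelayMs = minDelayMs;
    }

    /** Call when a notification arrives that makes remote traces likely. */
    void reset() {
        nextDelayMs = minDelayMs;
    }

    /** Returns the delay to wait before the next poll, then doubles it up to the cap. */
    long nextDelayMs() {
        long current = nextDelayMs;
        // Compare against max/2 first so that the doubling cannot overflow.
        nextDelayMs = (current > maxDelayMs / 2) ? maxDelayMs : current * 2;
        return current;
    }
}
```

With a minimum of 100 ms and a cap of 1000 ms, successive calls to {{nextDelayMs()}} give 100, 200, 400, 800, 1000, 1000, and so on until {{reset()}} is called.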

---

Heuristics are messy, and I expect plenty of opinions on {{v08-14}} and 
{{v08-15}}. I'm not especially proud of that code, but I can't think of 
anything better at the moment, given the (self-imposed?) constraints.

I may have reinvented the wheel with synchronization primitives. I checked 
{{java.util.concurrent.*}} and {{SimpleCondition}}, but not much beyond that. I 
could have missed something; I don't fully understand some of the classes. What 
I wanted was to be woken up (with a timeout) if anything has changed since the 
last time I checked. Theoretically, it should work for multiple consumers (As 
long as no one waits for longer than {{Integer.MAX_VALUE}} updates), though 
that's not really necessary here, if that would simplify the code.
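
For readers who want a concrete picture of "wake me up, with a timeout, if anything has changed since I last checked", here is a minimal sketch built on a plain monitor and an update counter. It is an assumption-laden illustration, not the TraceState code from the patches: the class name {{ChangeSignal}} and its methods are invented for the example. Each consumer remembers the last version it saw, so several consumers can wait independently.

```java
/** Hypothetical illustration; the actual TraceState changes may look different. */
final class ChangeSignal {
    private int version = 0; // bumped on every update; allowed to wrap around

    /** Producer side: record that something changed and wake all waiters. */
    synchronized void signalChange() {
        version++;
        notifyAll();
    }

    /** Current version; a consumer keeps this value between calls. */
    synchronized int version() {
        return version;
    }

    /**
     * Blocks until the version differs from lastSeen, the timeout elapses,
     * or the thread is interrupted. Returns the version observed on exit,
     * which equals lastSeen if nothing changed.
     */
    synchronized int awaitChange(int lastSeen, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (version == lastSeen) {
            long remainingNs = deadline - System.nanoTime();
            if (remainingNs <= 0)
                break; // timed out
            try {
                // remainingNs > 0 here, so this is never the wait-forever wait(0, 0).
                wait(remainingNs / 1_000_000L, (int) (remainingNs % 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                break;
            }
        }
        return version;
    }
}
```

The loop guards against spurious wakeups, and the inequality test means a consumer is only fooled if the counter wraps all the way around between two of its checks.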

The code seems to work reasonably well for small-scale tests. I can convince 
myself that it won't blow up for long repairs, but haven't done a full test yet.



[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-03-15 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13936173#comment-13936173
 ] 

Ben Chan commented on CASSANDRA-5483:
-

About the {{system_traces}} feedback loop during repair: I don't have a way to 
consistently reproduce it. Normally these test repairs take no longer than 1-2 
minutes, but occasionally one gets into the 10-20 minute range, with lots of 
merkle trees for {{system_traces}} getting sent ({{system_traces.events}} is in 
the tens of thousands of rows by then). At that point I usually just stop the 
repair and start fresh with a new cluster. I'll pay closer attention next time 
to see if I can get a good repro case.


[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-13 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: cqlsh-left-justify-text-columns.patch

Public TODO list; please comment if any should be not-TODO:
* Trace streaming and/or lack thereof (I think hooking {{Differencer#run}} and 
related threads should be enough).
* Maybe exclude {{system_traces}} from repair if a repair trace is going on. 
There seems to be a feedback loop triggering multiple repair commands otherwise.
* Maybe add a placeholder row with a null {{duration}} for ongoing repair 
sessions. Makes it easier to find the {{session_id}} for queries. Update with 
the final duration at the end.
* Populate {{started_at}}, {{request}}, etc in {{system_traces.sessions}}.
* Send the {{session_id}} back to nodetool.
* Shorten/simplify trace messages.
* Verbose option; dump all traces to nodetool.

Implementation thoughts follow; please warn of potential problems.

---

Verbose option:

To send local traces back to nodetool, adding a parallel {{sendNotification}} 
is easy enough. Getting the remote traces seems like it would involve 
monitoring updates to {{system_traces.events}}.

At first I thought triggers, but the docs say that triggers run on the 
coordinator node, which is not necessarily the node you're repairing. So that 
leaves polling the table with heuristics that are hopefully good enough to 
reduce the amount of extra work.

---

Simplify trace messages:

Skipping to the point of difference:

It looks like each sub-RepairSession has a unique session id (a timeuuid but 
different from either {{session_id}} or {{event_id}}). Here is a section of the 
select above aligned and simplified to increase SNR. The redacted parts are 
identical.
{noformat}
[repair #fedc3790-...] Received merkle tree for events from /127.0.0.1
[repair #fef40550-...] new session: will sync /127.0.0.1, /127.0.0.2 on range 
(3074457345618258602,-9223372036854775808] for system_traces.[sessions, events]
[repair #fef40550-...] requesting merkle trees for sessions (to [/127.0.0.2, 
/127.0.0.1])
[repair #fedc3790-...] session completed successfully
[repair #fef40550-...] Sending completed merkle tree to /127.0.0.1 for 
system_traces/sessions
{noformat}
In the example above, you can see some overlap in the repair session traces, so 
the sub-session_id (so to speak) has some use in distinguishing these. Since 
this sub-session_id only has to be unique for a particular repair session, 
maybe it would be worth it to map each one to a small integer?
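
The small-integer idea could look something like the sketch below. It is only an illustration under assumptions: the class name {{SubSessionLabels}} and the method {{labelFor}} are hypothetical, and nothing here is taken from the patches. Labels are handed out in order of first appearance within one repair command.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper: maps each sub-session UUID to 1, 2, 3, ... */
final class SubSessionLabels {
    private final Map<UUID, Integer> labels = new LinkedHashMap<UUID, Integer>();

    /** Returns the existing label for this id, or assigns the next free one. */
    synchronized int labelFor(UUID subSessionId) {
        Integer existing = labels.get(subSessionId);
        if (existing != null)
            return existing;
        int next = labels.size() + 1;
        labels.put(subSessionId, next);
        return next;
    }
}
```

A trace prefix could then read {{[repair #1]}} and {{[repair #2]}} instead of carrying the full timeuuid, with the mapping itself traced once when each label is assigned.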

For convenience, I attached a small, not-very-pretty patch that left-justifies 
columns of type text in cqlsh (makes it easier to read the traces).

---

Trace streaming:

Is there a simple way to create a situation where a repair requires streaming? 
Here is what I'm currently doing, but it doesn't work.

{noformat}
#!/bin/sh
ccm create $(mktemp -u 5483-XXX)
ccm populate -n 3
ccm updateconf --no-hinted-handoff
ccm start
ccm node1 cqlsh <<'E'
CREATE SCHEMA s1
WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };

CREATE TABLE s1.users (
  user_id varchar PRIMARY KEY,
  first varchar,
  last varchar,
  age int)
WITH read_repair_chance = 0.0;

INSERT INTO s1.users (user_id, first, last, age)
  VALUES ('jsmith', 'John', 'Smith', 42);
E

# Write rows through node2 while node1 is down.
ccm node1 stop
python - <<'E' | ccm node2 cqlsh
import random as r
fs=["John","Art","Skip","Doug","Koala"]
ls=["Jackson","Jacobs","Jefferson","Smythe"]
for (f, l) in [(f,l) for f in fs for l in ls]:
  print((
    "insert into s1.users (user_id, age, first, last) "
    "values('%s', %d, '%s', '%s');"
  ) % ((f[0]+l).lower(), r.randint(10,100), f, l))
E
ccm node2 cqlsh <<'E'
select count(*) from s1.users;
E
ccm node1 start
ccm node1 cqlsh <<'E'
select count(*) from s1.users;
E
nodetool -p $(ccm node1 show | awk -F= '/jmx_port/{print $2}') repair -tr s1
{noformat}

The problem is that despite disabling hinted handoff and setting 
{{read_repair_chance}} to 0, the endpoints are still reported as consistent in 
{{Differencer#run}}. Yet node1 is clearly missing some rows prior to the 
repair, and has them at the end. Somehow the streaming repair is getting done 
somewhere other than {{Differencer#run}}. Is some sort of handoff still being 
done somewhere? I'm sure there is something simple, but I'm missing it.



[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-11 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 
5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch

5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch
5483-v07-08-Fix-brace-style.patch

5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-03-08 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924874#comment-13924874
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Thanks for the catch. Hopefully the fix is as easy as:

{noformat}
// using OpenJDK's source code for Executors.newCachedThreadPool() as a reference
new DebuggableThreadPoolExecutor(0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                                 new SynchronousQueue<Runnable>(),
                                 new NamedThreadFactory("RepairJobTask"));
{noformat}

Also note that {{v02-01}} added {{CompactionExecutor extends 
DebuggableThreadPoolExecutor}}, so that is something else to watch out for.
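
To make the constructor arguments above easier to read, here is the same configuration expressed with JDK classes only. This is a sketch for illustration: {{CachedPoolSketch}} and {{newNamedCachedPool}} are hypothetical names, and the inline thread factory is merely a stand-in for Cassandra's {{NamedThreadFactory}}, whose exact behavior is not reproduced here. A core size of 0 with a {{SynchronousQueue}} means every submitted task is handed directly to an idle thread or to a newly created one, and idle threads are reclaimed after 60 seconds.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical JDK-only equivalent of a named cached thread pool. */
final class CachedPoolSketch {
    static ThreadPoolExecutor newNamedCachedPool(final String prefix) {
        final AtomicInteger counter = new AtomicInteger(1);
        // Stand-in for a named thread factory: threads are called prefix:1, prefix:2, ...
        ThreadFactory factory = new ThreadFactory() {
            public Thread newThread(Runnable r) {
                return new Thread(r, prefix + ":" + counter.getAndIncrement());
            }
        };
        // Same parameters that Executors.newCachedThreadPool() uses.
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                                      new SynchronousQueue<Runnable>(), factory);
    }
}
```

An unbounded queue combined with a core size of 0 would behave very differently (tasks would queue behind at most one thread), which is why the queue type matters as much as the size arguments.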

---

Re: {
Oops (I'm normally K&R).

Re: github
Move to a github workflow, or some sort of hybrid github/JIRA?

---

I guess the last caveat here is that most (I may have left in a few) of the old 
non-{{trace}} function overloads are gone, so the server isn't going to work 
with old versions of {{nodetool}}.

And a newer {{nodetool}} still won't work with older server versions. There may 
be some tricks you can do with JMX, but I don't have deep knowledge there. 
Maybe adding function overloads from the future (I noticed a similar trick in 
{{MessagingService.java}}).

And there is a lot of repetitious parallel code with the tracing and logging 
together. But that shouldn't affect the external functionality.



[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-06 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: ccm-repair-test
5483-v06-06-Fix-interruption-in-tracestate-propagation.patch
5483-v06-05-Add-a-command-column-to-system_traces.events.patch
5483-v06-04-Allow-tracing-ttl-to-be-configured.patch


[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-03-06 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922721#comment-13922721
 ] 

Ben Chan commented on CASSANDRA-5483:
-

It was more involved than I thought, partly because of heisenbugs and the trace 
state mysteriously not propagating (see {{v06-05}}).

Note: changing JMX can cause mysterious errors if you don't {{ant clean && 
ant}}. I ran into the same kinds of stack traces as you did. It's not 
consistent. Sometimes I can make a JMX change and {{ant}} with no problem.

To make patches simpler, I'm posting full repro code. I also tried to simplify 
the naming. Unfortunately, all the previous patches are in jumbled order due to 
a naming convention that doesn't sort. Fortunately, JIRA seems to have an 
easter egg where you can choose the attachment name by changing the url.

{noformat}
# Uncomment to exactly reproduce state.
#git checkout -b 5483-e30d6dc e30d6dc

# Download all needed patches with consistent names, apply patches, build.
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12630490/5483-v02-01-Trace-filtering-and-tracestate-propagation.patch \
  $W/12630491/5483-v02-02-Put-a-few-traces-parallel-to-the-repair-logging.patch \
  $W/12631967/5483-v03-03-Make-repair-tracing-controllable-via-nodetool.patch \
  $W/12633153/5483-v06-04-Allow-tracing-ttl-to-be-configured.patch \
  $W/12633154/5483-v06-05-Add-a-command-column-to-system_traces.events.patch \
  $W/12633155/5483-v06-06-Fix-interruption-in-tracestate-propagation.patch \
  $W/12633156/ccm-repair-test
do [ -e $(basename $url) ] || curl -sO $url; done
git apply 5483-v0[236]-*.patch
ant clean && ant

# put on a separate line because you should at least minimally inspect
# arbitrary code before running.
chmod +x ./ccm-repair-test && ./ccm-repair-test
{noformat}

{{ccm-repair-test}} has some options for convenience:
{noformat}
-k keep (don't delete) the created cluster after successful exit.
-r repair only
-R don't repair
-t do traced repair only
-T don't do traced repair (if neither, then do both traced and untraced repair)
{noformat}

The output of a test run:

{noformat}
Current cluster is now: test-5483-QiR
[2014-03-06 10:46:13,617] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:13,646] Starting repair command #1, repairing 2 ranges for 
keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:16,999] Repair session 72648190-a546-11e3-a5f4-f94811c7b860 
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:17,465] Repair session 73ee2ed0-a546-11e3-a5f4-f94811c7b860 
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:17,465] Repair command #1 finished
[2014-03-06 10:46:17,485] Starting repair command #2, repairing 2 ranges for 
keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:18,782] Repair session 74aaef20-a546-11e3-a5f4-f94811c7b860 
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:18,816] Repair session 74ff0290-a546-11e3-a5f4-f94811c7b860 
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:18,816] Repair command #2 finished
0 rows exported in 0.015 seconds.
test-5483-QiR-system_traces-events.txt
ok
[2014-03-06 10:46:24,128] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:24,166] Starting repair command #3, repairing 2 ranges for 
keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:25,366] Repair session 78a6d4e0-a546-11e3-a5f4-f94811c7b860 
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:25,415] Repair session 79263e10-a546-11e3-a5f4-f94811c7b860 
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:25,415] Repair command #3 finished
[2014-03-06 10:46:25,485] Starting repair command #4, repairing 2 ranges for 
keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:27,077] Repair session 796f7c10-a546-11e3-a5f4-f94811c7b860 
for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:27,120] Repair session 79f240a0-a546-11e3-a5f4-f94811c7b860 
for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:27,120] Repair command #4 finished
48 rows exported in 0.104 seconds.
test-5483-QiR-system_traces-events-tr.txt
found source: 127.0.0.1
found thread: Thread-15
found thread: AntiEntropySessions:1
found thread: RepairJobTask:1
found source: 127.0.0.2
found thread: AntiEntropyStage:1
found source: 127.0.0.3
found thread: AntiEntropySessions:2
found thread: Thread-16
found thread: AntiEntropySessions:3
found thread: AntiEntropySessions:4
unique sources traced: 3
unique threads traced: 8
All thread categories accounted for
ok
{noformat}

---

Patch comments:

- {{v06-04}} I did something similar to {{v03-03}}, (almost) no refactoring. 
The implementation is a little messy architecturally.
- {{v06-05}} This is the suggestion you had to add a command

[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-03-01 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 
v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch

v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch

v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch

These next few patches are different variations on adding {{nodetool}} control 
of tracing (Use {{nodetool repair -tr}}), and are all based off of the 
{{v02-0002}} patch, hence the naming and the {{0003}} numbering. An overview:

* {{v03}} simply adds a {{trace}} boolean to all of the repair functions.
* {{v04}} does not actually work, due to complications with sending an EnumSet 
parameter via JMX.
* {{v05}} has the same structure as {{v04}}, except it uses a {{long}} and 
traditional bitflags.

The idea for {{v04}} and {{v05}} was to consolidate all of the boolean options 
into a single parameter. That way, adding or removing boolean options doesn't 
require you to modify the entire call chain. It also makes binary compatibility 
easier going forward (no need to maintain an ever growing list of function 
overloads). I'm personally leaning towards {{v05}}.
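
As a rough picture of the {{v05}} approach, the sketch below packs a set of boolean options into a single {{long}} so that only a primitive has to cross JMX, then unpacks it on the other side. The class {{RepairFlags}} and the option names are assumptions for illustration only; they are not the identifiers used in the patch.

```java
import java.util.EnumSet;

/** Hypothetical illustration of passing boolean options as bitflags in a long. */
final class RepairFlags {
    /** Option names invented for this example. */
    enum Option { SEQUENTIAL, LOCAL_DC, PRIMARY_RANGE, FULL, TRACE }

    /** Client side: fold the selected options into one long. */
    static long toBits(EnumSet<Option> options) {
        long bits = 0L;
        for (Option o : options)
            bits |= 1L << o.ordinal();
        return bits;
    }

    /** Server side: recover the options; bits with no matching option are ignored. */
    static EnumSet<Option> fromBits(long bits) {
        EnumSet<Option> options = EnumSet.noneOf(Option.class);
        for (Option o : Option.values())
            if ((bits & (1L << o.ordinal())) != 0)
                options.add(o);
        return options;
    }
}
```

One design caveat: deriving the bit position from {{ordinal()}} keeps the example short, but reordering the enum would silently change the wire format. Giving each option an explicit, fixed bit value would be the safer choice if binary compatibility between versions is the goal.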

---

About tracing to a separate table: an earlier comment mentioned wanting to 
trace bootstrap and decommission. I wonder if these would go into that same 
table. If so, I am thinking of calling the new table something generic like 
{{system_traces.trace_logs}}. I also assume that, like 
{{system_traces.events}}, the rows in this table should expire, though perhaps 
not as fast as 24 hours. Thoughts on the naming and the use of the 
{{system_traces}} schema?

---

One last thing I wanted to ask is about the possibility of trace log levels. 
What is the minimum amount of trace log information you would find useful, the 
next amount, and so on? Should it just follow the loglevel? (One possible 
problem with that is you can't change loglevel without a server restart.)

I don't run Cassandra in any sort of production environment, so I'm not as 
familiar with the use cases as I would like.



[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-02-22 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: 
trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch

trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch

Sorry; I forgot to check email for a few days. Attached new patch version 
rebased onto trunk commit {{4620823}}. The previous minimal test still works 
(despite gratuitous use of {{cat}}).


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt






[jira] [Commented] (CASSANDRA-5483) Repair tracing

2014-01-13 Thread Ben Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870131#comment-13870131
 ] 

Ben Chan commented on CASSANDRA-5483:
-

Thus far, I've been working under the assumption that it's mostly meant to be 
used as a combination of performance profiling and some sort of globally 
accessible error log (for that particular repair session), though there will 
probably be enough information there to extract some stats.

I could use some suggestions on what things (and what level of detail) would be 
good to trace. In addition to what I already have, I can only think of a few 
places in {{Differencer}} and {{StreamingRepairTask}}, along with more 
complete coverage of the repair-related error logs.
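
As a rough illustration of the kind of stats that could be extracted, here is a 
small Python sketch. It is not part of any patch. The row shape (source, 
source_elapsed, activity) is an assumption about what one would read back from 
{{system_traces.events}} for a single session, and {{per_node_stats}} is a 
made-up name.

```python
from collections import defaultdict


def per_node_stats(rows):
    """Return {source: (event_count, max_source_elapsed)} for one session.

    rows: iterable of (source, source_elapsed, activity) tuples.
    This tuple shape is hypothetical, chosen only for illustration.
    """
    stats = defaultdict(lambda: [0, 0])
    for source, elapsed, _activity in rows:
        entry = stats[source]
        entry[0] += 1                      # one more event from this node
        entry[1] = max(entry[1], elapsed)  # largest elapsed value seen so far
    return {source: tuple(entry) for source, entry in stats.items()}
```

Even something this crude would show which node produced the most events and 
which one took longest, without opening a log file on each node.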


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Attachments: test-5483-system_traces-events.txt, 
 tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt






[jira] [Updated] (CASSANDRA-5483) Repair tracing

2014-01-11 Thread Ben Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chan updated CASSANDRA-5483:


Attachment: test-5483-system_traces-events.txt
            tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt
            tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt

This patch contains only the minimum necessary to get a proof-of-concept 
working, but I thought I'd get some feedback before going any further.

Currently it copies some of the repair log messages into 
{{system_traces.events}}. For the simple test below, it's not too bad, but the 
traces seem to get large fast as the amount of data grows. The same technique 
used for repair seems to work for bootstrap and decommission, with the caveat 
that if you try to trace too far into the decommission process, the trace 
doesn't get propagated to the other nodes.

For convenience:
{noformat}
# optional; patch should apply with any recent trunk
#git checkout -b 5483-repair-filtering 8ebeee1
git apply tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt
git apply tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt
ant

# simple test
ccm create test-5483
ccm populate -n 3
ccm start

cat <<EOT | ccm node1 cqlsh
CREATE SCHEMA s1
WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };

USE s1;

CREATE TABLE users (
  user_id varchar PRIMARY KEY,
  first varchar,
  last varchar,
  age int
);

INSERT INTO users (user_id, first, last, age)
  VALUES ('jsmith', 'John', 'Smith', 42);
EOT

ccm node1 repair

cat <<EOT | ccm node1 cqlsh
copy system_traces.events to 'test-5483-system_traces-events.txt';
EOT
{noformat}
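
To get a feel for how quickly the traces grow, the exported file can be tallied 
per session and per node. The following is a minimal Python sketch, not part of 
the patch. It assumes the {{COPY}} output is plain CSV and that the columns come 
out in the order session_id, event_id, activity, source, source_elapsed, thread, 
which may not hold for every version. The name {{summarize_events}} is 
hypothetical.

```python
import csv
from collections import Counter

# Assumed column order of the COPY output; adjust if your schema differs.
COLUMNS = ["session_id", "event_id", "activity", "source", "source_elapsed", "thread"]


def summarize_events(lines):
    """Count trace events per (session_id, source) pair from CSV lines."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        counts[(record["session_id"], record["source"])] += 1
    return counts
```

Passing an open file handle for {{test-5483-system_traces-events.txt}} (opened 
with {{newline=""}}) to {{summarize_events}} should give one count per session 
and node, which is enough to compare trace volume across runs.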


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
Reporter: Yuki Morishita
Priority: Minor
  Labels: repair
 Attachments: test-5483-system_traces-events.txt, 
 tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt





