[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379093#comment-16379093 ] Chris Lohfink commented on CASSANDRA-8341: -- Patch changed output of tpstats to look like: {code} Pool NameActive Pending Completed Blocked AllTimeBlocked CPU[ms/sec] Allocations[mb/s] AntiEntropyStage 0 0 0 0 0 0 0 CacheCleanupExecutor 0 0 0 0 0 0 0 CompactionExecutor 0 0 1013 0 0 0 0 CounterMutationStage 0 0 0 0 0 0 0 GossipStage 0 0 0 0 0 0 0 HintsDispatcher 0 0 0 0 0 0 0 InternalResponseStage0 0 0 0 0 0 0 MemtableFlushWriter 0 0 1 0 0 0 0 MemtablePostFlush0 0 2 0 0 0 0 MemtableReclaimMemory0 0 1 0 0 0 0 MigrationStage 0 0 0 0 0 0 0 MiscStage0 0 0 0 0 0 0 MutationStage0 0 1367500 0 27 2 Native-Transport-Requests11 0 36191532 0 0 1839566 PendingRangeCalculator 0 0 2 0 0 0 0 PerDiskMemtableFlushWriter_0 0 0 1 0 0 0 0 ReadRepairStage 0 0 0 0 0 0 0 ReadStage0 0 22662142 0 0 349 58 Repair-Task 0 0 0 0 0 0 0 RequestResponseStage 0 0 0 0 0 0 0 Sampler 0 0 0 0 0 0 0 SecondaryIndexManagement 0 0 0 0 0 0 0 ValidationExecutor 0 0 0 0 0 0 0 ViewBuildExecutor0 0 0 0 0 0 0 ViewMutationStage0 0 0 0 0 0 0 {code} > Expose time spent in each thread pool > - > > Key: CASSANDRA-8341 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 > Project: Cassandra > Issue Type: New Feature >Reporter: Chris Lohfink >Assignee: Chris Lohfink >Priority: Minor > Labels: metrics > Attachments: 8341.patch, 8341v2.txt > > > Can increment a counter with time spent in each queue. This can provide > context on how much time is spent percentage wise in each stage. > Additionally can be used with littles law in future if ever want to try to > tune the size of the pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379090#comment-16379090 ] ASF GitHub Bot commented on CASSANDRA-8341: --- GitHub user clohfink opened a pull request: https://github.com/apache/cassandra/pull/200 Add tpstats cpu and alloc rate tracking for CASSANDRA-8341 You can merge this pull request into a Git repository by running: $ git pull https://github.com/clohfink/cassandra 8341 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/cassandra/pull/200.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #200 commit 1d47a63f5c334998cb8e948f5114c1e3cbc32103 Author: Chris LohfinkDate: 2018-02-27T18:04:39Z Add tpstats cpu and alloc rate tracking for CASSANDRA-8341 > Expose time spent in each thread pool > - > > Key: CASSANDRA-8341 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 > Project: Cassandra > Issue Type: New Feature >Reporter: Chris Lohfink >Assignee: Chris Lohfink >Priority: Minor > Labels: metrics > Attachments: 8341.patch, 8341v2.txt > > > Can increment a counter with time spent in each queue. This can provide > context on how much time is spent percentage wise in each stage. > Additionally can be used with littles law in future if ever want to try to > tune the size of the pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229712#comment-14229712 ] Robert Stupp commented on CASSANDRA-8341: - Note: {{TheadMXBean.getCurrentThread...()}} definitely performs better than {{System.nanoTime()}} - at least on OSX with a single 8-core CPU. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229780#comment-14229780 ] Benedict commented on CASSANDRA-8341: - SEPWorker already grabs the nanoTime on exiting and entering its spin phase, so tracking this would be pretty much free (we'd need to check it once if we swapped the executor we're working on without entering a spinning state). Flushing pent up data is pretty trivial; you can set a max time to buffer, so it ensures it's never more than a few seconds (or millis) out of date, say. Enough to keep the cost too small to measure. I'm a little dubious about tracking two completely different properties as the same thing though. CPUTime cannot be composed with nanoTime sensibly, so we either want to track one or the other across all executors. Since the other executors are all the ones that do infrequent expensive work (which is explicitly why they haven't been transitioned to SEP), tracking nanoTime on them won't be an appreciable cost. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229796#comment-14229796 ] T Jake Luciani commented on CASSANDRA-8341: --- I added some code to do this in a branch and have been meaning to break it out. It adds LatencyMetrics to AbstractTracingAwareExecutorService and tracks time spent in queue for each FutureTask. The code is here https://github.com/tjake/cassandra/compare/new-executor#diff-9860b2ae7a7e9e05e2165fd319f1398eL26 The SEPExecutor would need something similar added. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229805#comment-14229805 ] Benedict commented on CASSANDRA-8341: - Ah, that's a good question: are we talking about queue latency or time spent processing each queue? The two are very different, and it sounded like we were discussing the latter, but the ticket description does sound more like the former. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229859#comment-14229859 ] Chris Lohfink commented on CASSANDRA-8341: -- I want to see time spent processing per pool (user time may be more appropriate than walltime(nanotime) or cpu's system time). This way we can track where cpu burn is occurring, and display a % of cpu in tpstats/opscenter by pool. So while latencyutils or a histogram would certainly be interesting its more than necessary for this task. A simple counter or meter would be sufficient. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229861#comment-14229861 ] T Jake Luciani commented on CASSANDRA-8341: --- Ah, I was confused by the summary then. I'll open a separate ticket for showing time waiting in queue. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229872#comment-14229872 ] Benedict commented on CASSANDRA-8341: - That is difficult, since we have stages that perform work that does not consume CPU. The RPC stage (for thrift or cql) both spend the majority of their time _waiting_ for the relevant work stage to complete. The proposed approaches would count this as busy time. The read and write stages also can block on IO, the former more often than the latter, but in either case we would count erroneously. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230164#comment-14230164 ] Chris Lohfink commented on CASSANDRA-8341: -- I'll write up something in a little bit, but a scheduled task to walk the threads and capture the User time from the ThreadMXBean could give a good picture of which thread pools are burning cpu (including cpu burnt waiting). Can capture it from the SEPWorker when changing assigned executors. A problem is (thanks [~benedict]) in the maybeExecuteImmediately when not adding the task, that would need to be wrapped in an expensive call which adds latency to read/writes. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219174#comment-14219174 ] Robert Stupp commented on CASSANDRA-8341: - Yea - that {{TheadMXBean.getCurrentThread...()}} hopefully directly accesses the CPU registers for that. Not sure about that since the implementations are very OS specific (in {{hotspot/os/*/vm/os_*.cpp}} in OpenJDK source) and as such also specific to the CPU model. If all the implementations access the CPU registers for that, it definitely better than {{System.nanoTime()}, which is consistent over CPU codes (- NUMA - expensive). Note: I don't want to make this ticket more complicated as necessary - just want to mention that metrics have their own overhead and should be used with care in sensitive areas like thread scheduling. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219374#comment-14219374 ] Chris Lohfink commented on CASSANDRA-8341: -- yep understandable. Id like this if possible though so let me know if theres anything I can do. A lot of the metrics library work happens after the task is complete, in cases like read/write's they send back to a request response stage so at least the post task work shouldn't affect the read/write latency until the thread pool is saturated. I am not really sure if I know anything I can do to counter the ~100ns-1us overhead though - any ideas? Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219444#comment-14219444 ] Chris Lohfink commented on CASSANDRA-8341: -- In the SEPExecutor since the same thread will continually execute tasks if theres work, I think its possible to just collect the start time before starting any work and then the end time once theres a gap in processing. This way if the thread pool is saturated it should incur no overhead from monitoring, but once there is a break in the work or a executor has run out of task permits it would update the cpu time. Then on a saturated system, the metric may be out of date but would have less of an overhead. I am not completely familiar with the SEPWorker so I need to do a little more research, but would that be adequate? The type of work in tasks using the ThreadPoolExecutor (i.e. migration, repairs, commitlog archiver etc) I don't think would be impacted from the metrics overhead so can probably stick with impl used in the patched JMXEnabledThreadPoolExecutor with cpu time instead of nanotime. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218111#comment-14218111 ] Chris Lohfink commented on CASSANDRA-8341: -- Attached a first pass patch for an idea. I just recorded the wall time but may want to record cpu time as well ([getCurrentThreadCpuTime getCurrentThreadUserTime|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html#getCurrentThreadCpuTime()]). May be worth recording it with a histogram instead of just a counter as well, but for the purpose of exposing % of time I think the total is sufficient. I added an insertion meter as well for easier time in estimating different pool sizes (just easier then adding up the pending/completed/active and give sense of rate). Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218149#comment-14218149 ] Robert Stupp commented on CASSANDRA-8341: - FWIW it's not necessary to add another {{ThreadLocal}} to trace work-unit start time. Wrapping the {{Runnable}} using a static class containing the start-time feels cheaper. Adding to metrics code before or after work-unit execution transparently extends work-unit execution latency, which isn't measured. (Note that System.nanoTime() [may introduce latency|http://shipilev.net/blog/2014/nanotrusting-nanotime/]). Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219017#comment-14219017 ] Chris Lohfink commented on CASSANDRA-8341: -- as reference point, I did a [simplistic benchmark|https://gist.github.com/clohfink/9ee94e3767b5d8170220] (on a 2014 MBP) with JMH of a few scenarios from patch v1: {code} BenchmarkMode Samples Score Error Units baseline avgt 15 8300.492 ±92.987 ns/op staticWrap avgt 15 8438.268 ± 132.177 ns/op threadlocal avgt 15 8464.201 ± 161.554 ns/op wrapped avgt 15 8424.011 ± 134.407 ns/op baseline sample 253749 8197.106 ±16.157 ns/op staticWrap sample 251910 8233.368 ±14.841 ns/op threadlocalsample 244075 8540.737 ± 106.365 ns/op wrappedsample 247083 8443.597 ±99.892 ns/op baseline ss 15 49466.667 ± 8039.477 ns staticWrap ss 15 44466.667 ± 7697.095 ns threadlocalss 15 52266.667 ± 11679.329 ns wrappedss 15 6.000 ± 24000.352 ns {code} As a v2 I removed the meter, and switched to using currentmiilis instead of nanotime (~1/3 speed). I switched the normal ThreadPool one to just do everything in overriden execute. It wasn't as simple for the SEPExecutor. Since I didn't want to break the separation between the JMX SEP and the parent SEP by collecting metrics in the SEPWorker so I think keeping the threadlocal approach with a before/after execute callback is a little cleaner (match the ThreadExecutor api). With [those changes|https://gist.github.com/clohfink/b51eb027c55008377d93]: {code} BenchmarkMode Samples ScoreError Units baseline avgt 158052.932 ± 92.517 ns/op staticWrap avgt 158313.957 ±295.964 ns/op threadlocal avgt 158553.304 ±189.656 ns/op wrapped avgt 158354.060 ±169.359 ns/op baseline sample 2528128209.038 ± 14.917 ns/op staticWrap sample 2486488356.810 ± 15.974 ns/op threadlocalsample 2473628388.620 ± 17.310 ns/op wrappedsample 2497848338.419 ± 68.509 ns/op baseline ss 15 44933.333 ± 8305.190 ns staticWrap ss 15 50533.333 ± 13136.009 ns threadlocalss 15 303600.000 ± 964458.694 ns wrappedss 15 50066.667 ± 7300.839 ns {code} Problem with using ms precision is that a lot of the time it takes less then a ms to do a task so it ends up being very below the actual value. I was hesitant to use getCurrentThreadCpuTime as a measure, but its a ns collection and much more meaningful then walltime. Kinda ideal so I [gave it a shot|https://gist.github.com/clohfink/50e22ab895a7a700661b] - its 1 little more then 1us cost per task execution: {code} BenchmarkMode Samples Score Error Units baseline avgt 15 8187.818 ± 162.270 ns/op staticWrap avgt 15 9585.818 ± 185.490 ns/op threadlocal avgt 15 9508.422 ± 166.103 ns/op wrapped avgt 15 9380.099 ± 142.239 ns/op baseline sample 251564 8259.852 ±30.677 ns/op staticWrap sample 221622 9365.621 ±19.831 ns/op threadlocalsample 218599 9521.815 ±33.865 ns/op wrappedsample 213108 9743.400 ±24.726 ns/op baseline ss 15 49333.333 ± 8663.130 ns staticWrap ss 15 45533.333 ± 9543.739 ns threadlocalss 15 56933.333 ± 12742.318 ns wrappedss 15 52800.000 ± 12001.537 ns {code} Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage wise in each stage. Additionally can be used with littles law in future if ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)