I am in favor of the project adopting it as a library. My automation is very outdated, so what I am describing may be a legacy thing; whatever the “new” way is, that is what we should promote. I rely a lot on the collapsed format and wish to migrate to the JFR format so I can collect CPU / Memory at the same time; it would be great for us to expose this as a promoted ability (curl cassandra/profile -o result.jfr). One issue I see with exposing the raw “execute” method is that it ties our API to the tool's API, so any breaking changes there break our API; I am not against this, but it is something to consider.
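To make the coupling concern concrete, here is a minimal sketch of a narrower surface, not a proposal for the actual API. Everything named in it (`ProfileCommands`, `Mode`, `Format`, `CommandSink`, `setEnabled`) is hypothetical. The option strings (`start`, `event=`, `alloc`, `jfr`, `collapsed`, `file=`) follow async-profiler's agent-argument syntax as I understand it and should be checked against the version actually bundled. The rule that combined modes need JFR output is an assumption taken from the migration motive above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical facade: callers pick from an allow-list instead of passing raw commands. */
final class ProfileCommands {
    /** Profiling modes this sketch agrees to expose. */
    enum Mode { CPU, ALLOC }

    /** Output formats mentioned in the thread. */
    enum Format { COLLAPSED, JFR }

    /**
     * Stand-in for the profiler. In a real integration this could be
     * AsyncProfiler.getInstance()::execute from the async-profiler Java API.
     */
    @FunctionalInterface
    interface CommandSink {
        String execute(String command) throws IOException;
    }

    /** Runtime switch so operators can flag the feature off. */
    private static volatile boolean enabled = true;

    static void setEnabled(boolean value) {
        enabled = value;
    }

    /** Builds the start command; the exact option strings are an assumption to verify. */
    static String start(Set<Mode> modes, Format format, String file) {
        if (!enabled) throw new IllegalStateException("profiling is disabled");
        if (modes.isEmpty()) throw new IllegalArgumentException("no profiling mode requested");
        // Assumption: profiling several modes at once requires JFR output.
        if (modes.size() > 1 && format != Format.JFR)
            throw new IllegalArgumentException("combined modes need JFR output");
        // Options are comma-separated, so a comma in the path could smuggle in extra options.
        if (file.isEmpty() || file.contains(","))
            throw new IllegalArgumentException("bad output file: " + file);

        StringJoiner cmd = new StringJoiner(",");
        cmd.add("start");
        if (modes.contains(Mode.CPU)) {
            cmd.add("event=cpu");
            if (modes.contains(Mode.ALLOC)) cmd.add("alloc");
        } else {
            cmd.add("event=alloc");
        }
        cmd.add(format.name().toLowerCase(Locale.ROOT));
        cmd.add("file=" + file);
        return cmd.toString();
    }

    /** Sends the validated command to the sink, wrapping I/O failures as unchecked. */
    static String run(CommandSink sink, Set<Mode> modes, Format format, String file) {
        try {
            return sink.execute(start(modes, format, file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a shape like this, a breaking change in the tool's command syntax would be absorbed inside `start` rather than leaking to every caller, at the cost of not exposing every option the tool supports.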
As Scott has pointed out, there have been stability issues, so we should be able to dynamically flag the feature off.

> On Jun 16, 2025, at 9:26 AM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>
> > Previous experiences (good or bad)
> I have been using an async-profiler in my project for quite some time to
> profile the CPU. Additionally, I have wrapped it with an HTTP interface,
> allowing one to open a browser and view the CPU flame graph in real-time,
> which further simplifies the process.
> It is integrated as a library, and my preference is to include it as a
> library, rather than forking processes.
>
> Jaydeep
>
> On Sat, Jun 14, 2025 at 8:14 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> I have seen cases where specific async-profiler/JVM/Cassandra version
>>> combos (JDK11/4.1-derived source tree) will immediately crash the JVM on
>>> profile - especially successive profile invocations on the same process
>> This would be a great candidate for testing to ensure that, at least for
>> provided profiles, this doesn't happen.
>>
>> On Fri, Jun 13, 2025, at 10:41 PM, C. Scott Andreas wrote:
>>> Supportive of inclusion as well. General preference for invoking as a
>>> library rather than forking processes.
>>>
>>> Jon, thanks for the tips on off-CPU profiling - added to my personal cheat
>>> sheet.
>>>
>>> I have seen cases where specific async-profiler/JVM/Cassandra version
>>> combos (JDK11/4.1-derived source tree) will immediately crash the JVM on
>>> profile - especially successive profile invocations on the same process -
>>> but have not observed this on JDK21 or trunk-derived source trees. If we
>>> have user reports of that happening, we’ll need to figure out how to
>>> reproduce and get to the bottom of it.
>>>
>>> – Scott
>>>
>>> > On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
>>> >
>>> > Thanks for bringing this discussion Doug.
>>> > I didn't realize that async-profiler allows you to bring it as a
>>> > dependency. It looks pretty neat from what I could tell. I also think
>>> > bringing this to Cassandra as a dependency is a reasonable approach.
>>> > We need to come up with a solid way to expose this via JMX / vtable.
>>> >
>>> > Best,
>>> > - Francisco
>>> >
>>> >> On 2025/06/13 21:08:28 Doug Rohrer wrote:
>>> >> The nice thing from what I can tell about using the Java API per [6]
>>> >> below is that you can literally just get an instance of the profiler and
>>> >> pass it some commands in the `execute` method… just need to be careful
>>> >> how much of that surface area we expose. Jon (and others obviously) I’d
>>> >> love to get your take on how we could make a useful interface to the
>>> >> async-profiler, maybe exposed via JMX, that doesn’t require someone to
>>> >> read the entirety of the async-profiler docs and provides some useful
>>> >> profiles without the rough edges (things like managing temp files so
>>> >> users don’t have to know the layout of the filesystem C* is running on,
>>> >> for example, since at least in the Sidecar we’d be executing this on
>>> >> behalf of a remote user, with all of the constraints that implies).
>>> >>
>>> >> We can always be more protective in the Sidecar than we are server-side
>>> >> as well, but it seems like helping operators not do bad things is a good
>>> >> thing.
>>> >>
>>> >> Obviously we’d want the ability Cassandra-side to disable this
>>> >> functionality altogether however we implement it.
>>> >>
>>> >> Doug
>>> >>
>>> >>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> >>>
>>> >>> I'd be very happy to see async-profiler included with C*. I've made
>>> >>> extensive use of it in my performance evaluations [1][2], and even
>>> >>> posted a video about it [3] for general Java perf analysis (among
>>> >>> others).
>>> >>> It's part of easy-cass-lab and is easily the most informative
>>> >>> tool I've found for getting to the bottom of anything performance
>>> >>> related.
>>> >>>
>>> >>> There's probably a good case to be made for including it with the C*
>>> >>> artifact as well as having it be something you can drop in. I lean
>>> >>> towards including it all the time, but I haven't run it this way myself
>>> >>> yet, so there might be some downside I'm unaware of.
>>> >>>
>>> >>> When you call the asprof executable, it attaches the async-profiler to
>>> >>> the running JVM using jattach [4]. We could do this as well, if we
>>> >>> wanted to avoid including it with the release, but I don't know how
>>> >>> much we really benefit from that. I've run into issues with it when
>>> >>> it's unable to detach correctly, then you're unable to reattach it
>>> >>> until after the server is restarted. On the flip side, I don't know if
>>> >>> you're able to set up all the same options for arbitrary profiling when
>>> >>> it's loaded as an agent and turned on/off dynamically. I think we can,
>>> >>> based on the integration page [6], but I haven't tried it yet. It
>>> >>> would be a bummer if we only had a single mode of profiling available.
>>> >>>
>>> >>> The default mode, CPU profiling, is fantastic, but I've also made
>>> >>> extensive use of allocation profiling [5] to identify perf issues as
>>> >>> well so having that available is a must, imo. Wall clock / off cpu
>>> >>> profiling is great for identifying when IO is the root cause, which
>>> >>> isn't clearly revealed by on-cpu profiling due to the way threads are
>>> >>> scheduled. When I look at a system I typically do CPU / Wall / Alloc /
>>> >>> Off-CPU to be thorough, and the last thing you want to do is have to
>>> >>> restart between each one. You can also specify specific Java methods,
>>> >>> include or exclude frames matching specific regex, and a whole slew of
>>> >>> other options.
>>> >>> The latest version even supports continuous profiling
>>> >>> with heatmaps although I haven't tried it yet.
>>> >>>
>>> >>> So hopefully the option we go with allows all of that, otherwise the
>>> >>> limits would impose more of a headache to me as I'd need to remove it
>>> >>> and continue to bring my own.
>>> >>>
>>> >>> Under the hood, the async-profiler uses Linux perf events +
>>> >>> asynchronous polling of the Java stack to match them up and generate
>>> >>> its reports. As a result, it requires certain permissions to run and
>>> >>> get all the details I like. Specifically these kernel parameters:
>>> >>>
>>> >>> sudo sysctl kernel.perf_event_paranoid=1
>>> >>> sudo sysctl kernel.kptr_restrict=0
>>> >>>
>>> >>> You also need to enable some capabilities for off-cpu profiling:
>>> >>>
>>> >>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
>>> >>>
>>> >>> Then you can do off-cpu with this wild cryptic version (shout out to
>>> >>> Andrei Pangin for helping me with this [7]):
>>> >>>
>>> >>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
>>> >>>
>>> >>> There's also some subtle issues when it's run in a container, since by
>>> >>> default you don't have access to the perf_event_open syscall. Just
>>> >>> something to keep in mind. This is one of my main grievances with
>>> >>> container deployments.
>>> >>>
>>> >>> Indeed Patrick, I am very happy to see this discussion! Thanks Doug
>>> >>> for starting the thread.
>>> >>>
>>> >>> Jon
>>> >>>
>>> >>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>> >>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
>>> >>> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
>>> >>> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
>>> >>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
>>> >>> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
>>> >>> [7] https://github.com/async-profiler/async-profiler/issues/907
>>> >>>
>>> >>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>> >>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
>>> >>>>
>>> >>>> After reading more about the project, the possibilities are pretty
>>> >>>> interesting. I suspect we'll see this in a Haddad talk soon.
>>> >>>>
>>> >>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> >>>>> I was curious if o3 (model from OpenAI) would be able to do a deep
>>> >>>>> dive health check on a repo to assist in considering taking it as a
>>> >>>>> dependency. The results can be found here:
>>> >>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
>>> >>>>>
>>> >>>>> Apparently it can, and can do it quite well. This was a useful time
>>> >>>>> saver (and honestly did a better job than I usually can in > 10x the
>>> >>>>> time)
>>> >>>>>
>>> >>>>> I'm +1 to taking this as a dependency on the lib in core C*.
>>> >>>>> The rest
>>> >>>>> of the ecosystem can consume it (more easily if we move to a
>>> >>>>> cassandra-shared regime shared library build as well), and it opens
>>> >>>>> up some interesting opportunities for us in both how we test core C*
>>> >>>>> proper and what we expose in tooling.
>>> >>>>>
>>> >>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
>>> >>>>>> I'd prefer to avoid calling an external process and use the library
>>> >>>>>> if possible. Not sure about including it in the project by default,
>>> >>>>>> but also not against.
>>> >>>>>>
>>> >>>>>> If there's contention about including it, I wonder if it would make
>>> >>>>>> sense to explore java's optional module extension[1] to make this
>>> >>>>>> available optionally? I can see this being useful for other
>>> >>>>>> extensions if we haven't explored that option.
>>> >>>>>>
>>> >>>>>> Then we could have another project cassandra-sidecar-extensions (or
>>> >>>>>> similar) that would be linked by sidecar/advanced operators to
>>> >>>>>> enable extended featureset in the main process.
>>> >>>>>>
>>> >>>>>> [1] - https://openjdk.org/projects/jigsaw/doc/topics/optional.html
>>> >>>>>>
>>> >>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
>>> >>>>>> Hey folks!
>>> >>>>>>
>>> >>>>>> We're looking into enabling the sidecar to collect async profiles
>>> >>>>>> from Cassandra and, digging through the async-profiler code and
>>> >>>>>> usage, it seems like there may be a few different ways to do it. I’m
>>> >>>>>> curious if other folks have already done this beyond just “run
>>> >>>>>> asprof with the pid of the Cassandra process”, as I’m a bit hesitant
>>> >>>>>> to depend on executing an external process from the Sidecar to
>>> >>>>>> gather the actual profile if we can avoid it.
>>> >>>>>>
>>> >>>>>> There seem to be some opportunities to integrate the profiler into
>>> >>>>>> another project (see
>>> >>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
>>> >>>>>> but it seems this would end up having to be part of Cassandra, and
>>> >>>>>> somehow callable via the sidecar (JMX? Some virtual table interface
>>> >>>>>> where you insert a row to start a profile with the profiler options,
>>> >>>>>> and it kicks off the profile, dumping the results into the table
>>> >>>>>> when it’s done?).
>>> >>>>>>
>>> >>>>>> The benefit in putting this functionality into Cassandra would be
>>> >>>>>> that other consumers (in-jvm dtests, python dtests, other monitoring
>>> >>>>>> systems where Sidecar isn’t available, easy-cass-lab) would be able
>>> >>>>>> to leverage the same interface rather than having to re-invent the
>>> >>>>>> wheel each time.
>>> >>>>>>
>>> >>>>>> Drawback is it’s another library, and one with native library
>>> >>>>>> dependencies, added to the class path and loaded at runtime.
>>> >>>>>>
>>> >>>>>> Thoughts? Previous experiences (good or bad)?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>> Doug