I agree that exposing the raw execute method is a bad idea, both for the reason David mentions and because of the foot-gun problem: there are a lot of ways that calling “execute” can cause you to overwrite files, and we really shouldn’t expose an arbitrary file overwrite feature on purpose if we can avoid it.
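One way to avoid handing callers the raw `execute` string is to build the command server-side from a small allow-list. The sketch below is illustrative only: `ProfileCommands`, its allow-list, and the file naming are hypothetical, and the command strings assume the comma-separated option style (`start`, `stop`, `event=`, `interval=`, `file=`, `collapsed`) used for async-profiler agent arguments, which should be checked against the version in use. It only builds strings and does not call the profiler.

```java
import java.nio.file.Path;
import java.time.Duration;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper: builds profiler command strings from validated inputs only. */
final class ProfileCommands {
    // Events a caller may request; anything else is rejected.
    private static final Set<String> ALLOWED_EVENTS = Set.of("cpu", "wall");
    // Session ids may not contain separators, so they cannot alter the path
    // or inject extra comma-separated options into the command.
    private static final Pattern SAFE_ID = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    private ProfileCommands() {}

    /** Builds a start command; the interval is emitted in nanoseconds (assumed default unit). */
    static String start(String event, Duration interval) {
        if (!ALLOWED_EVENTS.contains(event)) {
            throw new IllegalArgumentException("unsupported event: " + event);
        }
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return "start,event=" + event + ",interval=" + interval.toNanos();
    }

    /** Builds a stop command whose output file always lives under a server-chosen directory. */
    static String stop(Path outputDir, String sessionId) {
        if (!SAFE_ID.matcher(sessionId).matches()) {
            throw new IllegalArgumentException("unsafe session id");
        }
        Path out = outputDir.resolve("profile-" + sessionId + ".collapsed");
        return "stop,collapsed,file=" + out;
    }

    public static void main(String[] args) {
        System.out.println(start("cpu", Duration.ofMillis(10)));
        System.out.println(stop(Path.of("/tmp/profiles"), "abc123"));
    }
}
```

The returned string is what would be passed on to the profiler. Because the output path is derived from a server-chosen directory and a validated id, a caller cannot point `file=` at an arbitrary location or add options through an embedded comma.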
Looking forward to seeing what Yaman comes back with after doing some additional research.

Doug

> On Jun 17, 2025, at 7:45 PM, David Capwell <dcapw...@apple.com> wrote:
>
> I am in favor of the project adopting it as a library.
>
> My automation is very outdated, so what I am saying may be a legacy thing… so
> w/e is the “new” way is what we should promote…. I rely a lot on the
> collapsed format and wish to migrate to the JFR format so I can collect CPU /
> Memory at the same time; it would be great for us to expose this as a
> promoted ability (curl cassandra/profile -o result.jfr). One issue I see with
> exposing the raw “execute” method is that it ties our API to the tool's API,
> so any breaking changes there break our API; I am not against this, but it is
> something to consider.
>
> As Scott has pointed out, there have been stability issues, so we should be
> able to dynamically flag the feature off.
>
>> On Jun 16, 2025, at 9:26 AM, Jaydeep Chovatia <chovatia.jayd...@gmail.com>
>> wrote:
>>
>> > Previous experiences (good or bad)
>> I have been using async-profiler in my project for quite some time to
>> profile the CPU. Additionally, I have wrapped it with an HTTP interface,
>> allowing one to open a browser and view the CPU flame graph in real-time,
>> which further simplifies the process.
>> It is integrated as a library, and my preference is to include it as a
>> library, rather than forking processes.
>>
>> Jaydeep
>>
>> On Sat, Jun 14, 2025 at 8:14 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>> I have seen cases where specific async-profiler/JVM/Cassandra version
>>>> combos (JDK11/4.1-derived source tree) will immediately crash the JVM on
>>>> profile - especially successive profile invocations on the same process
>>> This would be a great candidate for testing to ensure that, at least for
>>> provided profiles, this doesn't happen.
>>>
>>> On Fri, Jun 13, 2025, at 10:41 PM, C. Scott Andreas wrote:
>>>> Supportive of inclusion as well. General preference for invoking as a
>>>> library rather than forking processes.
>>>>
>>>> Jon, thanks for the tips on off-CPU profiling - added to my personal cheat
>>>> sheet.
>>>>
>>>> I have seen cases where specific async-profiler/JVM/Cassandra version
>>>> combos (JDK11/4.1-derived source tree) will immediately crash the JVM on
>>>> profile - especially successive profile invocations on the same process -
>>>> but have not observed this on JDK21 or trunk-derived source trees. If we
>>>> have user reports of that happening, we’ll need to figure out how to
>>>> reproduce and get to the bottom of it.
>>>>
>>>> – Scott
>>>>
>>>> > On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
>>>> >
>>>> > Thanks for bringing up this discussion, Doug. I didn't realize that
>>>> > async-profiler allows you to bring it as a dependency. It looks pretty
>>>> > neat from what I could tell. I also think bringing this to Cassandra as
>>>> > a dependency is a reasonable approach. We need to come up with a solid
>>>> > way to expose this via JMX / vtable.
>>>> >
>>>> > Best,
>>>> > - Francisco
>>>> >
>>>> >> On 2025/06/13 21:08:28 Doug Rohrer wrote:
>>>> >> The nice thing from what I can tell about using the Java API per [6]
>>>> >> below is that you can literally just get an instance of the profiler
>>>> >> and pass it some commands in the `execute` method… just need to be
>>>> >> careful how much of that surface area we expose. Jon (and others
>>>> >> obviously) I’d love to get your take on how we could make a useful
>>>> >> interface to the async-profiler, maybe exposed via JMX, that doesn’t
>>>> >> require someone to read the entirety of the async-profiler docs and
>>>> >> provides some useful profiles without the rough edges (things like
>>>> >> managing temp files so users don’t have to know the layout of the
>>>> >> filesystem C* is running on, for example, since at least in the Sidecar
>>>> >> we’d be executing this on behalf of a remote user, with all of the
>>>> >> constraints that implies).
>>>> >>
>>>> >> We can always be more protective in the Sidecar than we are server-side
>>>> >> as well, but it seems like helping operators not do bad things is a
>>>> >> good thing.
>>>> >>
>>>> >> Obviously we’d want the ability Cassandra-side to disable this
>>>> >> functionality altogether however we implement it.
>>>> >>
>>>> >> Doug
>>>> >>
>>>> >>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>> >>>
>>>> >>> I'd be very happy to see async-profiler included with C*. I've made
>>>> >>> extensive use of it in my performance evaluations [1][2], and even
>>>> >>> posted a video about it [3] for general Java perf analysis (among
>>>> >>> others). It's part of easy-cass-lab and is easily the most
>>>> >>> informative tool I've found for getting to the bottom of anything
>>>> >>> performance related.
>>>> >>>
>>>> >>> There's probably a good case to be made for including it with the C*
>>>> >>> artifact as well as having it be something you can drop in. I lean
>>>> >>> towards including it all the time, but I haven't run it this way
>>>> >>> myself yet, so there might be some downside I'm unaware of.
>>>> >>>
>>>> >>> When you call the asprof executable, it attaches the async-profiler to
>>>> >>> the running jvm using jattach [4]. We could do this as well, if we
>>>> >>> wanted to avoid including it with the release, but I don't know how
>>>> >>> much we really benefit from that. I've run into issues with it when
>>>> >>> it's unable to detach correctly, then you're unable to reattach it
>>>> >>> until after the server is restarted. On the flip side, I don't know
>>>> >>> if you're able to set up all the same options for arbitrary profiling
>>>> >>> when it's loaded as an agent and turned on/off dynamically. I think
>>>> >>> we can, based on the integration page [6], but I haven't tried it yet.
>>>> >>> It would be a bummer if we only had a single mode of profiling
>>>> >>> available.
>>>> >>>
>>>> >>> The default mode, CPU profiling, is fantastic, but I've also made
>>>> >>> extensive use of allocation profiling [5] to identify perf issues as
>>>> >>> well so having that available is a must, imo. Wall clock / off cpu
>>>> >>> profiling is great for identifying when IO is the root cause, which
>>>> >>> isn't clearly revealed by on-cpu profiling due to the way threads are
>>>> >>> scheduled. When I look at a system I typically do CPU / Wall / Alloc
>>>> >>> / Off-CPU to be thorough, and the last thing you want to do is have to
>>>> >>> restart between each one. You can also specify specific Java methods,
>>>> >>> include or exclude frames matching specific regex, and a whole slew of
>>>> >>> other options. The latest version even supports continuous profiling
>>>> >>> with heatmaps although I haven't tried it yet.
>>>> >>>
>>>> >>> So hopefully the option we go with allows all of that, otherwise the
>>>> >>> limits would impose more of a headache to me as I'd need to remove it
>>>> >>> and continue to bring my own.
>>>> >>>
>>>> >>> Under the hood, the async-profiler uses Linux perf events +
>>>> >>> asynchronous polling of the java stack to match them up and generate
>>>> >>> its reports. As a result, it requires certain permissions to run and
>>>> >>> get all the details I like. Specifically these kernel parameters:
>>>> >>>
>>>> >>> sudo sysctl kernel.perf_event_paranoid=1
>>>> >>> sudo sysctl kernel.kptr_restrict=0
>>>> >>>
>>>> >>> You also need to enable some capabilities for off-cpu profiling:
>>>> >>>
>>>> >>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
>>>> >>>
>>>> >>> Then you can do off-cpu with this wild cryptic version (shout out to
>>>> >>> Andrei Pangin for helping me with this [7]):
>>>> >>>
>>>> >>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
>>>> >>>
>>>> >>> There are also some subtle issues when it's run in a container, since by
>>>> >>> default you don't have access to the perf_event_open syscall. Just
>>>> >>> something to keep in mind. This is one of my main grievances with
>>>> >>> container deployments.
>>>> >>>
>>>> >>> Indeed Patrick, I am very happy to see this discussion! Thanks Doug
>>>> >>> for starting the thread.
>>>> >>>
>>>> >>> Jon
>>>> >>>
>>>> >>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>> >>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
>>>> >>> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
>>>> >>> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
>>>> >>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
>>>> >>> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
>>>> >>> [7] https://github.com/async-profiler/async-profiler/issues/907
>>>> >>>
>>>> >>>
>>>> >>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> >>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
>>>> >>>>
>>>> >>>> After reading more about the project, the possibilities are pretty
>>>> >>>> interesting. I suspect we'll see this in a Haddad talk soon.
>>>> >>>>
>>>> >>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>> >>>>> I was curious if o3 (model from OpenAI) would be able to do a deep
>>>> >>>>> dive health check on a repo to assist in considering taking it as a
>>>> >>>>> dependency. The results can be found here:
>>>> >>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
>>>> >>>>>
>>>> >>>>> Apparently it can, and can do it quite well. This was a useful time
>>>> >>>>> saver (and honestly did a better job than I usually can in > 10x the
>>>> >>>>> time)
>>>> >>>>>
>>>> >>>>> I'm +1 to taking this as a dependency on the lib in core C*. The
>>>> >>>>> rest of the ecosystem can consume it (more easily if we move to a
>>>> >>>>> cassandra-shared regime shared library build as well), and it opens
>>>> >>>>> up some interesting opportunities for us in both how we test core C*
>>>> >>>>> proper and what we expose in tooling.
>>>> >>>>>
>>>> >>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
>>>> >>>>>> I'd prefer to avoid calling an external process and use the library
>>>> >>>>>> if possible. Not sure about including it in the project by default,
>>>> >>>>>> but also not against.
>>>> >>>>>>
>>>> >>>>>> If there's contention about including it, I wonder if it would make
>>>> >>>>>> sense to explore java's optional module extension[1] to make this
>>>> >>>>>> available optionally? I can see this being useful for other
>>>> >>>>>> extensions if we haven't explored that option.
>>>> >>>>>>
>>>> >>>>>> Then we could have another project cassandra-sidecar-extensions (or
>>>> >>>>>> similar) that would be linked by sidecar/advanced operators to
>>>> >>>>>> enable extended featureset in the main process.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> [1] - https://openjdk.org/projects/jigsaw/doc/topics/optional.html
>>>> >>>>>>
>>>> >>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
>>>> >>>>>> Hey folks!
>>>> >>>>>>
>>>> >>>>>> We're looking into enabling the sidecar to collect async profiles
>>>> >>>>>> from Cassandra and, digging through the async-profiler code and
>>>> >>>>>> usage, it seems like there may be a few different ways to do it.
>>>> >>>>>> I’m curious if other folks have already done this beyond just “run
>>>> >>>>>> asprof with the pid of the Cassandra process”, as I’m a bit
>>>> >>>>>> hesitant to depend on executing an external process from the
>>>> >>>>>> Sidecar to gather the actual profile if we can avoid it.
>>>> >>>>>>
>>>> >>>>>> There seem to be some opportunities to integrate the profiler into
>>>> >>>>>> another project (see
>>>> >>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
>>>> >>>>>> but it seems this would end up having to be part of Cassandra, and
>>>> >>>>>> somehow callable via the sidecar (JMX? Some virtual table interface
>>>> >>>>>> where you insert a row to start a profile with the profiler
>>>> >>>>>> options, and it kicks off the profile, dumping the results into the
>>>> >>>>>> table when it’s done?).
>>>> >>>>>>
>>>> >>>>>> The benefit in putting this functionality into Cassandra would be
>>>> >>>>>> that other consumers (in-jvm dtests, python dtests, other
>>>> >>>>>> monitoring systems where Sidecar isn’t available, easy-cass-lab)
>>>> >>>>>> would be able to leverage the same interface rather than having to
>>>> >>>>>> re-invent the wheel each time.
>>>> >>>>>>
>>>> >>>>>> Drawback is it’s another library, and one with native library
>>>> >>>>>> dependencies, added to the class path and loaded at runtime.
>>>> >>>>>>
>>>> >>>>>> Thoughts? Previous experiences (good or bad)?
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>>
>>>> >>>>>> Doug
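
Two operational requirements recur in the replies above: being able to switch the feature off at runtime, and keeping profiling to one session at a time. A minimal sketch of a gate that enforces both is below. `ProfilerGate` and its `Backend` interface are hypothetical names, and the backend is a stand-in for whatever ends up driving the profiler, not the async-profiler API.

```java
/** Hypothetical gate: a runtime kill switch plus at most one profiling session at a time. */
final class ProfilerGate {
    /** Stand-in for the component that actually starts and stops the profiler. */
    interface Backend {
        void start();
        void stop();
    }

    private final Backend backend;
    private boolean enabled;
    private boolean running;

    ProfilerGate(Backend backend, boolean enabled) {
        this.backend = backend;
        this.enabled = enabled;
    }

    /** Flips the feature on or off without a restart. */
    synchronized void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Returns true if a session was started, false if one is already running. */
    synchronized boolean tryStart() {
        if (!enabled) {
            throw new IllegalStateException("profiling is disabled");
        }
        if (running) {
            return false;
        }
        backend.start(); // if this throws, running stays false
        running = true;
        return true;
    }

    /** Stops a running session; permitted even when disabled so a session can be cleaned up. */
    synchronized boolean tryStop() {
        if (!running) {
            return false;
        }
        running = false;
        backend.stop();
        return true;
    }

    public static void main(String[] args) {
        ProfilerGate gate = new ProfilerGate(new Backend() {
            @Override public void start() { System.out.println("backend started"); }
            @Override public void stop() { System.out.println("backend stopped"); }
        }, true);
        System.out.println("first start accepted: " + gate.tryStart());
        System.out.println("second start accepted: " + gate.tryStart());
        System.out.println("stop accepted: " + gate.tryStop());
    }
}
```

Serializing start and stop under one lock keeps the sketch simple; a JMX operation or virtual-table write could delegate to `tryStart` and `tryStop` without exposing the backend directly.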