I created https://bugs.openjdk.java.net/browse/JDK-8189752 to track this.
Mandy
On 10/20/17 10:35 AM, Rafael Winterhalter wrote:
Hi Mandy,
thank you for your quick reply.
The benchmark is already capturing only the StackFrame objects. I ran
the benchmark with the transformation as a comparison, and it is indeed
slower due to the extra work. The JIT code suggests that the native
method plus the monitor are the major cost here, but it is of course
necessary to have it there to fetch the stack information.
Without this transformation in the user thread, capturing the StackFrame
objects of 100 frames is, however, still about 2.5 times as expensive as
constructing a throwable as a snapshot.
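Roughly, the two snapshot operations I am comparing look like the
following (only a sketch; the actual benchmark setup is in the Gist
linked in my original mail):

    import java.lang.StackWalker.StackFrame;
    import java.util.List;
    import java.util.stream.Collectors;

    public class SnapshotCost {

        private static final StackWalker WALKER = StackWalker.getInstance();

        // Eagerly materializes all StackFrame objects in the user thread.
        static List<StackFrame> stackFrameSnapshot() {
            return WALKER.walk(frames -> frames.collect(Collectors.toList()));
        }

        // Only records the native backtrace; the frames can be resolved later.
        static Throwable throwableSnapshot() {
            return new Throwable();
        }
    }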
Just so I am not misunderstood: the total cost is reduced by using the
stack walker, even compared to using JavaLangAccess (shared secrets)
under Java 8, but the inability to defer the stack processing to a
worker thread makes it inapplicable for my purposes due to latency
requirements.
Thank you for considering this use case and all your great work!
Cheers, Rafael
On 20.10.2017 18:11, "mandy chung" <[email protected]> wrote:
Hi Rafael,
Thanks for the feedback. We did some investigation to understand
the overhead of Throwable if it used the StackWalker API [1]. It
did come to mind whether the StackWalker API should provide a way
to walk the backtrace, which we should investigate under
JDK-8141239.
The benchmark compares capturing 35 StackTraceElements for an
apples-to-apples comparison, which is fair. I am curious about the
performance difference when you capture only StackFrame objects.
This would save the overhead of constructing StackTraceElement
objects (and their associated string objects).
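For example, something along these lines (just a sketch; the limit
of 35 frames is taken from your description):

    import java.lang.StackWalker.StackFrame;
    import java.util.List;
    import java.util.stream.Collectors;

    public class CaptureVariants {

        private static final StackWalker WALKER = StackWalker.getInstance();

        // Capturing StackTraceElement objects, as in the current benchmark.
        static List<StackTraceElement> captureElements() {
            return WALKER.walk(frames -> frames.limit(35)
                    .map(StackFrame::toStackTraceElement)
                    .collect(Collectors.toList()));
        }

        // Capturing only StackFrame objects, which skips the construction of
        // StackTraceElement objects and their associated strings.
        static List<StackFrame> captureFrames() {
            return WALKER.walk(frames -> frames.limit(35)
                    .collect(Collectors.toList()));
        }
    }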
Mandy
[1] https://bugs.openjdk.java.net/browse/JDK-8141239
On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
Hello,
a typical pattern when reading the stack of the current thread in
tooling such as performance monitoring used to be to create an
instance of Throwable and to process this instance's attached stack
in another thread. The performance cost is split roughly 10/90
between creating a new throwable and reading its frames, so this is
really a worthwhile optimization. It is also common to use the
JavaLangAccess API, which offers selective access to single frames.
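Roughly, this old pattern looks like the following (only a sketch;
the executor is merely illustrative):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class DeferredStackReading {

        private static final ExecutorService WORKER = Executors.newSingleThreadExecutor();

        static void record() {
            // Cheap in the user thread: only the native backtrace is captured.
            Throwable snapshot = new Throwable();
            // The expensive part, materializing the StackTraceElement objects,
            // is moved to a worker thread.
            WORKER.submit(() -> process(snapshot.getStackTrace()));
        }

        static void process(StackTraceElement[] frames) {
            // e.g. aggregate, sample or report the frames
        }
    }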
This API no longer exists, as it was superseded by the StackWalker
API, which is of course much safer and even a more performant
alternative when looking at the total cost. However, using a stack
walker, it is no longer possible to move the stack processing out
of the user thread; it must be done at the moment the snapshot of
the stack is taken. It turns out that this increases latency
dramatically when processing stacks, compared to the asynchronous
alternative.
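In other words, the frames have to be materialized while walking,
because the frame stream is closed when walk returns and cannot be
handed to another thread (a sketch):

    import java.lang.StackWalker.StackFrame;
    import java.util.List;
    import java.util.stream.Collectors;

    public class EagerWalk {

        private static final StackWalker WALKER = StackWalker.getInstance();

        static List<StackFrame> capture() {
            // The function passed to walk runs in the calling thread; the
            // frames are therefore collected eagerly, at the moment the
            // snapshot is taken.
            return WALKER.walk(frames -> frames.limit(35).collect(Collectors.toList()));
        }
    }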
In a quick benchmark, walking 35 frames of a 100-frame stack allows
me about 70k operations per second, whereas creating a new
throwable yields about 200k operations per second. Also, within a
less isolated test, I can infer this additional overhead from the
actual latency numbers of a web service when using the StackWalker
API to extract the top 35 frames, compared to the "old" solution
using JavaLangAccess.
For this reason, it currently seems best to avoid the stack walker
when aiming for low latency, if the stack is not required
immediately and if processing resources are available in other
threads.
I would therefore like to propose extending the StackWalker API to
allow walking the stack of an existing throwable, to allow for
performance similar to JavaLangAccess. I understand that the VM
must do more work altogether: when receiving the full stack from a
throwable, this takes about three times as long. In practice, for a
product I am involved in, this causes a noticeable overhead when
running a Java 9 VM compared to Java 8.
Alternatively, it would of course be even better if one could take
a snapshot of only the top x frames and walk over that object
rather than a throwable.
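To illustrate the shape of what I have in mind (purely
hypothetical; none of these members exist in the current API):

    import java.lang.StackWalker.StackFrame;
    import java.util.function.Function;
    import java.util.stream.Stream;

    // Hypothetical API shape, only meant to illustrate the proposal.
    public interface ProposedStackWalking {

        // Walk the backtrace that an existing throwable has already captured,
        // so that the processing can happen outside of the user thread.
        <T> T walk(Throwable snapshot,
                   Function<? super Stream<StackFrame>, ? extends T> function);

        // Alternatively, capture a lightweight snapshot of only the top
        // maxDepth frames that can be walked later instead of a throwable.
        StackSnapshot capture(int maxDepth);

        // Hypothetical placeholder type for such a snapshot.
        interface StackSnapshot {
            <T> T walk(Function<? super Stream<StackFrame>, ? extends T> function);
        }
    }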
I have added my benchmarks (snapshot for the current user thread
operation, complete for the entire processing) to this Gist:
https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
Thank you for any feedback on my findings!
Best regards, Rafael