Hello, a typical pattern in tooling such as performance monitoring, when reading the stack of the current thread, used to be to create an instance of Throwable and to process the stack attached to that instance in another thread. The cost splits roughly 10/90 between creating the throwable and reading its frames, so moving the frame processing out of the user thread is a worthwhile optimization.
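To illustrate, here is a minimal sketch of that pattern. The class name, the method name and the single-threaded executor are made up for illustration and are not taken from any product. It relies on the stack trace elements of a throwable only being materialized when getStackTrace() is first called, which is how I understand HotSpot to behave.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncStackCapture {

    // Hypothetical background worker; a real tool would likely use its own pool.
    private static final ExecutorService WORKER = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "stack-processor");
        thread.setDaemon(true); // do not keep the JVM alive for this worker
        return thread;
    });

    static CompletableFuture<StackTraceElement[]> captureAsync() {
        // User thread: only the snapshot is taken here.
        Throwable snapshot = new Throwable();
        // Worker thread: the frames are read from the snapshot here.
        return CompletableFuture.supplyAsync(snapshot::getStackTrace, WORKER);
    }

    public static void main(String[] args) {
        StackTraceElement[] frames = captureAsync().join();
        System.out.println(frames.length + " frames, top frame: " + frames[0]);
    }
}
```

The user thread only pays for the Throwable constructor; everything after that runs on the worker.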
It is also common to use the JavaLangAccess API, which offers selective access to single frames. This API no longer exists, as it was superseded by the Stack Walker API, which is of course much safer and, looking at the total cost, even the more performant alternative. However, with a stack walker it is no longer possible to move the stack processing out of the user thread; it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically compared to the asynchronous alternative.

In a quick benchmark, walking 35 frames of a 100-frame stack gives me about 70k operations per second, whereas creating a new throwable yields about 200k operations per second. In a less isolated test, I can also infer this additional overhead from the actual latency numbers of a web service when using the Stack Walker API to extract the top 35 frames, compared to the "old" solution using JavaLangAccess. For this reason, when aiming for latency, the best solution at the moment seems to be to avoid the stack walker if the stack is not required immediately and if spare resources are available in other threads.

I would therefore like to propose extending the Stack Walker API to allow walking the stack of an existing throwable, to allow for performance similar to that of JavaLangAccess. I understand that the VM must do more work in total this way. Retrieving the full stack from a throwable takes about three times as long. In practice, for a product I am involved in, this causes a noticeable overhead when running on a Java 9 VM compared to Java 8. Alternatively, it would of course be even better if one could take a snapshot of only the top x frames and walk that object rather than a throwable.
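For comparison, a minimal sketch of the synchronous variant with the existing Stack Walker API follows. The class and method names are again made up for illustration, and the recursion only serves to build a stack roughly 100 frames deep like the one in my benchmark. Everything inside walk(...) runs in the calling thread, which is the cost I am describing above.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TopFramesWalk {

    // Collects at most `limit` frames, starting at this method's own frame.
    // The frames must be consumed inside walk(...), i.e. in the calling thread.
    static List<StackWalker.StackFrame> topFrames(int limit) {
        return StackWalker.getInstance()
                .walk(frames -> frames.limit(limit).collect(Collectors.toList()));
    }

    // Hypothetical helper that deepens the stack before walking it.
    static List<StackWalker.StackFrame> atDepth(int depth, int limit) {
        return depth == 0 ? topFrames(limit) : atDepth(depth - 1, limit);
    }

    public static void main(String[] args) {
        List<StackWalker.StackFrame> frames = atDepth(100, 35);
        System.out.println(frames.size() + " frames, top frame: " + frames.get(0));
    }
}
```

There is currently no way to hand the stream, or a throwable, to another thread and walk it there, which is what the proposal is about.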
I have added my benchmarks ("snapshot" for the part that runs in the current user thread, "complete" for the entire processing) to this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70

Thank you for any feedback on my findings!

Best regards, Rafael
