I created https://bugs.openjdk.java.net/browse/JDK-8189752 to track this.

Mandy

On 10/20/17 10:35 AM, Rafael Winterhalter wrote:
Hei Mandy,
thank you for your quick reply.

The benchmark is already only capturing the stack frame object. I ran the benchmark with the transformation as a comparison and it is indeed slower due to extra work. The JIT code suggests the native method plus monitor to be the major cost here but it is of course necessary to have it there to fetch the stack information.

Without this transformation in the user  thread, capturing stack frame objects of 100 frames is however still about 2.5 times as expensive as constructing a throwable as a snapshot.

Just to not be misunderstood: the total cost is reduced by using the stack walker, even compared to using java lang secrets under Java 8 but the inability to process the stack outside of a worker thread makes it inapplicable for my purposes due to latency requirements.

Thank you for considering this use case and all your great work!

Cheers, Rafael


Am 20.10.2017 18:11 schrieb "mandy chung" <[email protected] <mailto:[email protected]>>:

    Hi Rafael,

    Thanks for the feedback.  We did some investigation in
    understanding the overhead of Throwable if it used StackWalker API
    [1].  It did come to mind whether the StackWalker API should
    provide a way to walk the backtrace which we should do the
    investigation with JDK-8141239.

    The benchmark compares capturing 35 StackTraceElements for
    apple-to-apple comparison which is fair.  I am curious on the perf
    difference when you capture only StackFrame objects?  This would
    save the overhead to construct StackTraceElement objects (and its
    associated string objects).

    Mandy
    [1] https://bugs.openjdk.java.net/browse/JDK-8141239
    <https://bugs.openjdk.java.net/browse/JDK-8141239>


    On 10/20/17 6:32 AM, Rafael Winterhalter wrote:

        Hello,

        a typical patern when reading the stack of the current thread
        in tooling
        like performance monitoring used to imply the creation of an
        instance of
        Throwable and to process this instance's attached stack in
        another thread.
        The performance cost is shared about 10/90 for creating a new
        throwable
        compared to reading its frames, so this is really a worthy
        optimization.

        It is also common to use the JavaLangAccess API which offers
        selective
        access of single frames.

        This API does no longer exist as it was superseeded by the
        Stack Walker API
        which is of course much safer and even a more performant
        alternative when
        looking at the total performance. However, using a stack
        walker, it is no
        longer possible to move the stack processing out of the user
        thread but it
        must be done at the moment the snapshot of the stack is taken.
        It turns out
        that this increases latency dramatically when processing
        stacks compared to
        the asyncronous alternative.

        In a quick benchmark, it seems like walking 35 frames of a 100
        frames stack
        allows me 70k operations per second whereas creating a new
        throwable yields
        about 200k operations per second. Also, within a less isolated
        test, I can
        infer this additional overhead from the actual latency numbers
        of a web
        service when using the stack walker API to extract the top 35
        frames
        compared to the "old" solution using JavaLangAccess.

        For this reason, it seems to be the best solution to avoid the
        stack walker
        when aiming for latency at the moment if the stack is not required
        immediately and if access resources are available in other
        threads.

        I would therefore like to propose to extend the stack walker
        API to allow
        walking the stack of an existing throwable to allow for
        similar performance
        as with JavaLangAccess. I understand that the VM must do more work
        altogether. When receving the full stack from a throwable,
        this takes about
        three times as long. In practice, for a product I am involved
        in, this
        casues a noticable overhead when running a Java 9 VM compared
        to Java 8.

        Alternatively, it would of course even be better if one could
        take a
        snapshot of only the top x frames to walk on this object
        rather then a
        throwable.

        I have added my benchmarks (snapshot for the current user
        thread operation,
        complete for the entire processing) into this Gist:
        https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
        <https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70>

        Thank you for any feedback on my finding!

        Best regards, Rafael




Reply via email to