Re: [OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

Jim Graham Mon, 01 Jul 2013 19:57:31 -0700

Hi Laurent,
On 5/29/13 2:01 AM, Laurent Bourgès wrote:

Two comments:
- hotspot takes some time to optimize the very large
Renderer.endRendering() and ScalineIterator.next() methods: probably
because these methods represents too many byte codes (many loops ...)
- I tried merging them but it seems that it becomes slower (too big
method ? or hotspot produce less optimized code according to its
heuristics ... warmup issue ?)
- Do you think it could help hotspot if such methods are split in
smaller ones ?

Perhaps it is another nudge to consider a native port where we couldstatically compile these critical methods with a high degree ofoptimization?

- Many float operations are used to compute crossings that could be
replaced by DDA (integer ops) to improve performance ...

Are you referring to fixed point implementations? Or something moreprecise? I'm worried about error accumulation with fixed pointdifferencing across a large number of pixels.

Remaining items:
- clipping issues (dasher emits many segments even out of bounds)
- rounding issues: float to int conversions (biais or not)

Finally I propose to focus on optimizing these 2 algorithms (methods) as
it represents 55% of cpu time:
- optimize / modify algorithms: integer instead of float ops, reduce
number of condition statements (if) to improve branch prediction ...
- or / and port these methods into C code (JNI)

Porting to C might help with a lot of the integer conversions thathappen when we try to stuff "all data into floats".

        I think that the ScanLineIterator class is no more useful and
        could be
        merged into Renderer directly: I try to optimize these 2 code paths
        (crossing / crossing -> alpha) but it seems quite difficult as I
        must
        understand hotspot optimizations (assembler code)...


    I was originally skeptical about breaking that part of the process
    into an iterator class as well.


I tried to merge these methods into Renderer.endRendering() but
benchmark results seems worse: probably hotspot generates less optimized
code ...


D'oh!

        For now I want to keep pisces in Java code as hotspot is efficient
        enough and probably the algorithm can be reworked a bit;
        few questions:
        - should edges be sorted by Ymax ONCE to avoid complete edges
        traversal
        to count crossings for each Y value:

           156             if ((bucketcount & 0x1) != 0) {
           157                 int newCount = 0;
           158                 for (int i = 0, ecur; i < count; i++) {
           159                     ecur = ptrs[i];
        *  160                     if (_edgesInt[ecur + YMAX] > cury) {
        *  161                         ptrs[newCount++] = ecur;

           162                     }
           163                 }
           164                 count = newCount;
           165             }


    This does not traverse all edges, just the edges currently "in play"
    and it only does it for scanlines that had a recorded ymax on them
    (count is multiplied by 2 and then optionally the last bit is set if
    some edge ends on that scanline so we know whether or not to do the
    "remove expired edges" processing).


Agreed, this algorithm is the "well know" edge eviction from active edge
list => move it into dedicated method ?

That would depend on the "hotspot thinks this method is too big" issuethat we don't really fully understand, but it sounds like a good pointto investigate.

        - why multiply x2 and divide /2 the crossings (+ rounding issues) ?

           202             for (int i = 0, ecur, j; i < count; i++) {
           203                 ecur = ptrs[i];
           204                 curx = _edges[ecur /* + CURX */];
           205                 _edges[ecur /* + CURX */] = curx +
        _edges[ecur + SLOPE];
           206
        *  207                 cross = ((int) curx) << 1;
        *  208                 if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
           209                     cross |= 1;


    The line above sets the bottom bit if the crossing is one
    orientation vs. the other so we know whether to add one or subtract
    one to the winding count.  The crossings can then be sorted and the
    orientation flag is carried along with the values as they are
    sorted.  The cost of this trick is having to shift the actual
    crossing coordinates by 1 to vacate the LSBit.


Two comments:
- float ops used to compute the x crossing => use DDA (integer ops ?)

Provided you can keep precision high enough to avoid errors. Also,sometimes offloading some of your calculations to a float unit keepsadded parallelism in the code and moving the instructions back to intcan actually slow it down.

- store the orientation flag into another byte array to avoid shift
operations << 1 and >> 1 (later in the code) but the insertion sort
should then be smart to provide a sorted index mapping ...

You then have added memory accesses and remember that we sort thesevalues so you'd have to move the flags along with the crossings. AC-struct that had an int and a flag would avoid the shifts while makingit reasonable to sort just one array at the cost of storage complexityand wasted memory for the flag to keep the structs aligned.

        Finally, it seems that hotspot settings (CompileThreshold=1000 and
        -XX:aggressiveopts) are able to compile theses hotspots better ...


    What about if we use the default settings as would most non-server
    apps ?


As my linux is 64 bits, I can only run benchmark with the server
compiler (hotspot c2) ... I could also test with / without tiered
compilation (between client and server compiler) ...


We have to understand the performance on a variety of compilers.

        Thanks; probably the edgeBucket / edgeBucketCount arrays could
        be merged
        into a single one to improve cache affinity.


    Interesting.


It could be interesting to evaluate the cpu cache affinity related to
all these arrays: maybe 2D arrays could be packed into 1D arrays to
improve cache efficiency; idem for edge "struct":
it is better to keep int / float arrays instead of having an Edge object
(with instance pool) ?

I'm a bit lost with that, but I think we could only know by prototypingit. Also, perhaps we could store floats into int arrays usingtoFloat/IntBits() faster than we could cast to integers? Note thatthose methods for conversion to intbits are likely handled internally incompiled code as simple "casts" (where the compiled code doesn't reallydo anything to "cast" a value).

        FYI, I can write C/C++ code but I never practised JNI code.
        Does somebody could help us to port only these 2 hotspot methods ?


    Port 2 Hotspot methods?  I'm not sure what you are referring to here?


I mean implement the 'hotspot' methods i.e. implement the rendering
algorithms in C code using JNI to access Renderer fields.
Such C code could be compiled using optimizations (gcc O2) but I have
doubts that it will be faster than optimized code produced by hotspot c2
compiler ...

I tried for JavaFX and it was a grueling process and I ended up with abastardized half-baked port. I would recommend doing it right by simplyimplementing C code using the Java code as a guide.


                        ...jim

Re: [OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

Reply via email to