Hi Laurent,
On 5/29/13 2:01 AM, Laurent Bourgès wrote:
Two comments:
- hotspot takes some time to optimize the very large
Renderer.endRendering() and ScalineIterator.next() methods: probably
because these methods represents too many byte codes (many loops ...)
- I tried merging them but it seems that it becomes slower (too big
method ? or hotspot produce less optimized code according to its
heuristics ... warmup issue ?)
- Do you think it could help hotspot if such methods are split in
smaller ones ?
Perhaps it is another nudge to consider a native port where we could
statically compile these critical methods with a high degree of
optimization?
- Many float operations are used to compute crossings that could be
replaced by DDA (integer ops) to improve performance ...
Are you referring to fixed point implementations? Or something more
precise? I'm worried about error accumulation with fixed point
differencing across a large number of pixels.
Remaining items:
- clipping issues (dasher emits many segments even out of bounds)
- rounding issues: float to int conversions (biais or not)
Finally I propose to focus on optimizing these 2 algorithms (methods) as
it represents 55% of cpu time:
- optimize / modify algorithms: integer instead of float ops, reduce
number of condition statements (if) to improve branch prediction ...
- or / and port these methods into C code (JNI)
Porting to C might help with a lot of the integer conversions that
happen when we try to stuff "all data into floats".
I think that the ScanLineIterator class is no more useful and
could be
merged into Renderer directly: I try to optimize these 2 code paths
(crossing / crossing -> alpha) but it seems quite difficult as I
must
understand hotspot optimizations (assembler code)...
I was originally skeptical about breaking that part of the process
into an iterator class as well.
I tried to merge these methods into Renderer.endRendering() but
benchmark results seems worse: probably hotspot generates less optimized
code ...
D'oh!
For now I want to keep pisces in Java code as hotspot is efficient
enough and probably the algorithm can be reworked a bit;
few questions:
- should edges be sorted by Ymax ONCE to avoid complete edges
traversal
to count crossings for each Y value:
156 if ((bucketcount & 0x1) != 0) {
157 int newCount = 0;
158 for (int i = 0, ecur; i < count; i++) {
159 ecur = ptrs[i];
* 160 if (_edgesInt[ecur + YMAX] > cury) {
* 161 ptrs[newCount++] = ecur;
162 }
163 }
164 count = newCount;
165 }
This does not traverse all edges, just the edges currently "in play"
and it only does it for scanlines that had a recorded ymax on them
(count is multiplied by 2 and then optionally the last bit is set if
some edge ends on that scanline so we know whether or not to do the
"remove expired edges" processing).
Agreed, this algorithm is the "well know" edge eviction from active edge
list => move it into dedicated method ?
That would depend on the "hotspot thinks this method is too big" issue
that we don't really fully understand, but it sounds like a good point
to investigate.
- why multiply x2 and divide /2 the crossings (+ rounding issues) ?
202 for (int i = 0, ecur, j; i < count; i++) {
203 ecur = ptrs[i];
204 curx = _edges[ecur /* + CURX */];
205 _edges[ecur /* + CURX */] = curx +
_edges[ecur + SLOPE];
206
* 207 cross = ((int) curx) << 1;
* 208 if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
209 cross |= 1;
The line above sets the bottom bit if the crossing is one
orientation vs. the other so we know whether to add one or subtract
one to the winding count. The crossings can then be sorted and the
orientation flag is carried along with the values as they are
sorted. The cost of this trick is having to shift the actual
crossing coordinates by 1 to vacate the LSBit.
Two comments:
- float ops used to compute the x crossing => use DDA (integer ops ?)
Provided you can keep precision high enough to avoid errors. Also,
sometimes offloading some of your calculations to a float unit keeps
added parallelism in the code and moving the instructions back to int
can actually slow it down.
- store the orientation flag into another byte array to avoid shift
operations << 1 and >> 1 (later in the code) but the insertion sort
should then be smart to provide a sorted index mapping ...
You then have added memory accesses and remember that we sort these
values so you'd have to move the flags along with the crossings. A
C-struct that had an int and a flag would avoid the shifts while making
it reasonable to sort just one array at the cost of storage complexity
and wasted memory for the flag to keep the structs aligned.
Finally, it seems that hotspot settings (CompileThreshold=1000 and
-XX:aggressiveopts) are able to compile theses hotspots better ...
What about if we use the default settings as would most non-server
apps ?
As my linux is 64 bits, I can only run benchmark with the server
compiler (hotspot c2) ... I could also test with / without tiered
compilation (between client and server compiler) ...
We have to understand the performance on a variety of compilers.
Thanks; probably the edgeBucket / edgeBucketCount arrays could
be merged
into a single one to improve cache affinity.
Interesting.
It could be interesting to evaluate the cpu cache affinity related to
all these arrays: maybe 2D arrays could be packed into 1D arrays to
improve cache efficiency; idem for edge "struct":
it is better to keep int / float arrays instead of having an Edge object
(with instance pool) ?
I'm a bit lost with that, but I think we could only know by prototyping
it. Also, perhaps we could store floats into int arrays using
toFloat/IntBits() faster than we could cast to integers? Note that
those methods for conversion to intbits are likely handled internally in
compiled code as simple "casts" (where the compiled code doesn't really
do anything to "cast" a value).
FYI, I can write C/C++ code but I never practised JNI code.
Does somebody could help us to port only these 2 hotspot methods ?
Port 2 Hotspot methods? I'm not sure what you are referring to here?
I mean implement the 'hotspot' methods i.e. implement the rendering
algorithms in C code using JNI to access Renderer fields.
Such C code could be compiled using optimizations (gcc O2) but I have
doubts that it will be faster than optimized code produced by hotspot c2
compiler ...
I tried for JavaFX and it was a grueling process and I ended up with a
bastardized half-baked port. I would recommend doing it right by simply
implementing C code using the Java code as a guide.
...jim