Re: [OpenJDK Rasterizer] Marlin #4

Laurent Bourgès Mon, 21 Sep 2015 14:17:08 -0700

Jim,

I would like your point of view on the new algorithms to store alpha values
efficiently and some advices on heuristics / metrics to make the adaptive
approach more efficient / robust.


I hope having some spare time soon to spend on improving this patch...

Here are few more explanations that may help you:

1. I figured out RLE encoding is only faster if run lengths are important
(high repeats). It corresponds to few crossings but large pixel spans. For
small shapes or highly complex ones, it is better to leave alpha values
uncompressed.

>From this assumption, I adopted an adaptive approach based on a simple
heuristics: see MarlinCache.useRLE

// Hack to switch RLE encoding ON or OFF:
                // sparse density:
                useRLE = (width >= 128)
                    && (((maxy - miny) * width) >= (primCount << 5)); //
larger than 64x64

It only enables RLE if the shape width is larger than 128 pixels and also
it is not too complex: width > (32 * primitive count) / height

Of course, it is very rough as primitive count / height is ~
mean(crossings) !

If you have better idea to determine the approximated crossing density per
pixel, please tell me.


2. Moreover, I implemented a new tricky approach = block flags where 1
means the block has crossings, 0 none.

It helps to traverse & test less pixels ie only blocks with flag = 1 are
really useful (others always are full of zero)

It let me use efficient Arrays.fill (noRLE variant) or larger run lengths
(RLE variant) as nothing is varying in 0 flagged blocks.

Of course other heuristics are needed as it only provides gains if blocks
can be skipped (pixel steps > 32) and to reduce the overhead due to
flagging blocks in the scanline processing loop.

2.1: endRendering() called once per shape:

block flags can be enabled either based on the previous useRLE heuristics
(=cache.useRLE) or if width > 64 (at least 2 tiles) so maybe 1 tile may be
empty ?

        // Heuristics for using block flags:
        if (ENABLE_TILE_FLAGS) {
            if (ENABLE_TILE_FLAGS_ONLY_RLE) {
                enableTileFlags = this.cache.useRLE;
            } else {
                enableTileFlags = ((pmaxX - pminX) >= 64); // 64
                // TODO: check numCrossings (later) ?
            }
        }

2.2 endRendering() per scanline (quick test):

If enableTileFlags flag is enabled before, at every scanline, I quickly
check if the pixel width is really > 64 and few crossings => a better
probability to have large steps.

                        // fast condition:
                        useBlkFlags &= (numCrossings <= 10)
                            && (pix_maxX - pix_minX) >= 512; // 64px ie 3
tiles
Ideas are welcome to refine such metrics.


Quick comments below:

>
> Sorry it took me so long to get around to this...

Hope you have some time soon to help me with that [re|over]view

>
> MarlinConst.java, line 86 - "+2 explain"?

Sorry, cleanup needed.

In copyAARowNoRLE_WithTileFlags, I use the trick to fill the rowAAChunk
array by 32 multiples (tiles) so SuperWord optimization rocks ! However, I
am adding +1 or +2 to indices that may exceed the array length. To be
checked.

>
> MarlinProperties.java - indentation on && continuations should be 4
spaces and/or line up the operands (as in:
>     return isEnableRLE() &&
>            isSomethingElse...;
> OR
>     return isEnableRLE()
>         && isSomethingElse...;
> )

To be fixed.

>
> Renderer.java - edgeNewSize - why did this use to be long and why
downgrade to int now?

To be fixed.

However, edgebuckets is storing indices as integer so the offheap array is
only still accessible up to 2M even if it can store a lot more.

>
> Renderer.java - tosubpix() - did removing static help?

Not really, sorry.

>
> Renderer.java, line 1057,1163 - for the case of producing -1 or 0 from
the sign bit, you could use:
>      ((err >> 31) << 1)

I tried both solutions, but none is faster.

>
> Renderer.java, line 1288,1350 - don't you need to mark tile for (pix_x+1)
in case it is a new tile?

No need, as I always traverse pixels [px0; px1] where px1 = (t <<
TILE_SIZE_LG) + 1 to ensure including the last pixel.

>
> Renderer.java, line 1308,1370 - don't you need to mark all tiles from x0
to x1?

No wanted: I only want to flag tiles that contain alpha variations: alpha
is constant between ]x0+1,  x1[.

>
> TransformingPC2D - why make all the inner classes private?  I'm wary of
private for inner classes because sometimes it forces the compiler to
insert internal accessor methods for methods which are "semantically
accessible" according to the rules of inner classes, but the standard
inter-class access was marked private.

Sorry, to be fixed.

>
> I'm still reviewing the new RLE stuff, but wanted to get this much out
there for now.

thanks for the first pass, I left many typos / attempts during my that
intensive working session.

>>
>> Here is the first webrev improving copyAARow() on large shapes (pixel
>> loops):
>> http://cr.openjdk.java.net/~lbourges/marlin/marlin-s4.0/
>>
>> I advocate it is not yet completly ready (cleanup, log statement) but I
>> wanted to show the new algorithm & variants:
>> copyAARow uses now 4 variants:
>> - RLE encoding or uncompress alpha values
>> - Both can use block flags to only process small touched pixel blocks
>> (like tiles but only 1D) that boosts simple but large shapes !
>
>
> Whoops, I guess I reviewed the wrong stuff (all the glue rather than the
RLE algorithms themselves - D'oh!).

No problem, I am looking forward your comments on block traversal / RLE
algorithms...

>
>
>> Please give me your first comments (overview).
>>   I tested them with my regression tests and all variants are now OK.

It seems using ONLY the variant [noRLE + tile flags] has still artefacts:
to be fixed asap.

>>
>>
>> Here are few results on my machine:
>
>
> Unfortunately, the giant tables of numbers came through on my end as just
a mish-mosh of digits in confusing columns.  Can you summarize or try to
express it as a smaller ascii table or a real HTML <TABLE>?

Here is a summary showing only my ellipse draw / fill tests (radius = 1 to
2000):

Marlin 0.7.0 on JDK1.8.60:
Test                                             Threads    Ops    Med
*Pct95*    Avg    StdDev    Min    Max    TotalOps    [ms/op]
EllipseTests-fill-false.ser                      1    25    518.527
*519.683*    518.957    1.780    518.350    527.552    25
EllipseTests-fill-true.ser                       1    25    910.986   *
911.630*    911.034    0.439    910.128    912.556    25
New patch Marlin 0.7.1:
Best settings on JDK1.8.60:
Test                                             Threads    Ops    Med
*Pct95*    Avg    StdDev    Min    Max    TotalOps    [ms/op]
EllipseTests-fill-false.ser                      1    35    299.068
*299.602 *   299.116    0.297    298.702    300.086    35
EllipseTests-fill-true.ser                       1    25    434.568
*437.110*    434.871    0.875    434.375    437.897    25

OpenJDK9:
Test                                             Threads    Ops    Med
*Pct95*    Avg    StdDev    Min    Max    TotalOps    [ms/op]
EllipseTests-fill-false.ser                      1    35    295.859
*296.245*    295.924    0.211    295.542    296.503    35
EllipseTests-fill-true.ser                       1    25    491.937
*492.165*    491.936    0.193    491.662    492.591    25

Ductus on JDK1.8.60:
Test                                             Threads    Ops    Med
*Pct95*    Avg    StdDev    Min    Max    TotalOps    [ms/op]
EllipseTests-fill-false.ser                      1    35    297.560
*299.328 *   297.480    1.093    295.417    299.590    35
EllipseTests-fill-true.ser                       1    25    453.612
*456.290  *  453.589    1.813    448.936    456.817    25
Conclusion:
The new patch seems promising as it is very close to ductus performance.
Filling ellipse seems slower on OpenJDK9 (492 / 437 = 12% slower) ! Any
MaskFill changes ?

Regards,
Laurent

Re: [OpenJDK Rasterizer] Marlin #4

Reply via email to