Jim, I would like your point of view on the new algorithms to store alpha values efficiently and some advices on heuristics / metrics to make the adaptive approach more efficient / robust.
I hope having some spare time soon to spend on improving this patch... Here are few more explanations that may help you: 1. I figured out RLE encoding is only faster if run lengths are important (high repeats). It corresponds to few crossings but large pixel spans. For small shapes or highly complex ones, it is better to leave alpha values uncompressed. >From this assumption, I adopted an adaptive approach based on a simple heuristics: see MarlinCache.useRLE // Hack to switch RLE encoding ON or OFF: // sparse density: useRLE = (width >= 128) && (((maxy - miny) * width) >= (primCount << 5)); // larger than 64x64 It only enables RLE if the shape width is larger than 128 pixels and also it is not too complex: width > (32 * primitive count) / height Of course, it is very rough as primitive count / height is ~ mean(crossings) ! If you have better idea to determine the approximated crossing density per pixel, please tell me. 2. Moreover, I implemented a new tricky approach = block flags where 1 means the block has crossings, 0 none. It helps to traverse & test less pixels ie only blocks with flag = 1 are really useful (others always are full of zero) It let me use efficient Arrays.fill (noRLE variant) or larger run lengths (RLE variant) as nothing is varying in 0 flagged blocks. Of course other heuristics are needed as it only provides gains if blocks can be skipped (pixel steps > 32) and to reduce the overhead due to flagging blocks in the scanline processing loop. 2.1: endRendering() called once per shape: block flags can be enabled either based on the previous useRLE heuristics (=cache.useRLE) or if width > 64 (at least 2 tiles) so maybe 1 tile may be empty ? // Heuristics for using block flags: if (ENABLE_TILE_FLAGS) { if (ENABLE_TILE_FLAGS_ONLY_RLE) { enableTileFlags = this.cache.useRLE; } else { enableTileFlags = ((pmaxX - pminX) >= 64); // 64 // TODO: check numCrossings (later) ? } } 2.2 endRendering() per scanline (quick test): If enableTileFlags flag is enabled before, at every scanline, I quickly check if the pixel width is really > 64 and few crossings => a better probability to have large steps. // fast condition: useBlkFlags &= (numCrossings <= 10) && (pix_maxX - pix_minX) >= 512; // 64px ie 3 tiles Ideas are welcome to refine such metrics. Quick comments below: > > Sorry it took me so long to get around to this... Hope you have some time soon to help me with that [re|over]view > > MarlinConst.java, line 86 - "+2 explain"? Sorry, cleanup needed. In copyAARowNoRLE_WithTileFlags, I use the trick to fill the rowAAChunk array by 32 multiples (tiles) so SuperWord optimization rocks ! However, I am adding +1 or +2 to indices that may exceed the array length. To be checked. > > MarlinProperties.java - indentation on && continuations should be 4 spaces and/or line up the operands (as in: > return isEnableRLE() && > isSomethingElse...; > OR > return isEnableRLE() > && isSomethingElse...; > ) To be fixed. > > Renderer.java - edgeNewSize - why did this use to be long and why downgrade to int now? To be fixed. However, edgebuckets is storing indices as integer so the offheap array is only still accessible up to 2M even if it can store a lot more. > > Renderer.java - tosubpix() - did removing static help? Not really, sorry. > > Renderer.java, line 1057,1163 - for the case of producing -1 or 0 from the sign bit, you could use: > ((err >> 31) << 1) I tried both solutions, but none is faster. > > Renderer.java, line 1288,1350 - don't you need to mark tile for (pix_x+1) in case it is a new tile? No need, as I always traverse pixels [px0; px1] where px1 = (t << TILE_SIZE_LG) + 1 to ensure including the last pixel. > > Renderer.java, line 1308,1370 - don't you need to mark all tiles from x0 to x1? No wanted: I only want to flag tiles that contain alpha variations: alpha is constant between ]x0+1, x1[. > > TransformingPC2D - why make all the inner classes private? I'm wary of private for inner classes because sometimes it forces the compiler to insert internal accessor methods for methods which are "semantically accessible" according to the rules of inner classes, but the standard inter-class access was marked private. Sorry, to be fixed. > > I'm still reviewing the new RLE stuff, but wanted to get this much out there for now. thanks for the first pass, I left many typos / attempts during my that intensive working session. >> >> Here is the first webrev improving copyAARow() on large shapes (pixel >> loops): >> http://cr.openjdk.java.net/~lbourges/marlin/marlin-s4.0/ >> >> I advocate it is not yet completly ready (cleanup, log statement) but I >> wanted to show the new algorithm & variants: >> copyAARow uses now 4 variants: >> - RLE encoding or uncompress alpha values >> - Both can use block flags to only process small touched pixel blocks >> (like tiles but only 1D) that boosts simple but large shapes ! > > > Whoops, I guess I reviewed the wrong stuff (all the glue rather than the RLE algorithms themselves - D'oh!). No problem, I am looking forward your comments on block traversal / RLE algorithms... > > >> Please give me your first comments (overview). >> I tested them with my regression tests and all variants are now OK. It seems using ONLY the variant [noRLE + tile flags] has still artefacts: to be fixed asap. >> >> >> Here are few results on my machine: > > > Unfortunately, the giant tables of numbers came through on my end as just a mish-mosh of digits in confusing columns. Can you summarize or try to express it as a smaller ascii table or a real HTML <TABLE>? Here is a summary showing only my ellipse draw / fill tests (radius = 1 to 2000): Marlin 0.7.0 on JDK1.8.60: Test Threads Ops Med *Pct95* Avg StdDev Min Max TotalOps [ms/op] EllipseTests-fill-false.ser 1 25 518.527 *519.683* 518.957 1.780 518.350 527.552 25 EllipseTests-fill-true.ser 1 25 910.986 * 911.630* 911.034 0.439 910.128 912.556 25 New patch Marlin 0.7.1: Best settings on JDK1.8.60: Test Threads Ops Med *Pct95* Avg StdDev Min Max TotalOps [ms/op] EllipseTests-fill-false.ser 1 35 299.068 *299.602 * 299.116 0.297 298.702 300.086 35 EllipseTests-fill-true.ser 1 25 434.568 *437.110* 434.871 0.875 434.375 437.897 25 OpenJDK9: Test Threads Ops Med *Pct95* Avg StdDev Min Max TotalOps [ms/op] EllipseTests-fill-false.ser 1 35 295.859 *296.245* 295.924 0.211 295.542 296.503 35 EllipseTests-fill-true.ser 1 25 491.937 *492.165* 491.936 0.193 491.662 492.591 25 Ductus on JDK1.8.60: Test Threads Ops Med *Pct95* Avg StdDev Min Max TotalOps [ms/op] EllipseTests-fill-false.ser 1 35 297.560 *299.328 * 297.480 1.093 295.417 299.590 35 EllipseTests-fill-true.ser 1 25 453.612 *456.290 * 453.589 1.813 448.936 456.817 25 Conclusion: The new patch seems promising as it is very close to ductus performance. Filling ellipse seems slower on OpenJDK9 (492 / 437 = 12% slower) ! Any MaskFill changes ? Regards, Laurent