Sergey,

>> So it looks like scalar operations on a vector (4), i.e. vectorization should be
>> applicable.
>
> Yes, I think so.

I googled a bit and it seems tricky to implement alpha blending with SSE2, but many projects succeeded by writing SSE2 primitives directly!

>> Maybe the conditions (pathA > 0) && (pathA < 0xff) are a bigger penalty
>> as they cannot be easily predicted (but may happen often).
>> Sometimes it is faster to perform useless math operations without
>> branching (GPU approach).
>>
>> Do you have other ideas to make it faster? It represents 30% of the
>> ellipse fill test (huge ellipses).
>> I noticed that larger tiles (64x64) are a bit faster (larger tile width
>> / height, fewer JNI calls)
>
> I just commented out some of the code inside this method and checked the
> performance. It seems that simple code like:
> inloop->readBytes->decodeRGB->encodeBytes->saveBytes
> is quite fast. But if some branches/multiplications are added after
> decodeRGB then the code becomes really slow (x10 slower on my system).
> This is expected because we perform a huge number of multiplications,
> but if I try to do the same math standalone (without byte decoding)
> then the result is also fast. So it seems that we are slow because of
> the mixing of byte reading/branches/multiplication.

For RGBA, it seems possible to:
- compute A+G and R+B together (2x16 bits) to double the throughput
- use bit shifts instead of mul / div

Could you try implementing such variants?

>> Should I try (as I did in the past) to implement the MaskFill in Java to
>> benefit from HotSpot optimizations (like Marlin)?
>
> It will be interesting. I remember that someone already tried to do the
> same, but I do not remember the result. Probably Jim can suggest something.

I implemented alpha blending in Java last year (using a custom composite operator hack):
http://mail.openjdk.java.net/pipermail/2d-dev/2014-August/004751.html

I could try optimizing my Java implementation soon...

Cheers,
Laurent
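
P.S. A rough sketch of the packed variant described above (R+B and A+G in two 16-bit lanes, shifts instead of the division by 255, no branch on the coverage value). It is plain Java; the names PackedBlend, blend and div255 are made up for illustration. It assumes a simple per-channel interpolation with a coverage value in 0..255, which is not necessarily the exact MaskFill SrcOver math:

```java
// Hypothetical helper, not taken from the JDK sources.
final class PackedBlend {
    // Selects bits 0-7 and 16-23, i.e. two 8-bit channels in 16-bit lanes.
    private static final int MASK = 0x00FF00FF;

    // Rounded division by 255 on both 16-bit lanes at once, using only
    // adds and shifts. Each lane must hold a value in [0, 255 * 255].
    static int div255(int packed) {
        int t = packed + 0x00800080;                    // +128 in each lane
        return ((t + ((t >>> 8) & MASK)) >>> 8) & MASK;
    }

    // Blends two ARGB pixels: result = (src * a + dst * (255 - a)) / 255
    // per channel, with a in [0, 255]. No branch on a: the cases a == 0
    // and a == 255 go through the same math.
    static int blend(int src, int dst, int a) {
        int ia = 255 - a;
        // R and B share one int, A and G share another (2 x 16 bits each).
        // Each lane stays <= 255 * 255, so nothing carries across lanes.
        int rb = div255((src & MASK) * a + (dst & MASK) * ia);
        int ag = div255(((src >>> 8) & MASK) * a + ((dst >>> 8) & MASK) * ia);
        return (ag << 8) | rb;
    }
}
```

For example, PackedBlend.blend(0xFFFFFFFF, 0x00000000, 128) gives 0x80808080. Whether this actually beats the scalar loop once HotSpot has compiled it would still have to be measured.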