Sergey,

Did you make any progress?

I finally looked at the preprocessed C code and also enabled ftree-vectorizer-verbose output:

    CFLAGS := -save-temps -ftree-vectorize -ftree-vectorizer-verbose=2 $(CFLAGS_JDKLIB) $(LIBAWT_CFLAGS), \

I looked at the IntArgbPreSrcMaskFill hotspot (in my EllipseFillTest) according to oprofile:

    samples  %        image name  symbol name
    469141   30.0043  libawt.so   IntArgbPreSrcMaskFill

Here is the preprocessed C code:
- It is still complex to read as there are many do { } while (0) blocks due to macro expansion...

    void IntArgbSrcMaskFill (void *rasBase, jubyte *pMask, jint maskOff, jint maskScan,
                             jint width, jint height, jint fgColor,
                             SurfaceDataRasInfo *pRasInfo,
                             NativePrimitive *pPrim, CompositeInfo *pCompInfo)
    {
        jint srcA;
        jint srcR, srcG, srcB;
        jint rasScan = pRasInfo->scanStride;
        IntArgbDataType *pRas = (IntArgbDataType *) (rasBase);
        jint DstPix;

        do {
            (srcB) = (fgColor) & 0xff;
            (srcG) = ((fgColor) >> 8) & 0xff;
            (srcR) = ((fgColor) >> 16) & 0xff;
            (srcA) = ((fgColor) >> 24) & 0xff;
        } while (0);
        if (srcA == 0) {
            srcR = srcG = srcB = 0;
            fgColor = 0;
        } else {
            if (!(0)) {
                fgColor = (srcA << 24) | (fgColor & 0x00ffffff); ;
            }
            if (srcA != 0xff) {
                do {
                    srcR = mul8table[srcA][srcR];
                    srcG = mul8table[srcA][srcG];
                    srcB = mul8table[srcA][srcB];
                } while (0);
            }
            if (0) { ; }
        }
        DstPix = 0; ;
        rasScan -= width * 4;
        if (pMask) {
            pMask += maskOff;
            maskScan -= width;
            do {
                jint w = width; ;
                do {
                    jint resA;
                    jint resR, resG, resB;
                    jint dstF;
                    jint pathA = *pMask++;
                    if (pathA > 0) {
                        if (pathA == 0xff) {
                            (pRas)[0] = (fgColor);
                        } else {
                            ;
                            dstF = 0xff - pathA;
                            do {
                                DstPix = (pRas)[0];
                                resA = ((juint) DstPix) >> 24;
                            } while (0);
                            resA = mul8table[dstF][resA];
                            if (!(0)) {
                                dstF = resA;
                            }
                            resA += mul8table[pathA][srcA];
                            do {
                                resR = (DstPix >> 16) & 0xff;
                                resG = (DstPix >> 8) & 0xff;
                                resB = (DstPix >> 0) & 0xff;
                            } while (0);
                            do {
                                resR = mul8table[dstF][resR] + mul8table[pathA][srcR];
                                resG = mul8table[dstF][resG] + mul8table[pathA][srcG];
                                resB = mul8table[dstF][resB] + mul8table[pathA][srcB];
                            } while (0);
                            if (!(0) && resA && resA < 0xff) {
                                do {
                                    resR = div8table[resA][resR];
                                    resG = div8table[resA][resG];
                                    resB = div8table[resA][resB];
                                } while (0);
                            }
                            (pRas)[0] = (((((((resA) << 8) | (resR)) << 8) | (resG)) << 8) | (resB));
                        }
                    }
                    pRas = ((void *) (((intptr_t) (pRas)) + (4))); ;
                } while (--w > 0);
                pRas = ((void *) (((intptr_t) (pRas)) + (rasScan))); ;
                pMask = ((void *) (((intptr_t) (pMask)) + (maskScan)));
            } while (--height > 0);
        } else {
            do {
                jint w = width; ;
                do {
                    (pRas)[0] = (fgColor);
                    pRas = ((void *) (((intptr_t) (pRas)) + (4))); ;
                } while (--w > 0);
                pRas = ((void *) (((intptr_t) (pRas)) + (rasScan))); ;
            } while (--height > 0);
        }
    }

It seems that alpha blending macros are quite complex and can not be vectorized:

Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad inner-loop form.
IntArgb.c:109: note: not vectorized: Bad inner loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: failed: evolution of base is not affine.
IntArgb.c:109: note: bad data references.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: Unknown misalignment, is_packed = 0
IntArgb.c:109: note: virtual phi. skip.
IntArgb.c:109: note: not vectorized: value used after loop.
IntArgb.c:109: note: bad operation or unsupported loop bound.
IntArgb.c:109: note: vectorized 0 loops in function.
IntArgb.c:109: note: not consecutive access rasScan_26 = pRasInfo_25(D)->scanStride;
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _40 = mul8table[srcA_36][srcB_33];
IntArgb.c:109: note: not consecutive access _42 = mul8table[srcA_36][srcB_31];
IntArgb.c:109: note: not consecutive access _44 = mul8table[srcA_36][srcB_29];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *pMask_1
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _65 = mul8table[dstF_60][resA_64];
IntArgb.c:109: note: not consecutive access _67 = mul8table[pathA_58][srcA_36];
IntArgb.c:109: note: not consecutive access _75 = mul8table[dstF_66][resR_71];
IntArgb.c:109: note: not consecutive access _77 = mul8table[pathA_58][srcB_6];
IntArgb.c:109: note: not consecutive access _80 = mul8table[dstF_66][resG_73];
IntArgb.c:109: note: not consecutive access _82 = mul8table[pathA_58][srcB_7];
IntArgb.c:109: note: not consecutive access _85 = mul8table[dstF_66][resB_74];
IntArgb.c:109: note: not consecutive access _87 = mul8table[pathA_58][srcB_8];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: Unknown alignment for access: div8table
IntArgb.c:109: note: not consecutive access _93 = div8table[resA_69][resR_79];
IntArgb.c:109: note: not consecutive access _95 = div8table[resA_69][resG_84];
IntArgb.c:109: note: not consecutive access _97 = div8table[resA_69][resB_89];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.

Any ideas on how to make such code faster, or how to make it work with vectorization?

Finally, I noticed that the macros with the Lcd suffix seem to perform proper gamma correction:

    void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef *glyphs,
                                 jint totalGlyphs, jint fgpixel, jint argbcolor,
                                 jint clipLeft, jint clipTop, jint clipRight, jint clipBottom,
                                 jint rgbOrder,
                                 unsigned char *gammaLut, unsigned char * invGammaLut,
                                 NativePrimitive *pPrim, CompositeInfo *pCompInfo)
    ...
    srcR = invGammaLut[srcR];
    srcG = invGammaLut[srcG];
    srcB = invGammaLut[srcB];
    ... alpha blending ...
    dstR = gammaLut[dstR];
    dstG = gammaLut[dstG];
    dstB = gammaLut[dstB];

That's exactly what I need in order to implement correct gamma correction in mask fill operations (shape draw / fill) for the software loops (buffered image rendering).

I will now try to figure out how that C code is generated by the nested macros!

Laurent
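
P.S. To make the goal more concrete, here is a rough sketch of what a gamma-corrected mask blend could look like for a single colour channel on an opaque destination. It is only an illustration and not the actual loop code: the names inv_gamma_lut, gamma_lut, init_luts, mul8 and blend_channel_gamma are hypothetical, the tables are filled for an assumed gamma of 2.0 using integer math only, and mul8 assumes that mul8table[a][b] approximates round(a * b / 255).

```c
#include <stdint.h>

/* Hypothetical tables standing in for invGammaLut / gammaLut, which the
 * real LCD loops receive from their caller. */
static uint8_t inv_gamma_lut[256]; /* encoded value -> linear value */
static uint8_t gamma_lut[256];     /* linear value  -> encoded value */

/* Fill both tables for an ASSUMED gamma of 2.0, using integers only. */
static void init_luts(void)
{
    for (int i = 0; i < 256; i++) {
        /* linear = round(i * i / 255) */
        inv_gamma_lut[i] = (uint8_t) ((i * i + 127) / 255);
    }
    for (int v = 0; v < 256; v++) {
        int n = v * 255;
        int r = 0;
        while ((r + 1) * (r + 1) <= n) {
            r++;                        /* r = floor(sqrt(n)) */
        }
        if (n - r * r > (r + 1) * (r + 1) - n) {
            r++;                        /* round to the nearest integer */
        }
        gamma_lut[v] = (uint8_t) r;
    }
}

/* Arithmetic stand-in for a mul8table lookup: round(a * b / 255),
 * valid for a and b in 0..255. */
static unsigned mul8(unsigned a, unsigned b)
{
    unsigned t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Blend one channel with mask coverage pathA: convert both inputs to
 * linear space, blend there, then convert the result back. */
static uint8_t blend_channel_gamma(uint8_t src, uint8_t dst, uint8_t pathA)
{
    unsigned s = inv_gamma_lut[src];
    unsigned d = inv_gamma_lut[dst];
    unsigned lin = mul8(pathA, s) + mul8(0xffu - pathA, d);
    return gamma_lut[lin];
}
```

After calling init_luts() once, blend_channel_gamma(255, 0, 128) yields a value brighter than the 128 that the plain blend would store for a half-covered white pixel on black, which is the visible effect of the correction. The real tables and gamma value would of course have to come from the existing LCD code path.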