Re: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options

Sergey Bylokhov Fri, 15 Jan 2016 13:56:52 -0800

Hi,

I found that, when vectorisation is enabled, one of the main hotspots is the table lookup pattern (mul8table/div8table), which cannot be vectorized. Another hotspot is the large number of conditions inside the main loops.
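
To illustrate the lookup problem: one possible way around a table such as mul8table is to compute the rounded product arithmetically, so that the loop body contains only multiplies, adds and shifts. The sketch below is an assumption, not code from libawt; mul8_arith and mul8_ref are hypothetical names, and the results may not match mul8table entry for entry, which would need checking against the table initialisation code.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of libawt): rounded (a * b) / 255
 * computed with a multiply, two adds and two shifts, no table.
 */
static inline uint8_t mul8_arith(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Reference for comparison only: the same value via a real division. */
static inline uint8_t mul8_ref(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint32_t)a * b + 127u) / 255u);
}
```

Because the shift form has no data-dependent memory access, it is the kind of expression a vectorizer can in principle handle; whether a given compiler does so would have to be confirmed from its vectorizer report.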


On 15/01/16 20:14, Laurent Bourgès wrote:

Sergey,

Did you make any progress?

I finally looked at the preprocessed C code and also enabled
ftree-vectorizer-verbose output:
     CFLAGS := -save-temps -ftree-vectorize -ftree-vectorizer-verbose=2 $(CFLAGS_JDKLIB) $(LIBAWT_CFLAGS), \


I looked at the IntArgbPreSrcMaskFill hotspot (in my EllipseFillTest)
according to oprofile:
samples  %        image name               symbol name
469141   30.0043  libawt.so                IntArgbPreSrcMaskFill


Here is the preprocessed C code. It is still hard to read because macro
expansion leaves many do { } while (0) blocks:

void IntArgbSrcMaskFill (void *rasBase, jubyte *pMask, jint maskOff,
                         jint maskScan, jint width, jint height, jint fgColor,
                         SurfaceDataRasInfo *pRasInfo, NativePrimitive *pPrim,
                         CompositeInfo *pCompInfo)
{
     jint srcA;
     jint srcR, srcG, srcB;
     jint rasScan = pRasInfo->scanStride;
     IntArgbDataType *pRas = (IntArgbDataType *) (rasBase);
     jint DstPix;
     do
     {
         (srcB) = (fgColor) & 0xff;
         (srcG) = ((fgColor) >> 8) & 0xff;
         (srcR) = ((fgColor) >> 16) & 0xff;
         (srcA) = ((fgColor) >> 24) & 0xff;
     }
     while (0);
     if (srcA == 0)
     {
         srcR = srcG = srcB = 0;
         fgColor = 0;
     }
     else
     {
         if (!(0))
         {
             fgColor = (srcA << 24) | (fgColor & 0x00ffffff);
             ;
         }
         if (srcA != 0xff)
         {
             do
             {
                 srcR = mul8table[srcA][srcR];
                 srcG = mul8table[srcA][srcG];
                 srcB = mul8table[srcA][srcB];
             }
             while (0);
         }
         if (0)
         {
             ;
         }
     }
     DstPix = 0;
     ;
     rasScan -= width * 4;
     if (pMask)
     {
         pMask += maskOff;
         maskScan -= width;
         do
         {
             jint w = width;
             ;
             do
             {
                 jint resA;
                 jint resR, resG, resB;
                 jint dstF;
                 jint pathA = *pMask++;
                 if (pathA > 0)
                 {
                     if (pathA == 0xff)
                     {
                         (pRas)[0] = (fgColor);
                     }
                     else
                     {
                         ;
                         dstF = 0xff - pathA;
                         do
                         {
                             DstPix = (pRas)[0];
                             resA = ((juint) DstPix) >> 24;
                         }
                         while (0);
                         resA = mul8table[dstF][resA];
                         if (!(0))
                         {
                             dstF = resA;
                         }
                         resA += mul8table[pathA][srcA];
                         do
                         {
                             resR = (DstPix >> 16) & 0xff;
                             resG = (DstPix >> 8) & 0xff;
                             resB = (DstPix >> 0) & 0xff;
                         }
                         while (0);
                         do
                         {
                             resR = mul8table[dstF][resR] + mul8table[pathA][srcR];
                             resG = mul8table[dstF][resG] + mul8table[pathA][srcG];
                             resB = mul8table[dstF][resB] + mul8table[pathA][srcB];
                         }
                         while (0);
                         if (!(0) && resA && resA < 0xff)
                         {
                             do
                             {
                                 resR = div8table[resA][resR];
                                 resG = div8table[resA][resG];
                                 resB = div8table[resA][resB];
                             }
                             while (0);
                         }
                         (pRas)[0] = (((((((resA) << 8) | (resR)) << 8) | (resG)) << 8) | (resB));
                     }
                 }
                 pRas = ((void *) (((intptr_t) (pRas)) + (4)));
                 ;
             }
             while (--w > 0);
             pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
             ;
             pMask = ((void *) (((intptr_t) (pMask)) + (maskScan)));
         }
         while (--height > 0);
     }
     else
     {
         do
         {
             jint w = width;
             ;
             do
             {
                 (pRas)[0] = (fgColor);
                 pRas = ((void *) (((intptr_t) (pRas)) + (4)));
                 ;
             }
             while (--w > 0);
             pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
             ;
         }
         while (--height > 0);
     }
}

It seems that the alpha blending macros are too complex for the compiler
to vectorize:

Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad inner-loop form.
IntArgb.c:109: note: not vectorized: Bad inner loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: failed: evolution of base is not affine.
IntArgb.c:109: note: bad data references.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: Unknown misalignment, is_packed = 0
IntArgb.c:109: note: virtual phi. skip.
IntArgb.c:109: note: not vectorized: value used after loop.
IntArgb.c:109: note: bad operation or unsupported loop bound.
IntArgb.c:109: note: vectorized 0 loops in function.
IntArgb.c:109: note: not consecutive access rasScan_26 =
pRasInfo_25(D)->scanStride;
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _40 =
mul8table[srcA_36][srcB_33];
IntArgb.c:109: note: not consecutive access _42 =
mul8table[srcA_36][srcB_31];
IntArgb.c:109: note: not consecutive access _44 =
mul8table[srcA_36][srcB_29];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *pMask_1
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _65 =
mul8table[dstF_60][resA_64];
IntArgb.c:109: note: not consecutive access _67 =
mul8table[pathA_58][srcA_36];
IntArgb.c:109: note: not consecutive access _75 =
mul8table[dstF_66][resR_71];
IntArgb.c:109: note: not consecutive access _77 =
mul8table[pathA_58][srcB_6];
IntArgb.c:109: note: not consecutive access _80 =
mul8table[dstF_66][resG_73];
IntArgb.c:109: note: not consecutive access _82 =
mul8table[pathA_58][srcB_7];
IntArgb.c:109: note: not consecutive access _85 =
mul8table[dstF_66][resB_74];
IntArgb.c:109: note: not consecutive access _87 =
mul8table[pathA_58][srcB_8];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: Unknown alignment for access: div8table
IntArgb.c:109: note: not consecutive access _93 =
div8table[resA_69][resR_79];
IntArgb.c:109: note: not consecutive access _95 =
div8table[resA_69][resG_84];
IntArgb.c:109: note: not consecutive access _97 =
div8table[resA_69][resB_89];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.


Any ideas on how to make such code faster, or how to make it vectorizable?
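
One direction that might be worth trying, shown here only as a rough sketch: process a row with a plain counted loop, drop the per-pixel branches and replace the table lookups with arithmetic. The sketch assumes a premultiplied ARGB destination (as in the IntArgbPre loop named in the profile), so no div8table step is needed. blend_row_pre and mul8_arith are hypothetical names, not libawt functions, and whether gcc 4.8 actually vectorizes this loop has not been checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rounded (a * b) / 255 for a, b in 0..255, no table. */
static inline uint32_t mul8_arith(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128u;
    return (t + (t >> 8)) >> 8;
}

/*
 * Hypothetical sketch: blend one row of premultiplied ARGB pixels with a
 * premultiplied source colour through an 8-bit coverage mask.
 * Per component: res = mul8(pathA, src) + mul8(255 - pathA, dst).
 * pathA == 0 leaves dst unchanged and pathA == 255 yields src, so the
 * special-case branches are not needed for correctness.
 */
void blend_row_pre(uint32_t *restrict dst, const uint8_t *restrict mask,
                   uint32_t srcPre, size_t width)
{
    const uint32_t sA = srcPre >> 24;
    const uint32_t sR = (srcPre >> 16) & 0xff;
    const uint32_t sG = (srcPre >> 8) & 0xff;
    const uint32_t sB = srcPre & 0xff;

    for (size_t i = 0; i < width; i++) {
        uint32_t pathA = mask[i];
        uint32_t dstF = 0xff - pathA;
        uint32_t d = dst[i];
        uint32_t a = mul8_arith(pathA, sA) + mul8_arith(dstF, d >> 24);
        uint32_t r = mul8_arith(pathA, sR) + mul8_arith(dstF, (d >> 16) & 0xff);
        uint32_t g = mul8_arith(pathA, sG) + mul8_arith(dstF, (d >> 8) & 0xff);
        uint32_t b = mul8_arith(pathA, sB) + mul8_arith(dstF, d & 0xff);
        dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

The trade-off is that fully transparent and fully opaque mask bytes now go through the full blend instead of being skipped, so this only pays off if the loop really does get vectorized.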


Finally, I noticed that the macros with the LCD suffix seem to perform
proper gamma correction:

void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef
*glyphs, jint totalGlyphs, jint fgpixel, jint argbcolor, jint clipLeft,
jint clipTop, jint clipRight, jint clipBottom, jint rgbOrder, unsigned
char *gammaLut, unsigned char * invGammaLut, NativePrimitive *pPrim,
CompositeInfo *pCompInfo)
...
     srcR = invGammaLut[srcR];
     srcG = invGammaLut[srcG];
     srcB = invGammaLut[srcB];
...
alpha blending
...
     dstR = gammaLut[dstR];
     dstG = gammaLut[dstG];
     dstB = gammaLut[dstB];

That is exactly what I need in order to implement correct gamma correction
in mask fill operations (shape draw / fill) for the software loops
(buffered image rendering).
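
As a rough illustration of the idea for a single colour channel: convert to linear values through an inverse gamma table, blend, then convert back through the gamma table. Everything in this sketch is an assumption: the tables are built for a gamma of 2.0 purely to avoid floating point, init_gamma_luts and blend_channel_gamma are hypothetical names, and the destination is assumed to be linearised before the blend, a step that the excerpt above elides.

```c
#include <stdint.h>

/* Hypothetical tables for a gamma of 2.0 (an assumption for this sketch). */
static uint8_t invGammaLut[256]; /* gamma-encoded value -> linear value */
static uint8_t gammaLut[256];    /* linear value -> gamma-encoded value */

static void init_gamma_luts(void)
{
    /* encoded -> linear: (i / 255)^2 scaled back to 0..255, rounded */
    for (int i = 0; i < 256; i++) {
        invGammaLut[i] = (uint8_t)((i * i + 127) / 255);
    }
    /* linear -> encoded: nearest integer to sqrt(255 * l) */
    for (int l = 0; l < 256; l++) {
        int target = 255 * l;
        int e = 0;
        while ((e + 1) * (e + 1) <= target) {
            e++;                      /* e = floor(sqrt(target)) */
        }
        if (target - e * e > (e + 1) * (e + 1) - target) {
            e++;                      /* round to the nearer square */
        }
        gammaLut[l] = (uint8_t)e;
    }
}

/* Blend one channel in linear space with coverage pathA (0..255). */
static uint8_t blend_channel_gamma(uint8_t src, uint8_t dst, uint8_t pathA)
{
    uint32_t s = invGammaLut[src];
    uint32_t d = invGammaLut[dst];
    uint32_t lin = (pathA * s + (255u - pathA) * d + 127u) / 255u;
    return gammaLut[lin];
}
```

With these gamma-2.0 tables, a coverage of 128 for white over black gives 181 rather than the 128 that blending the encoded values directly would produce, which is the visible difference gamma correction makes on antialiased edges.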

I will now try to figure out how that C code is generated by the nested
macros!

Laurent



--
Best regards, Sergey.
