Zhigang-

I tested an OpenCL kernel that does native multiplies and adds as well as the 
fmad method, and for both Beignet and VPG the native instructions performed 
better. The relative ratio of VPG to Beignet performance for a float16 kernel 
with a 64 work-group size and an 8MB workload is 46:1 (these sizes performed 
best). I don't think the MAD instruction is the issue; rather, I suspect VPG 
is using all SIMD lanes while Beignet is "scalarizing" all the instructions.

I did try the patch you suggested and did see a noticeable difference in 
performance. 

Thanks,
Tony



-----Original Message-----
From: Zhigang Gong [mailto:[email protected]] 
Sent: Tuesday, August 19, 2014 7:59 PM
To: Moore, Anthony W
Cc: Song, Ruiling; [email protected]
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for 
better performance.

Tony,

On Tue, Aug 19, 2014 at 03:40:12PM +0000, Moore, Anthony W wrote:
> Ruiling-
> Thanks for your response. Yes, that makes sense. That problem did not occur 
> to me since I saw the utests pass after I tried the change, but I understand 
> that some tests could fail. Regarding the DWORD vs. BYTE loading performance 
> you mention, I was curious and can confirm that a 1-byte gather is much 
> slower than a DWORD load.
> The performance testing I've been doing compares Beignet and Intel's 
> closed-source driver. We're seeing much better performance from Intel's 
> driver on Haswell using a custom set of kernels available in OpenCV. I've 
> been focusing on a very simple kernel (RGB2Gray) where Beignet takes 2x 
> longer, and I suspect that the 3-byte loads are contributing to the 
> slowdown. I reduced this to a single load as an experiment, and we came very 
> close to Intel's driver. My suspicion is that this is the case for many 
> kernels, which is why I'm trying to combine loads where possible. I'm 
> wondering how difficult it would be to adapt readByteAsDWord to extract 
> multiple bytes by reading successive DWORDs, or even doing an unaligned 
> oblock read.
I will look into this issue. Hopefully we can get comparable performance even 
with an unaligned read.
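To make the readByteAsDWord idea above concrete, here is a minimal host-side 
sketch in Python (a model of the approach, not actual Beignet backend code; 
the function name gather_bytes and the little-endian layout are illustrative 
assumptions): read the aligned DWORDs that cover the requested byte range, 
then shift the wanted bytes out.

```python
import struct

def gather_bytes(buf, offset, count):
    """Model an unaligned byte gather built from aligned DWORD reads:
    fetch the 4-byte-aligned words covering [offset, offset+count),
    splice them together, and shift out the wanted bytes
    (little-endian byte order assumed for illustration)."""
    base = offset & ~3                    # round down to DWORD alignment
    end = (offset + count + 3) & ~3       # round up past the last byte
    # Read the covering aligned DWORDs and combine them into one integer.
    combined = 0
    for i, addr in enumerate(range(base, end, 4)):
        word = struct.unpack_from('<I', buf, addr)[0]
        combined |= word << (32 * i)
    # Discard the leading bytes before 'offset', then peel off 'count' bytes.
    combined >>= 8 * (offset - base)
    return [(combined >> (8 * i)) & 0xFF for i in range(count)]
```

For example, a 3-byte read starting at the unaligned offset 5 of a 16-byte 
buffer touches only aligned words, yet returns bytes 5, 6 and 7.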

> Another experiment we did was to compare the available FLOPs using a kernel 
> found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver 
> came very close to the theoretical GFLOPs on HSW and BYT with float16, but 
> Beignet was much lower. Looking at the IR/asm, it seems that it's not taking 
> advantage of SIMD. Does the backend always split up vectors, or only in 
> certain situations?

I had a quick look at the benchmark. It's a very simple loop test.
I assumed some mad instructions would be used, but they are not. I found that 
mad pattern matching is disabled by my patch under SIMD16 for some reason. But 
my recent experience shows that mad should be better even if it generates the 
same number of instructions.
You may try reverting the following patch to see whether it has any 
performance impact:
commit d73170df3508d18e250d0af118e3b7955401194f
Author: Zhigang Gong <[email protected]>
Date:   Thu May 15 13:35:00 2014 +0800

    GBE: disable mad for some cases.

Another point is that Beignet doesn't yet recognize a do/while loop as a 
structured basic block, so we still encode that block in an unstructured 
fashion and introduce several additional instructions to maintain the software 
PCIPs. This hurts performance, and we will continue to optimize Beignet to 
close this gap. Before we get that done, could you help provide more details 
of the performance comparison (a relative ratio is good enough, no need for 
real scores) under different test data types, for example float/float2/float4, 
and data set sizes, e.g. 512K/1M with a 256 work group size?

Thanks for your valuable feedback.
Zhigang.

> Thanks,
> Tony
> 
> -----Original Message-----
> From: Song, Ruiling
> Sent: Monday, August 18, 2014 7:50 PM
> To: Moore, Anthony W; [email protected]
> Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for 
> better performance.
> 
> Hi Tony,
> 
> In short, it is not easy to handle merged 2-, 3- or 4-byte reads/writes in 
> the backend.
> Currently, if you only change the logic in llvm_loadstore_optimization.cpp 
> to merge byte reads/writes, you may get wrong results if the starting 
> address of the merged memory access is not 4-byte-aligned.
> The later steps simply treat a 4-byte load as one int load (an int load 
> always needs a 4-byte-aligned address).
> And on Gen7, an int load is much faster than a byte load, so you will see a 
> significant performance difference.
> 
> See emitByteGather() in gen_insn_selection.cpp:
> 
>   if(valueNum > 1) {
>     // read 4 bytes as 1 int and unpack it; here the starting address
>     // must be 4-byte-aligned
>   } else {
>     GBE_ASSERT(insn.getValueNum() == 1);
>     // read 1 int and extract the actual byte using some logic-shifts,
>     // and you can see here it is not too easy to handle 2, 3 or 4 byte
>     // reads.
>   }
> 
> I am not sure if I have explained it clearly.
> 
> Could you share more details about your test? Which OpenCV kernels or 
> related performance tests in OpenCV? Then I could do some performance 
> testing.
> I am not sure whether you are hitting something like vload4(int offset, 
> uchar *p)? The OpenCL spec does not guarantee that the address 'p' is 
> 4-byte-aligned.
> If it is a uchar4* read/write, things are different: the address is 
> 4-byte-aligned, and the performance is much better than vload4 of a uchar* 
> in Beignet.
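[To make Ruiling's point concrete, here is a tiny host-side Python model of 
why a merged byte access that starts at a non-4-byte-aligned address goes 
wrong once it is treated as one int load. It is an illustration only: it 
assumes the DWORD read simply drops the low two address bits, which is a 
stand-in for the real backend behavior, not a description of it.]

```python
import struct

def dword_load(buf, addr):
    # Toy model of a merged 4-byte access lowered to one int load:
    # assume the load only works from a 4-byte-aligned address, so the
    # low two address bits are effectively dropped (an assumption for
    # illustration; the actual failure mode on Gen may differ).
    return struct.unpack_from('<I', buf, addr & ~3)[0]

buf = bytes(range(8))
aligned = dword_load(buf, 4)     # reads bytes 4..7, as intended
unaligned = dword_load(buf, 5)   # intended bytes 5..8, but gets 4..7
```

Here the unaligned load silently returns the same value as the aligned one, 
which is exactly the wrong-result case described above.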
> 
> Thanks!
> Ruiling
> 
> -----Original Message-----
> From: Beignet [mailto:[email protected]] On Behalf 
> Of Moore, Anthony W
> Sent: Monday, August 18, 2014 11:47 PM
> To: [email protected]
> Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for 
> better performance.
> 
> Hi,
> 
> For this patch 
> http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are 
> only DWORDs (and floats) enabled for merging? I tried adding the 8-bit and 
> 16-bit types and saw some significant performance improvement with some of 
> OpenCV's kernels.
> 
> +        // we only support DWORD data type merge
> +        if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;
> 
> Thanks!
> Tony
> _______________________________________________
> Beignet mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/beignet