Tony,

On Tue, Aug 19, 2014 at 03:40:12PM +0000, Moore, Anthony W wrote:
> Ruiling-
> Thanks for your response. Yes, that makes sense. That problem did not occur 
> to me since I saw the utests pass after I tried the change, but understand 
> that some tests could fail. In regards to the DWORD vs BYTE loading 
> performance you speak about, I was curious about the performance and can 
> confirm that 1 byte gather is much slower than DWORD.
> The performance testing I've been doing has been comparing Beignet and 
> Intel's closed source driver. We're seeing much better performance of Intel's 
> driver on Haswell using a custom set of kernels available in OpenCV. I've 
> been focusing on a very simple kernel (RGB2Gray) where Beignet takes 2x 
> longer, and I suspect that the 3-byte loads are contributing to the 
> slowdown. I reduced this to a single load as an experiment and we came very 
> close to Intel's driver. My suspicion is that this is the case for many 
> kernels, which is why I'm trying to combine loads where possible. I'm 
> wondering how difficult it would be to adapt readByteAsDWord to extract 
> multiple bytes by reading successive DWORDs, or even to do an unaligned 
> oblock read.
I will look into this issue. I hope we can get comparable performance even 
with unaligned reads.
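The multi-byte extraction Tony asks about could be sketched roughly as below. This is an illustrative C sketch only, not Beignet's readByteAsDWord: the helper names are hypothetical, it assumes a little-endian target (as on Gen/x86), and it reads one aligned DWORD past the requested bytes in the worst case, which a real implementation would have to guard against at buffer ends.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Aligned 4-byte read of DWORD number `dword_idx` from `base`. */
static uint32_t load_dword_aligned(const uint8_t *base, size_t dword_idx) {
    uint32_t w;
    memcpy(&w, base + dword_idx * 4, sizeof w);
    return w;
}

/* Read `n` (1..4) consecutive bytes starting at an arbitrary byte offset
 * `off` by fetching the two aligned DWORDs that cover them, then shifting
 * and masking the wanted bytes out (little-endian byte order). */
static uint32_t load_bytes_via_dwords(const uint8_t *base, size_t off, int n) {
    uint32_t lo = load_dword_aligned(base, off / 4);
    uint32_t hi = load_dword_aligned(base, off / 4 + 1); /* may over-read */
    uint64_t window = ((uint64_t)hi << 32) | lo;         /* 8-byte window */
    window >>= (off % 4) * 8;                            /* align to `off` */
    uint64_t mask = (n == 4) ? 0xFFFFFFFFu : ((1u << (8 * n)) - 1);
    return (uint32_t)(window & mask);
}
```

For an RGB2Gray-style kernel this would turn three 1-byte gathers per pixel into at most two DWORD reads, which is the kind of saving being discussed.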

> Another experiment we did was to compare the available flops using a kernel 
> found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver 
> came very close to the theoretical GFLOPs on HSW and BYT with float16, but 
> Beignet was much lower. Looking at the IR/asm it seems that it's not taking 
> advantage of SIMD. Does the backend always split up vectors or only in 
> certain situations?

I had a quick look at the benchmark. It's a very simple loop test.
I would expect some mad instructions to be used, but they aren't. I found
that mad pattern matching is disabled by my patch under SIMD16 for some
reason. But my recent experience shows that mad should be better even
when it generates the same number of instructions.
You may try reverting the following patch to see whether it has any
performance impact.
commit d73170df3508d18e250d0af118e3b7955401194f
Author: Zhigang Gong <[email protected]>
Date:   Thu May 15 13:35:00 2014 +0800

    GBE: disable mad for some cases.

Another point is that Beignet doesn't yet recognize do/while loops as 
structured basic blocks, so we still encode such a block in an unstructured 
fashion and introduce several additional instructions to maintain the 
software PCIPs. This hurts performance, and we will continue to optimize 
Beignet to close this gap. Before we get that done, could you help provide 
more details of the performance comparison (a relative ratio is good enough, 
no need for real scores) under different test data types, for example 
float/float2/float4, and data set sizes, e.g. 512K/1M with a 256 work-group 
size?

Thanks for your valuable feedback.
Zhigang.

> Thanks,
> Tony
> 
> -----Original Message-----
> From: Song, Ruiling 
> Sent: Monday, August 18, 2014 7:50 PM
> To: Moore, Anthony W; [email protected]
> Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for 
> better performance.
> 
> Hi Tony,
> 
> In short, it is not easy to handle merged 2-, 3- or 4-byte reads/writes in 
> the backend.
> Currently, if you only change the logic in llvm_loadstore_optimization.cpp 
> to merge byte reads/writes, you may get wrong results when the starting 
> address of the merged memory access is not 4-byte-aligned.
> The later steps will simply treat a 4-byte load as one int load (an int 
> load always needs a 4-byte-aligned address).
> And on Gen7, an int load is much faster than a byte load, so you will see a 
> significant performance improvement when byte loads are merged into int 
> loads.
> 
> See emitByteGather() in gen_insn_selection.cpp:
> 
>   if (valueNum > 1) {
>     // read 4 bytes as 1 int and unpack it; here the starting address
>     // must be 4-byte-aligned
>   } else {
>     GBE_ASSERT(insn.getValueNum() == 1);
>     // read 1 int and extract the actual byte using some logical shifts;
>     // you can see here it is not too easy to handle 2-, 3- or 4-byte reads.
>   }
> I am not sure if I explain it clearly.
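(The single-byte path Ruiling describes — read one aligned int and shift the wanted byte out — can be sketched in plain C as below. The helper name is hypothetical and this is not the gen_insn_selection.cpp code; it assumes a little-endian target.)

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Fetch the aligned DWORD containing byte `off`, then logical-shift the
 * byte into the low position and truncate (little-endian layout). */
static uint8_t read_byte_as_dword(const uint8_t *base, size_t off) {
    uint32_t w;
    memcpy(&w, base + (off & ~(size_t)3), sizeof w); /* aligned int load */
    return (uint8_t)(w >> ((off & 3) * 8));          /* extract the byte */
}
```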
> 
> Could you share more details about your test? Which OpenCV kernels, or 
> which related performance test in OpenCV? Then I could do some performance 
> testing.
> I am not sure whether you are hitting something like 
> vload4(int offset, uchar *p)? The OpenCL spec does not guarantee that the 
> address 'p' is 4-byte-aligned.
> If it is a uchar4* read/write, things are different: the address is 
> 4-byte-aligned, and the performance is much better than vload4 of uchar* 
> in Beignet.
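(The alignment distinction above is the crux: a uchar4* access is guaranteed 4-byte-aligned, so it can be lowered to one int load, while vload4(offset, uchar *p) starts at p + 4*offset from an arbitrary p. A hypothetical C predicate mirroring the check a backend would need before emitting an int load:)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

/* Can a 4-byte access starting at `p + byte_offset` be serviced by a
 * single int load?  Only if that start address is 4-byte-aligned. */
static bool can_use_int_load(uintptr_t p, size_t byte_offset) {
    return ((p + byte_offset) & 3) == 0;
}
```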
> 
> Thanks!
> Ruiling
> 
> -----Original Message-----
> From: Beignet [mailto:[email protected]] On Behalf Of 
> Moore, Anthony W
> Sent: Monday, August 18, 2014 11:47 PM
> To: [email protected]
> Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for 
> better performance.
> 
> Hi,
> 
> For this patch 
> http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are 
> only DWORDs (and floats) enabled for merging? I tried adding 8-bit and 
> 16-bit types and saw a significant performance improvement with some of 
> OpenCV's kernels.
> 
> +        // we only support DWORD data type merge
> +        if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;
> 
> Thanks!
> Tony
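(For context on the quoted check: the patch merges successive loads only for 32-bit types, and the thread's concern is that merged narrower loads are only safe when the combined access keeps the alignment the wider load needs. A hypothetical C predicate sketching that eligibility rule — not the llvm_loadstore_optimization.cpp logic:)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

/* Could `count` consecutive 1-byte loads starting at `first_addr` be
 * merged into one wider load?  This sketch only checks that a supported
 * width exists and that the merged access is naturally aligned; a real
 * pass must also prove the loads are truly consecutive and same-type. */
static bool can_merge_byte_loads(uintptr_t first_addr, size_t count) {
    if (count != 2 && count != 4) return false; /* widths we could emit */
    return (first_addr % count) == 0;           /* natural alignment */
}
```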
> _______________________________________________
> Beignet mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/beignet