Re: [Beignet] [PATCH] GBE: Merge successive load/store together for better performance.

Zou, Nanhai Tue, 19 Aug 2014 18:00:04 -0700

Hi Anthony,
        This is a good point.
We will consider providing an API to read the timestamp.

Thanks
Zou Nanhai

-----Original Message-----
From: Beignet [mailto:[email protected]] On Behalf Of 
Moore, Anthony W
Sent: Wednesday, August 20, 2014 1:54 AM
To: Song, Ruiling; [email protected]
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for 
better performance.

Since we're discussing performance, do you know what it would take to expose 
the Timestamp register to an OpenCL kernel? It would enable people to profile 
sections of their code. It seems like the assembly would just be a MOV, but all 
of the LLVM logic is foreign to me.
thanks

-----Original Message-----
From: Moore, Anthony W 
Sent: Tuesday, August 19, 2014 8:40 AM
To: Song, Ruiling; [email protected]
Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for 
better performance.

Ruiling-
Thanks for your response. Yes, that makes sense. That problem did not occur to 
me because the utests passed after I tried the change, but I understand that 
some tests could fail. Regarding the DWORD vs. BYTE load performance you 
mention, I was curious and can confirm that a 1-byte gather is much slower 
than a DWORD load.
The performance testing I've been doing compares Beignet with Intel's 
closed-source driver. We're seeing much better performance from Intel's driver 
on Haswell using a custom set of kernels available in OpenCV. I've been 
focusing on a very simple kernel (RGB2Gray) where Beignet takes 2x longer, and 
I suspect that the 3 byte loads are contributing to the slowdown. I reduced 
this to a single load as an experiment and we came very close to Intel's 
driver. My suspicion is that this is the case for many kernels, which is why 
I'm trying to combine loads where possible. I'm wondering how difficult it 
would be to adapt readByteAsDWord to extract multiple bytes by reading 
successive DWORDs, or even to do an unaligned oblock read.
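
The idea of extracting several bytes by reading successive DWORDs can be 
sketched in plain C. This is only an illustration, not Beignet code: the 
function name gather_bytes_via_dwords is hypothetical, memory is modelled as an 
array of aligned 32-bit words, and byte k of a word is assumed to occupy bits 
8k..8k+7 (little-endian layout).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return n (1..4) consecutive bytes starting at an
 * arbitrary byte address, packed into the low bits of the result, using
 * only aligned 32-bit reads. */
uint32_t gather_bytes_via_dwords(const uint32_t *mem, size_t byte_addr,
                                 unsigned n)
{
    assert(n >= 1 && n <= 4);

    size_t   idx    = byte_addr / 4;              /* aligned dword index   */
    unsigned within = (unsigned)(byte_addr % 4);  /* byte offset in dword  */
    unsigned shift  = within * 8;

    /* Bytes that live in the first dword. */
    uint32_t value = mem[idx] >> shift;

    /* If the run crosses a dword boundary, fetch the next dword too.
     * within != 0 here, so the left shift is always less than 32. */
    if (within + n > 4)
        value |= mem[idx + 1] << (32 - shift);

    /* Keep only the n requested bytes. */
    uint32_t mask = (n == 4) ? 0xFFFFFFFFu : ((1u << (8 * n)) - 1u);
    return value & mask;
}
```

In this sketch, a 3-byte read needs at most two aligned DWORD reads plus a few 
shifts, regardless of the alignment of the starting address.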
Another experiment we did was to compare the available flops using a kernel 
found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver 
came very close to the theoretical GFLOPs on HSW and BYT with float16, but 
Beignet was much lower. Looking at the IR/asm it seems that it's not taking 
advantage of SIMD. Does the backend always split up vectors or only in certain 
situations?
Thanks,
Tony

-----Original Message-----
From: Song, Ruiling 
Sent: Monday, August 18, 2014 7:50 PM
To: Moore, Anthony W; [email protected]
Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for 
better performance.

Hi Tony,

In short, it is not easy to handle merged 2-, 3- or 4-byte reads/writes in the 
backend.
Currently, if you only change the logic in llvm_loadstore_optimization.cpp to 
merge byte reads/writes, you may get wrong results when the starting address 
of the merged memory access is not 4-byte-aligned.
The later steps will simply treat a 4-byte load as one int load (an int load 
always needs a 4-byte-aligned address).
And on Gen7, an int load is much faster than a byte load, so you will see a 
significant performance improvement.

See emitByteGather() in gen_insn_selection.cpp:

if(valueNum > 1) {
  // read 4 bytes as 1 int and unpack it; here the starting address must be
  // 4-byte-aligned
} else {
  GBE_ASSERT(insn.getValueNum() == 1);
  // read 1 int and extract the actual byte using a logical shift;
  // you can see that it is not easy to handle 2-, 3- or 4-byte reads here.
}
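
The two branches described in those comments can be mimicked in plain C. This 
is a rough sketch only, not the Beignet implementation: unpack_dword and 
extract_byte are hypothetical names, memory is modelled as an array of aligned 
32-bit words, and byte k of a word is assumed to occupy bits 8k..8k+7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Aligned case: one dword already holds all four bytes, so unpack them. */
void unpack_dword(uint32_t dw, uint8_t out[4])
{
    for (unsigned k = 0; k < 4; ++k)
        out[k] = (uint8_t)(dw >> (8 * k));
}

/* Single-byte case: load the aligned dword that contains the byte,
 * then shift the wanted byte down to bit 0. */
uint8_t extract_byte(const uint32_t *mem, size_t byte_addr)
{
    assert(mem != NULL);
    uint32_t dw    = mem[byte_addr / 4];
    unsigned shift = (unsigned)(byte_addr % 4) * 8;
    return (uint8_t)(dw >> shift);
}
```

Neither function covers a 2-, 3- or 4-byte run that starts at a misaligned 
address, because such a run may straddle two dwords.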
I am not sure whether I have explained it clearly.

Could you share more details about your test? Which OpenCV kernels or related 
OpenCV performance tests are you using? Then I could do some performance 
testing myself.
I am not sure whether you are hitting something like vload4(int offset, uchar 
* p). The OpenCL spec does not guarantee that the address 'p' is 
4-byte-aligned.
If it is a uchar4* read/write, things are different: the address is 
4-byte-aligned, and the performance is much better than vload4 on a uchar* in 
Beignet.
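
The alignment point can be checked with a little address arithmetic. The C 
sketch below is illustrative only: both function names are hypothetical, and 
it relies on the OpenCL rule that vload4(offset, p) on a uchar pointer reads 
the four bytes starting at p + 4 * offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte address read by vload4(offset, p) when p points to uchar. */
uintptr_t vload4_uchar_address(uintptr_t p, size_t offset)
{
    return p + 4u * (uintptr_t)offset;
}

/* Nonzero when addr can be used for a single aligned 32-bit load. */
int is_dword_aligned(uintptr_t addr)
{
    return (addr % 4u) == 0;
}
```

Because 4 * offset is always a multiple of 4, the access is dword-aligned 
exactly when p itself is, which a plain uchar* does not promise.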

Thanks!
Ruiling

-----Original Message-----
From: Beignet [mailto:[email protected]] On Behalf Of 
Moore, Anthony W
Sent: Monday, August 18, 2014 11:47 PM
To: [email protected]
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for 
better performance.

Hi,

For this patch 
http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are 
only DWORDs (and floats) enabled for merging? I tried adding 8-bit and 16-bit 
and saw some significant performance improvement with some of OpenCV's kernels.

+        // we only support DWORD data type merge
+        if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;

Thanks!
Tony
_______________________________________________
Beignet mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/beignet