Hi Anthony,
This is a good point.
We will consider providing an API to read the timestamp.
Thanks
Zou Nanhai
-----Original Message-----
From: Beignet [mailto:[email protected]] On Behalf Of
Moore, Anthony W
Sent: Wednesday, August 20, 2014 1:54 AM
To: Song, Ruiling; [email protected]
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for
better performance.
Since we're discussing performance, do you know what it would take to expose the
Timestamp register to an OpenCL kernel? It would enable people to profile
sections of their code. Seems like the assembly would just be a MOV, but all of
the LLVM logic is foreign to me.
thanks
-----Original Message-----
From: Moore, Anthony W
Sent: Tuesday, August 19, 2014 8:40 AM
To: Song, Ruiling; [email protected]
Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for
better performance.
Ruiling-
Thanks for your response. Yes, that makes sense. That problem did not occur to
me since I saw the utests pass after I tried the change, but I understand that
some tests could fail. Regarding the DWORD vs BYTE loading performance you
mentioned, I was curious about it and can confirm that a 1-byte
gather is much slower than a DWORD gather.
The performance testing I've been doing has been comparing Beignet and Intel's
closed source driver. We're seeing much better performance of Intel's driver on
Haswell using a custom set of kernels available in OpenCV. I've been focusing
on a very simple kernel (RGB2Gray) where Beignet takes 2x longer, and I suspect
that the 3 byte loads are what contribute to the slowdown. I reduced this
to a single load as an experiment and we came very close to Intel's driver. My
suspicion is that this is the case for many kernels, which is why I'm trying to
combine loads where possible. I'm wondering how difficult it would be to adapt
the readByteAsDWord to extract multiple bytes by reading successive DWORDS or
even doing an unaligned oblock read.
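To make the "successive DWORDs" idea concrete, here is a minimal host-side C++ sketch. It is not Beignet code: readBytesViaDwords and the vector-of-DWORDs memory model are hypothetical, and bytes are assumed to be packed little-endian within each DWORD. It reads 1 to 4 consecutive bytes from an arbitrary byte address using at most two aligned DWORD reads.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: memory is only readable as aligned DWORDs (mem[i] holds
// byte addresses 4*i .. 4*i+3, lowest address in the lowest 8 bits).
// Returns `count` consecutive bytes starting at byteAddr, packed little-endian
// into the low bits of the result.
static uint32_t readBytesViaDwords(const std::vector<uint32_t>& mem,
                                   uint32_t byteAddr, unsigned count) {
  assert(count >= 1 && count <= 4);
  const uint32_t index = byteAddr / 4;         // enclosing aligned DWORD
  const unsigned offset = byteAddr % 4;        // byte offset inside that DWORD
  assert(index < mem.size());

  uint64_t window = mem[index];
  // A second aligned read is needed only when the range crosses a DWORD boundary.
  if (offset + count > 4) {
    assert(index + 1 < mem.size());
    window |= static_cast<uint64_t>(mem[index + 1]) << 32;
  }

  // Shift the wanted bytes down to bit 0 and mask off everything above them.
  const uint64_t mask = (1ull << (count * 8)) - 1;
  return static_cast<uint32_t>((window >> (offset * 8)) & mask);
}
```

On the GPU the 64-bit window would presumably be replaced by two 32-bit shifts and an OR, but the address arithmetic is the same.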
Another experiment we did was to compare the available flops using a kernel
found here: http://www.bealto.com/gpu-benchmarks_flops.html. Intel's driver
came very close to the theoretical GFLOPs on HSW and BYT with float16, but
Beignet was much lower. Looking at the IR/asm it seems that it's not taking
advantage of SIMD. Does the backend always split up vectors or only in certain
situations?
Thanks,
Tony
-----Original Message-----
From: Song, Ruiling
Sent: Monday, August 18, 2014 7:50 PM
To: Moore, Anthony W; [email protected]
Subject: RE: [Beignet] [PATCH] GBE: Merge successive load/store together for
better performance.
Hi Tony,
In short, it is not easy to handle merged 2-, 3- or 4-byte reads/writes in the
backend.
Currently, if you only change the logic in llvm_loadstore_optimization.cpp to
merge byte reads/writes, you may get wrong results if the starting address
of the merged memory access is not 4-byte-aligned.
The later steps will simply treat a 4-byte load as one int load (an int load
always needs a 4-byte-aligned address).
And on Gen7, an int load performs much better than a byte load. So you will see
a significant performance improvement.
See emitByteGather() in gen_insn_selection.cpp:
if(valueNum > 1) {
  // read 4 byte as 1 int and unpack it, here starting address must be
  // 4-byte-aligned
} else {
  GBE_ASSERT(insn.getValueNum() == 1);
  // read 1 int and extract actual byte using some logic-shift
  // and you can see here it is not too easy to handle 2, 3 or 4 bytes read.
}
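For illustration only, the two cases can be modeled in plain C++ on the host. This is a sketch, not the code in gen_insn_selection.cpp: all function names are hypothetical, bytes are assumed little-endian within a DWORD, and treating a merged read as "one int load" is modeled as simply dropping the low two address bits, which is an assumption about the failure mode rather than documented hardware behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: memory can only be read one aligned DWORD at a time.
static uint32_t loadAlignedDword(const std::vector<uint32_t>& mem,
                                 uint32_t byteAddr) {
  assert(byteAddr % 4 == 0);
  return mem[byteAddr / 4];
}

// Single-byte case: read the enclosing DWORD, then shift the wanted byte down
// and truncate to 8 bits.
static uint8_t readByteViaDword(const std::vector<uint32_t>& mem,
                                uint32_t byteAddr) {
  const uint32_t aligned = byteAddr & ~3u;     // start of enclosing DWORD
  const unsigned shift = (byteAddr & 3u) * 8;  // bit position of the byte
  return static_cast<uint8_t>(loadAlignedDword(mem, aligned) >> shift);
}

// Merged case treated as one int load. ASSUMPTION: an unaligned address is
// modeled as being rounded down, so the result is only right when byteAddr is
// already 4-byte-aligned.
static uint32_t mergedReadAsOneInt(const std::vector<uint32_t>& mem,
                                   uint32_t byteAddr) {
  return loadAlignedDword(mem, byteAddr & ~3u);
}
```

With memory bytes 0x11, 0x22, 0x33, 0x44, 0x55 starting at address 0, a merged 4-byte read at address 1 should yield 0x55443322, but this model returns 0x44332211, which is the kind of wrong result described above.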
I am not sure if I explained it clearly.
Could you share more details about your test? Which OpenCV kernels or
related performance tests in OpenCV? Then I could do some performance testing.
I am not sure if you are hitting something like vload4(int offset, uchar * p)?
The OpenCL spec does not ensure the address 'p' is 4-byte-aligned.
If it is a uchar4* read/write, things are different: the address is
4-byte-aligned, and the performance is much better than vload4 of uchar* in
Beignet.
Thanks!
Ruiling
-----Original Message-----
From: Beignet [mailto:[email protected]] On Behalf Of
Moore, Anthony W
Sent: Monday, August 18, 2014 11:47 PM
To: [email protected]
Subject: Re: [Beignet] [PATCH] GBE: Merge successive load/store together for
better performance.
Hi,
For this patch
http://lists.freedesktop.org/archives/beignet/2014-May/002879.html, why are
only DWORDs (and floats) enabled for merging? I tried adding 8-bit and 16-bit
and saw some significant performance improvement with some of OpenCV's kernels.
+ // we only support DWORD data type merge
+ if(!ty->isFloatTy() && !ty->isIntegerTy(32)) continue;
Thanks!
Tony
_______________________________________________
Beignet mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/beignet