Thanks for the response. Glad to hear there's a possibility for better performance! If you have similar data on Baytrail/Ivybridge, please share it with the list so we can compare against our results.

tony
On Wed Nov 26 2014 at 7:41:21 AM Zhigang Gong <[email protected]> wrote:
> The load combination optimization is really not very helpful for most
> uint vectors. You can easily disable this optimization and try
> vload_bench_uint with a global buffer; it should only give some gain
> with the uint3 vector. (Open the file llvm_to_gen.cpp and comment out
> passes.add(createLoadStoreOptimizationsPass());)
>
> And if you are using Haswell, then it seems that there are some DC
> configuration issues on your platform. It should not be so much slower
> than the constant buffer under uint2 load. On uint2 load the cache
> locality is very good, and almost all data comes from the L3 cache,
> which should be much faster than 11.7GB/s. I tested it on my Haswell
> machine; uint2 performance with a global buffer is more than 80GB/s.
>
> On Wed, Nov 26, 2014 at 9:30 PM, Tony Moore <[email protected]> wrote:
> > Hello,
> > I'm actually using Haswell for these experiments. I modified the
> > benchmark_run app to use read-only and constant memory, and the
> > biggest improvement was with small vectors of uint size. I'm guessing
> > the loss in performance with larger vectors is because they are not
> > being combined. Some sizes did do worse. Logs attached.
> >
> >                   constant  |  global
> >
> > vload_bench_uint()
> > Vector size 2:
> >   Offset 0 :      58.2GB/S  |  11.7GB/S
> >   Offset 1 :      59.6GB/S  |  10.7GB/S
> > Vector size 3:
> >   Offset 0 :      34.3GB/S  |   7.6GB/S
> >   Offset 1 :      34.3GB/S  |   7.6GB/S
> >   Offset 2 :      34.3GB/S  |   7.8GB/S
> > Vector size 4:
> >   Offset 0 :      28.1GB/S  |  12.6GB/S
> >   Offset 1 :      28.1GB/S  |  10.4GB/S
> >   Offset 2 :      28.1GB/S  |  10.3GB/S
> >   Offset 3 :      28.1GB/S  |  10.2GB/S
> >
> > On Tue Nov 25 2014 at 9:47:09 PM Zhigang Gong <[email protected]> wrote:
> >>
> >> I guess Tony is using BayTrail, so the constant cache (RO)
> >> is just half of the standard IvyBridge's. Actually the cache
> >> influence is highly related to the access pattern and not
> >> so much to the memory size.
> >>
> >> If the access always has a big stride, then the performance
> >> will not be good even when using less than the constant cache size,
> >> as the cache-to-memory mapping is not 1:1. It may still
> >> cause cache replacement if two constant buffers conflict
> >> on the same cache bank.
> >>
> >> If the access locality is good, then even if you use a
> >> very large amount of constant cache, the miss rate will
> >> be relatively low, and the performance will be good.
> >>
> >> Another related issue: according to the OpenCL spec,
> >> CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (cl_ulong) is the max size in
> >> bytes of a constant buffer allocation, and the minimum value is
> >> 64 KB for devices that are not of type CL_DEVICE_TYPE_CUSTOM.
> >>
> >> So there is a limit on total constant buffer usage.
> >> Beignet currently sets it to 512KB, but this is not a hard limitation
> >> on the Gen platform. We may consider increasing it to a higher
> >> threshold. Do you have any suggestions?
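[The 512KB limit discussed above can be checked at runtime with clGetDeviceInfo. A minimal host-side sketch; it assumes an OpenCL runtime with at least one GPU device is installed, and omits error checking for brevity:]

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong max_const = 0;

    /* Grab the first platform/GPU device; a real app would enumerate them. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Max constant buffer allocation; the spec guarantees >= 64KB
     * for non-CUSTOM devices. Beignet reports 512KB here. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const), &max_const, NULL);
    printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: %llu bytes\n",
           (unsigned long long)max_const);
    return 0;
}
```

[Compile with -lOpenCL; the printed value is what the driver enforces, not necessarily the hardware cache size.]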
> >>
> >> On Wed, Nov 26, 2014 at 03:04:03AM +0000, Song, Ruiling wrote:
> >> > I am not an expert on cache-related things; basically the constant
> >> > cache is part of the read-only cache that lies in L3.
> >> > From the code in src/intel/intel_gpgpu.c, the logic below is for
> >> > IvyBridge:
> >> >   if (slmMode)
> >> >     allocate 64KB constant cache
> >> >   else
> >> >     allocate 32KB constant cache
> >> >
> >> > I am not sure whether there is any big performance difference
> >> > between using less than or more than the real constant cache size
> >> > in L3. I simply wrote a randomly selected number, 512KB, as the
> >> > upper limit in the driver API.
> >> > But it does deserve investigation how performance changes with the
> >> > amount of constant memory used.
> >> > If we use more constant memory than the constant cache allocated
> >> > from L3, I think it will definitely cause the constant cache to
> >> > swap data in and out frequently. Right?
> >> > If you would like to contribute any performance test to Beignet, or
> >> > any other open source test suite, it would be really appreciated!
> >> >
> >> > Thanks!
> >> > Ruiling
> >> >
> >> > From: Beignet [mailto:[email protected]] On
> >> > Behalf Of Tony Moore
> >> > Sent: Wednesday, November 26, 2014 6:45 AM
> >> > To: [email protected]
> >> > Subject: Re: [Beignet] Combine Loads from __constant space
> >> >
> >> > Another question I had about __constant: there seems to be no
> >> > limit. I'm using __constant for every read-only parameter now,
> >> > totalling 1500Kb, and this test now runs in 32ms. So, is there a
> >> > limit? Is this method reliable? Can the driver do this implicitly
> >> > on all read-only buffers?
> >> > thanks
> >> >
> >> > On Tue Nov 25 2014 at 2:11:26 PM Tony Moore <[email protected]> wrote:
> >> > Hello,
> >> > I notice that reads are not being combined when I use __constant on
> >> > a read-only kernel buffer. Is this something that can be improved?
> >> >
> >> > In my kernel there are many loads from a read-only data structure.
> >> > When I use the __global specifier for the memory space, I see a
> >> > total of 33 send instructions and a runtime of 81ms. When I use the
> >> > __constant specifier, I see 43 send instructions and a runtime of
> >> > 40ms. I'm hoping that combining the loads could improve performance
> >> > further.
> >> >
> >> > thanks!
> >> > tony
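[For readers outside the thread: the change being benchmarked is just the address-space qualifier on the kernel argument. A hypothetical OpenCL C sketch — kernel and argument names are made up for illustration, not taken from Tony's code:]

```c
/* OpenCL C (device code). Loads from a __global pointer go through the
 * general memory path; declaring the same read-only argument __constant
 * routes its loads through the constant (read-only L3) cache instead,
 * which is what produced the 81ms -> 40ms improvement in the thread. */
__kernel void scale(__constant float *coeffs,    /* was: __global const float * */
                    __global const float *in,
                    __global float *out)
{
    size_t i = get_global_id(0);
    /* coeffs is small and reused by every work-item: a good fit for
     * __constant, subject to CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE. */
    out[i] = in[i] * coeffs[i % 16];
}
```

[Note that __constant arguments count against the device's constant buffer limit, so the 512KB Beignet cap discussed above applies across all such arguments together.]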
_______________________________________________
Beignet mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/beignet
