>>-----Original Message-----
>>From: [email protected]
>>[mailto:[email protected]] On
>>Behalf Of Zou, Nanhai
>>Sent: 22 June 2011 12:29
>>To: Keith Packard; [email protected]
>>Cc: Anholt, Eric
>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>
>>>>-----Original Message-----
>>>>From: Keith Packard [mailto:[email protected]]
>>>>Sent: 22 June 2011 12:14
>>>>To: Zou, Nanhai; [email protected]
>>>>Cc: Anholt, Eric
>>>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>>>
>>>>On Wed, 22 Jun 2011 11:13:09 +0800, "Zou, Nanhai" <[email protected]>
>>>>wrote:
>>>>
>>>>> If I upload input buffer with movnti or movntdq (bypass cache) +
>>>>> sfence (clear write combine buffer) in the end, clflush should
>>>>> not be needed.
>>>>
>>>>Alas, neither of these will flush existing cached data, so you must
>>>>still use clflush to ensure that the data makes it out to memory. All
>>>>that they do is avoid consuming additional cache lines.
>>>>
>> As I understand it, with movnti + sfence the data should surely reach
>>memory. The cache should be coherent in this case.
>>
>>>>You want to use a write combining mapping, which should give you full
>>>>bandwidth access to memory without hitting any caches. You can use the GTT
>>>>mapping as the aperture is configured for write combining access, or we
>>>>can figure out how to make PAT work.
>>>>
>> map_gtt in current gem is super slow.
>> I've tried map_gtt but the speed seems to be unacceptable.
>>
>>>>> Since it is CPU read only surface, clflush is not needed at all.
>>>>
>>>>You'd still have to invalidate cache lines using clflush to avoid using
>>>>stale data in the CPU cache.
>>>>
>>>>--
>> Yes, you are right, in this case clflush is still needed to invalidate the
>>CPU cache.
>>
>> The problem is that we do not know how large the coded output buffer is
>>before we do the encoding.
>> So we have to allocate a large enough gem object before encoding. In most
>>cases the encoding result will be less than 1/10 of the safe buffer size, so
>>9/10 of the buffer is clflushed unnecessarily.
>>
>> A fast map_gtt implementation could be the best choice here.
>> Or can we clflush cache line by cache line while reading, instead of
>>flushing the entire object? This optimization will give a >40% speedup for
>>1080p encoding.
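
To make the "cache line by cache line" idea concrete, here is a minimal
sketch in C of flushing only the lines that cover the bytes actually read,
rather than the whole object. It is not the GEM implementation. The helper
names (lines_in_range, clflush_range) and the fixed 64-byte line size are
assumptions for illustration (real code would query the line size via
CPUID), and it assumes the caller holds a cached CPU mapping of the buffer
and knows how many bytes the encoder produced.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Real code should read this from CPUID. */
#define CACHE_LINE 64u

/* Hypothetical helper: number of cache lines touched by [addr, addr + len). */
static size_t lines_in_range(const void *addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t first = (uintptr_t)addr & mask;
    uintptr_t last  = ((uintptr_t)addr + len - 1) & mask;
    return (size_t)((last - first) / CACHE_LINE) + 1;
}

/*
 * Hypothetical helper: flush (write back and invalidate) only the cache
 * lines covering [addr, addr + len). The fences order the flushes against
 * surrounding loads and stores.
 */
static void clflush_range(const void *addr, size_t len)
{
    if (len == 0)
        return;
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    _mm_mfence();
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_mfence();
}
```

With these assumptions, and purely illustrative sizes, a line-aligned 1 MiB
safe buffer holding 100 KiB of coded output would need 1600 flushes instead
of 16384, which is where the saving for the unused 9/10 of the object would
come from.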
>>Thanks
>>Zou Nanhai
>>
>>>>[email protected]
>>_______________________________________________
>>Intel-gfx mailing list
>>[email protected]
>>http://lists.freedesktop.org/mailman/listinfo/intel-gfx
