Re: [E-devel] patch - imlib2 blend in AMD64

The Rasterman Tue, 23 Aug 2005 09:07:35 -0700

On Tue, 23 Aug 2005 07:17:42 -0600 Tres Melton <[EMAIL PROTECTED]> babbled:


> On Tue, 2005-08-23 at 11:30 +0900, Carsten Haitzler wrote:
> > On Mon, 22 Aug 2005 19:43:58 +0000 Tiago Victor Gehring
> > <[EMAIL PROTECTED]> babbled:
> 
> Lots of people said..... and then raster said:
> 
> > actually do tests - you may find the unaligned copies  not that much slower
> > as traditionally x86 hw has always done the fixups for unaligned
> > read/writes in hardware and thus the overhead is fairly small.
> 
> Tests are needed and so is some discussion.  In relation to the above
> topics:
> 
> 
> Mornin' all,
> 
> 
>       Okay, here's the deal.  I'm going to talk about some stupid shit that
> everyone already knows and then you guys can call me an idiot, jerky,
> or whatever.  Raster et al., please correct me if I am wrong and my
> apologies for the review to everyone.
> 
>       To review, okay the problem is that the hardware needs to be accounted
> for.  It is impossible to load a byte from RAM into cache just like you
> can't read a byte from the hard disk.  We all know that you can
> read( ?, ?, 1 ) but we also know that when the kernel gets the call it
> reads a block, returns the character asked for and holds the rest in a
> buffer.  That is what happens in hardware.  The chips have all the wires
> coming into them and their controller chip watches them and upon the
> proper signal, ALE, it will respond.  The Address Latch Enable (ALE)
> means that the CPU is asking for the memory at this address and is
> actually a wire on the cpu/mobo and the voltage hits 5V (in my day, now
> ~2.5) and that current hits the memory controller chip and causes it to
> start its cycle (ANDed with the timer wire so it starts at the next
> clock tick).  That memory controller interconnects RAM and L[123]_Cache
> and they all have to deal with ram in chunks/pages/lines/etc and the
> number of bytes in a chunk is always a function based on the number of
> wires internally/externally (ie. 386sx = 32bit chip on 16bit bus).  Add
> all this shit up, and basically what you have is memory that is byte
> addressable on the software level but much more complex on the hardware
> level.  
> 
>       When the CPU asks for a byte of memory the controller will return a
> line that contains the byte in question.  When the CPU asks for a value
> that is multiple bytes it will be returned in one or more cache lines of
> memory.  If the CPU can be assured that the data in question is aligned
> with the same alignment as the storage location is then it can
> manipulate the wires once to move the data.  If it is not so aligned
> then it must load the data in 2+ chunks.  The plus is mooted by the
> instruction set (it doesn't handle data types larger than the register
> size to/from the SIMD core).  This data move can be done in a single
> tick (to/from registers and L1 cache) so it is a waste of a cycle to
> check for alignment because you could have moved the unaligned data by
> then.  The two cycles are contingent upon the fix-ups that raster
> mentioned above being efficient and I'm not sure exactly how the
> hardware does it.  It could roll the address of the source wire to
> achieve alignment or simply take two cycles and move the data in pieces.
> 
>       It is therefore appropriate for us to check once at the beginning of
> the image.  The preferable solution is to guarantee alignment upon entry
> otherwise we are going to have to use unaligned memory moves or have two
> pieces of code.  In order to achieve an alignment guarantee we need to
> control how *image is created and ensure that it is a pointer that fits
> "if ( image % alignment ) then do unaligned_stuff".  This is a function
> of the compiler and other things and can be accomplished with the
> aligned attribute (__attribute__((aligned))) in C and the .align
> directive in asm with the GNU tools.  I haven't investigated all of the
> possibilities here so I know
> that the functions exist but am not entirely positive of the calling
> convention nor implementation.  Anyway, all of the image_load type of
> functions (or the one image_create ,or whatever it's named, that is
> called by all others) need to be rewritten to ensure alignment and we
> would still need to check for the possibility of a user created image
> that gives an unaligned pointer to the pixels.  In order to avoid the
> SIGSEGV we would have to check for alignment and maybe call a function with
> unaligned moves, re-align the data, or error out in that case.  
> 
>       I'm not real familiar with the imlib2 code, and more importantly, how
> it is used, so that is why I'm mentioning things like this.  For those
> of you that know the internals, what do you propose?  The "works for
> all" solution is to just use unaligned memory accesses.  The "faster
> than all others" is going to need fully aligned memory, pre-fetched
> caches (already in there), and most of all predictability.
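
For reference, the one-time alignment check described in the quoted text could
look roughly like this in C. It is only a sketch: pixels_until_aligned is a
made-up helper name (not an imlib2 function), and it assumes 32-bit pixels
whose pointer is at least 4-byte aligned and a 16-byte (128-bit) target
alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of imlib2: returns how many leading 32-bit
   pixels must be handled one at a time before p reaches a 16-byte boundary.
   The result is capped at len so short spans never overrun. */
static size_t pixels_until_aligned(const uint32_t *p, size_t len)
{
    /* Assumption: pixel pointers are always at least 4-byte aligned. */
    assert(((uintptr_t)p & 3u) == 0);

    size_t misalign = (size_t)((uintptr_t)p & 15u);   /* 0, 4, 8 or 12 */
    size_t head = misalign ? (16u - misalign) / sizeof(uint32_t) : 0;

    return head < len ? head : len;
}
```

A caller would process that many pixels with scalar code, run the 128-bit
loop on the aligned middle, and finish any leftover pixels with scalar code
again. This only aligns one pointer; if source and destination are misaligned
by different amounts, the other side still needs unaligned accesses.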

a very good summary of how memory works there... mind you - a few things i'd add

1. modern x86 architecture has speculative fetching. basically if you read byte
1, 2, 3, 4, 5, etc. hardware picks up this pattern and goes "oooh nice!" and
starts fetching bytes 6, 7, 8, 9 etc. expecting you will need them soon - IF
the memory bus is idle and has nothing better to do. so often the advantage of
pre-fetching is nullified by hardware that's too smart for us :)

2. if you fetch a byte it will read much more than a byte off the memory chips.
often it keeps this extra around in cache and when you suddenly need more data
- guess what. it's in cache and the read requires no memory latency round-trip.
the cost is the extra decode and execution of an instruction cycle. often in
tight loops this is 1 clock cycle as its already in cache (the instruction and
the data).

3. thus if you have other things you are doing and we lose 1 cycle - you may
find an aligned read (do blend) write is maybe 2-3% faster as you spend 50
cycles doing the blend and save only 1 cycle with the aligned read. is this
worth all the extra pain? sometimes i'd wonder.

4. none of the above are always true - they vary from chip to chip, board to
board, generation to generation, manufacturer to manufacturer.

5. the only way to PROVE any of the above to satisfaction requires you do real
speed tests - write the code and benchmark it. compare it.

6. making everything 128bit aligned for amd64 will be hard as malloc only
guarantees either 32 or 64bit alignment (32bit for 32bit mode) as malloc
guarantees an allocated chunk will be aligned to the largest single datatype
the cpu handles natively.

7. making it aligned is not worth it - as when you want to process pixels at odd
pixel co-ordinates u will have unaligned accesses then anyway as all pixels in
imlib2 are 32bit aligned dwords.

8. if source and dest are off by 1 pixel (one aligned one not) you can do no
fixups to make things aligned. they will be so permanently. you have to just
deal with it.

9. given that u have to deal with unaligned cases anyway - maybe just make the
code work unaligned and do a quick test code to see what the speed diff is
aligned vs. unaligned. i'd be interested to know - but i would imagine the
difference is not huge (given past experience with this stuff over the years).
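
the quick test from point 9 could be sketched like this. it is not imlib2
code: add_pixels_aligned, add_pixels_unaligned and compare_alignment are
made-up names, the "blend" is just a saturating add standing in for real
work, and it assumes an x86/amd64 compiler with SSE2 intrinsics plus C11
aligned_alloc being available.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 / amd64 target */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Scalar per-channel saturating add, used for leftover pixels. */
static uint32_t add_sat_pixel(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((a >> shift) & 0xffu) + ((b >> shift) & 0xffu);
        out |= (sum > 0xffu ? 0xffu : sum) << shift;
    }
    return out;
}

/* Hypothetical stand-in for a blend, using unaligned 128-bit loads/stores. */
static void add_pixels_unaligned(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(d, s));
    }
    for (; i < n; i++)
        dst[i] = add_sat_pixel(dst[i], src[i]);
}

/* Same loop with aligned loads/stores; both pointers must be 16-byte aligned. */
static void add_pixels_aligned(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert((((uintptr_t)dst | (uintptr_t)src) & 15u) == 0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i s = _mm_load_si128((const __m128i *)(src + i));
        __m128i d = _mm_load_si128((const __m128i *)(dst + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_adds_epu8(d, s));
    }
    for (; i < n; i++)
        dst[i] = add_sat_pixel(dst[i], src[i]);
}

typedef void (*pixel_fn)(uint32_t *, const uint32_t *, size_t);

/* CPU time in seconds for `runs` passes of fn over n pixels. */
static double time_runs(pixel_fn fn, uint32_t *dst, const uint32_t *src,
                        size_t n, int runs)
{
    clock_t t0 = clock();
    for (int r = 0; r < runs; r++)
        fn(dst, src, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Times the aligned loop on 16-byte-aligned buffers, then the unaligned loop
   on the same buffers shifted by one pixel. Returns 0 on success, -1 if
   allocation fails. */
static int compare_alignment(size_t n, int runs,
                             double *t_aligned, double *t_unaligned)
{
    /* aligned_alloc wants a size that is a multiple of the alignment. */
    size_t bytes = ((n + 1) * sizeof(uint32_t) + 15u) & ~(size_t)15u;
    uint32_t *src = aligned_alloc(16, bytes);
    uint32_t *dst = aligned_alloc(16, bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1;
    }
    for (size_t i = 0; i < n + 1; i++) {
        src[i] = 0x01010101u;
        dst[i] = 0;
    }
    *t_aligned = time_runs(add_pixels_aligned, dst, src, n, runs);
    *t_unaligned = time_runs(add_pixels_unaligned, dst + 1, src + 1, n, runs);
    free(src);
    free(dst);
    return 0;
}
```

call compare_alignment with an image-sized n and enough runs to get a stable
number, and try a few buffer sizes (in cache vs. not). the numbers will vary
from chip to chip, so treat any single result as a data point, not a verdict.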


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [EMAIL PROTECTED]
裸好多                              [EMAIL PROTECTED]
Tokyo, Japan (東京 日本)


