Re: [E-devel] patch - imlib2 blend in AMD64

Tres Melton Tue, 23 Aug 2005 12:37:30 -0700

Somehow the CC of the following never made it to the list.  Here it is again.



On Tue, 2005-08-23 at 11:30 +0900, Carsten Haitzler wrote:
> On Mon, 22 Aug 2005 19:43:58 +0000 Tiago Victor Gehring
> <[EMAIL PROTECTED]> babbled:

Lots of people said..... and then raster said:

> actually do tests - you may find the unaligned copies  not that much slower as
> traditionally x86 hw has always done the fixups for unaligned read/writes in
> hardware and thus the overhead is fairly small.

Tests are needed and so is some discussion.  In relation to the above
topics:


Mornin' all,


        Okay, here's the deal.  I'm going to talk about some stupid shit that
everyone already knows and then you guys can call me an idiot, jerky,
or whatever.  Raster et al., please correct me if I am wrong, and my
apologies to everyone for the review.

        To review: the problem is that the hardware needs to be accounted
for.  It is impossible to load a single byte from RAM into cache, just
as you can't read a single byte from the hard disk.  We all know that
you can read( ?, ?, 1 ) but we also know that when the kernel gets the
call it reads a block, returns the character asked for and holds the
rest in a buffer.  The same thing happens in hardware.  The memory chips
have all the wires coming into them and their controller chip watches
them and responds upon the proper signal, ALE.  Address Latch Enable
(ALE) means that the CPU is asking for the memory at this address.  It
is actually a wire on the cpu/mobo: the voltage hits 5V (in my day, now
~2.5) and that signal hits the memory controller chip and causes it to
start its cycle (ANDed with the timer wire so it starts at the next
clock tick).  That memory controller interconnects RAM and L[123]_Cache
and they all have to deal with RAM in chunks/pages/lines/etc.  The
number of bytes in a chunk is always a function of the number of wires
internally/externally (e.g. 386sx = 32bit chip on 16bit bus).  Add all
this shit up, and basically what you have is memory that is byte
addressable on the software level but much more complex on the hardware
level.

        When the CPU asks for a byte of memory the controller will return a
line that contains the byte in question.  When the CPU asks for a value
that is multiple bytes it will be returned in one or more cache lines of
memory.  If the CPU can be assured that the data in question is aligned
on the same boundary as the storage location, then it can manipulate the
wires once to move the data.  If it is not so aligned then it must load
the data in 2+ chunks.  The plus is mooted by the instruction set (it
doesn't handle data types larger than the register size to/from the SIMD
core), so two chunks is the worst case.  An aligned move can be done in
a single tick (to/from registers and L1 cache), so it is a waste of a
cycle to check for alignment on every move because you could have moved
the unaligned data by then.  The two-cycle figure is contingent upon the
fix-ups that raster mentioned above being efficient, and I'm not sure
exactly how the hardware does it.  It could rotate the address of the
source wire to achieve alignment or simply take two cycles and move the
data in pieces.
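
To put numbers on the aligned versus unaligned case above, here is a
minimal sketch in C that counts how many hardware chunks one access
touches.  The helper name chunks_touched and the 64-byte line size used
in the examples are assumptions for illustration only; they are not
taken from imlib2 or from any particular CPU manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of chunk-sized blocks (cache lines, bus
 * words, ...) covered by an access of `size` bytes starting at `addr`.
 * `chunk` must be a power of two, e.g. 64 for an assumed 64-byte line. */
size_t chunks_touched(uintptr_t addr, size_t size, size_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
    if (size == 0)
        return 0;
    uintptr_t first = addr / chunk;              /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) / chunk; /* block holding the last byte  */
    return (size_t)(last - first + 1);
}
```

With a 64-byte chunk, a 16-byte access at offset 48 stays inside one
chunk, while the same access at offset 56 spills into a second one.
That second chunk is the "2" in "2+" above.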

        It is therefore appropriate for us to check once at the beginning of
the image.  The preferable solution is to guarantee alignment upon
entry; otherwise we are going to have to use unaligned memory moves or
have two pieces of code.  In order to achieve an alignment guarantee we
need to control how *image is created and ensure that it is a pointer
that never trips "if ( image % alignment ) then do unaligned_stuff".
This is a function of the compiler and other things and can be
accomplished with __attribute__ ((aligned)) in C and the .align
directive in asm with the GNU tools.  I haven't investigated all of the
possibilities here, so I know that the facilities exist but am not
entirely positive of the calling convention or implementation.  Anyway,
all of the image_load type of functions (or the one image_create, or
whatever it's named, that is called by all others) need to be rewritten
to ensure alignment, and we would still need to check for the
possibility of a user-created image that gives an unaligned pointer to
the pixels.  In order to avoid the SIGSEGV we would have to check for
alignment and maybe call a function with unaligned moves, re-align the
data, or error out in that case.
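
A rough sketch of the "check once, then pick a path" idea, assuming a
16-byte boundary (the width of one SSE register) and a POSIX system that
provides posix_memalign.  The names PIXEL_ALIGN, is_aligned, alloc_pixels
and choose_path are made up for this example and are not imlib2
functions; the real allocation points in imlib2 would need to be found
first.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdint.h>
#include <stdlib.h>

#define PIXEL_ALIGN 16  /* assumed: 16 bytes, one SSE register */

/* Nonzero when p sits on a PIXEL_ALIGN boundary. */
int is_aligned(const void *p)
{
    return ((uintptr_t)p % PIXEL_ALIGN) == 0;
}

/* Hypothetical allocator for w*h 32-bit pixels; release with free().
 * Overflow checking of w*h is omitted to keep the sketch short. */
uint32_t *alloc_pixels(size_t w, size_t h)
{
    void *p = NULL;
    if (posix_memalign(&p, PIXEL_ALIGN, w * h * sizeof(uint32_t)) != 0)
        return NULL;
    return p;
}

/* Check once per image, then pick a path.  The two values stand in for
 * the aligned and unaligned versions of the blend loop. */
enum blend_path { BLEND_ALIGNED, BLEND_UNALIGNED };

enum blend_path choose_path(const void *src, const void *dst)
{
    return (is_aligned(src) && is_aligned(dst)) ? BLEND_ALIGNED
                                                : BLEND_UNALIGNED;
}
```

An image allocated through alloc_pixels would always take the aligned
path; a user-supplied pointer that is off by even one pixel would fall
through to the unaligned path instead of faulting.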

        I'm not real familiar with the imlib2 code, and more importantly, how
it is used, so that is why I'm mentioning things like this.  For those
of you that know the internals, what do you propose?  The "works for
all" solution is to just use unaligned memory accesses.  The "faster
than all others" is going to need fully aligned memory, pre-fetched
caches (already in there), and most of all predictability.
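
Since raster's advice was to actually test, here is a rough sketch of
how the two cases could be timed in plain C.  The names sum_words and
time_passes are invented for this example, and it does not use the
imlib2 blend code or any SIMD instructions, so treat any numbers it
produces as a first hint only.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Sum n 32-bit words starting at p.  memcpy keeps the load legal in C
 * even when p is not 4-byte aligned. */
uint32_t sum_words(const unsigned char *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t v;
        memcpy(&v, p + 4 * i, sizeof v);
        sum += v;
    }
    return sum;
}

/* CPU seconds spent on `reps` passes over the buffer.  The running
 * total is stored through `sink` so the result is actually used. */
double time_passes(const unsigned char *p, size_t n, int reps, uint32_t *sink)
{
    uint32_t total = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        total += sum_words(p, n);
    *sink = total;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_passes once on an aligned buffer and once on the same
buffer offset by one byte gives the two figures to compare.  An
optimizing compiler may simplify the repeat loop, so it is worth looking
at the generated assembly before trusting the timings.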

        Comments, please...

Cheers,
The River Rat



