Hi,

On Mon, Jul 23, 2018 at 5:58 AM, Eric Auer <e.a...@jpberlin.de> wrote:
>
> Hi! I am not sure whether I understand your method, so
> maybe you can explain it in more detail. Is the alpha
> mask 1 byte per pixel, either 00 or ff per pixel? The
> multiplication is costly.

Since when is MUL costly? Or is it only because you're doing it for
every pixel (i.e. thousands of times)? I know he's targeting 16-bit
machines for fun, but most people will use 386 or newer cpus, where MUL
(etc.) uses an "early out" algorithm, so it doesn't take nearly as long
as you'd think. Either way, it's still much faster than DIV.

Going further, the normal workaround for MUL is the faster (386+) LEA,
which does an add plus a shift (i.e. a multiply by a small constant) in
one instruction. BTW, Bret mentions SHL, which is indeed a teeny bit
slower on many processors, so for a shift by one you may wish to just
use a simple ADD instead (e.g. ADD AX,AX).

* http://web.archive.org/web/20150920114055fw_/http://dflund.se/~john_e/gems/gem0009.html
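
To make the LEA/ADD idea concrete, here is a rough C sketch of multiplying by a constant with only shifts and adds. The function names and the 320-byte row stride are just assumptions for illustration; nothing here is taken from his actual code.

```c
#include <stdint.h>

/* x * 5 as one shift plus one add: the shape of `lea eax,[eax+eax*4]`. */
static uint32_t times5(uint32_t x)
{
    return x + (x << 2);
}

/* x * 10: times5 followed by a doubling, written as an ADD (y + y)
   instead of a shift by one. */
static uint32_t times10(uint32_t x)
{
    uint32_t y = times5(x);
    return y + y;
}

/* y * 320 (hypothetical row stride of a 320-pixel-wide mode): 256 + 64. */
static uint32_t times320(uint32_t y)
{
    return (y << 8) + (y << 6);
}
```

A decent compiler will usually pick such sequences on its own, so this mostly matters when writing the assembly by hand.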

But it's not worth counting cycles, even for ancient machines, until
you've finalized exactly what you're trying to do. Premature
optimization is usually a waste of time (but a bit of forethought
doesn't hurt). See Agner Fog's manuals (although for an
actual 8088/8086, you might just want to email a guru like Jim
Leonard).

> You can also use bit test
> and "set conditionally" (to 0 or 255) and "move
> conditionally" byte sized 386 operations, but then
> you are back to pixel wise processing. The good
> thing about conditional setting and moving is that
> you avoid conditional jumps which are always more
> time-consuming than a fixed calculation which can
> involve conditional setting and moving :-)

CMOVxx needs a 686 (PPro) or newer. SETxx does need a 386 or newer, but
you can halfway fake it on 8086 / 16-bit cpus. I don't know of a perfect
example offhand, but even I've done it (barely). Basically, you combine
boolean results into one and only jump when absolutely needed. Or else
you build a mask, "or" onto it in certain cases, then do your operation
with that value (where false is a no-op). Something like that, it's hard
to explain. BTW, jumps aren't really slow except on 8086, so on newer
processors (e.g. 486) they're not worth worrying about (except maybe
with a small instruction cache, no branch prediction, a slow clock, or
some other such problem).

Actually, I forgot that the (barely documented) SALC is basically SBB
AL,AL (AL = 0FFh if CF is set, else 0), which is similar to the (386)
SETC AL, except that SETC gives 1 rather than 0FFh. So you're basically
extending a flag result of some operation into a register, then using
that mask to do some further conditional bitwise operation.

* http://web.archive.org/web/20150920114042fw_/http://dflund.se/~john_e/gems/gem0013.html
* http://web.archive.org/web/20150920114042fw_/http://dflund.se/~john_e/gems/gem000f.html
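
In case the flag-to-mask idea is easier to follow in C, here is a minimal sketch. The helper names (below_mask, select8, min8) are made up for this example, and the C compiler is of course free to emit something other than CMP/SBB for it.

```c
#include <stdint.h>

/* 0xFF if a < b, else 0x00: the value SBB r,r leaves after CMP a,b. */
static uint8_t below_mask(uint8_t a, uint8_t b)
{
    return (uint8_t)-(a < b);
}

/* Branchless select: x where mask is 0xFF, y where mask is 0x00. */
static uint8_t select8(uint8_t mask, uint8_t x, uint8_t y)
{
    return (uint8_t)((x & mask) | (y & (uint8_t)~mask));
}

/* Example use: unsigned minimum without a jump. */
static uint8_t min8(uint8_t a, uint8_t b)
{
    return select8(below_mask(a, b), a, b);
}
```

The point is only that the comparison result becomes data (a mask) instead of control flow (a jump).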

I don't know if this explains it, but I gleaned this from some old
Usenet posting:

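  xchg ah,al      ; swap the two values (each assumed to be 0..15)
  cmp ah,10       ; CF=1 if AH < 10
  sbb bh,bh       ; BH = 0FFh if AH < 10, else 0
  cmp al,10
  sbb bl,bl       ; same for AL
  and bx,0707h    ; 7 for each decimal digit, 0 for A..F
  add ax,'77'     ; add 37h to both bytes ('7'+10 = 'A')
  sub ax,bx       ; minus 7 where the value was < 10 ('7'-7 = '0')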
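
For anyone who reads C more easily, here is a rough model of what I understand that snippet to do for a single nibble. It assumes the input is already in the range 0..15; hex_digit and hex_byte are names invented for this sketch.

```c
#include <stdint.h>

/* Convert one nibble (0..15) to its uppercase ASCII hex digit without a
   branch: add '7' (0x37), then subtract 7 only when the nibble is < 10. */
static char hex_digit(uint8_t nibble)
{
    uint8_t mask = (uint8_t)-(nibble < 10);  /* cmp al,10 / sbb bl,bl    */
    uint8_t fix  = mask & 0x07;              /* and bx,0707h             */
    return (char)(nibble + '7' - fix);       /* add ax,'77' / sub ax,bx  */
}

/* Format a byte as two hex digits plus a terminating NUL. */
static void hex_byte(uint8_t value, char out[3])
{
    out[0] = hex_digit(value >> 4);
    out[1] = hex_digit(value & 0x0F);
    out[2] = '\0';
}
```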

See what I mean? Here's another example I wrote myself (but it's a bit
sloppy/confusing):

mov cx,'az' ; check if lowercase alpha
push cx
call rangecheck
...
ret
int 20h
...
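rangecheck: ; in: AL = char, stack = (upper_limit shl 8) + lower_limit
pop bp ; return address
pop bx ; i.e. mov bx,[sp+2]: BL = lower limit, BH = upper limit
push bp
.check:
; int3
cmp al,bl ; CF=1 if AL < lower
sbb ch,ch ; CH = 0FFh if below range, else 0
inc bh ; upper + 1 (wraps if upper = 0FFh)
cmp al,bh ; CF=1 if AL <= upper
cmc ; CF=1 if AL > upper
sbb bl,bl ; BL = 0FFh if above range, else 0
or bl,ch ; BL = 0 only if AL is within range
cmp bl,1 ; set CF if BL == 0
cmc ; return NC if AL within valid range
ret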
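
The same "combine two boolean results, test once" pattern, sketched in C. This is only a model of the routine above, not a translation: in_range is a made-up name, it returns 1/0 instead of the carry flag, and it compares against the upper limit directly, so it does not share the wrap-around when the upper limit is 0xFF.

```c
#include <stdint.h>

/* Returns 1 if lo <= c <= hi, else 0, without branching on each test.
   One mask for "below lo", one for "above hi", OR them, test once. */
static int in_range(uint8_t c, uint8_t lo, uint8_t hi)
{
    uint8_t below = (uint8_t)-(c < lo);    /* cmp al,bl / sbb ch,ch       */
    uint8_t above = (uint8_t)-(c > hi);    /* cmp al,bh / cmc / sbb bl,bl */
    return (uint8_t)(below | above) == 0;  /* or bl,ch / cmp bl,1 / cmc   */
}
```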

Of course, even CALL/RET is slow on 8086, though newer cpus make that a
non-issue. Again, I don't know if he really truly cares about
every single old cpu. I only pretend to care (for fun, completeness,
etc.) because I don't even have any 8086s or similar old cpus. (But I
do heavily prefer backwards compatible software!) Even my old 486 is
disconnected, probably broken. But it doesn't hurt to be careful and
try to be compatible in software anyways (in theory).

>> With 4 pixels loaded in a 32-bit register:
>>
>> AND the input pixels with the alpha mask
>> SHR this result so that the bit is in position 0
>> Multiply so that this bit is expanded to a full byte of 1s
>> AND the input and screen with this mask
>> OR the modified input onto the screen
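
If I read the quoted steps correctly, they amount to something like the C sketch below. Big assumption: the transparency flag is bit 7 of each pixel byte. The quoted text does not say which bit it is, so ALPHA_BIT and the function names are hypothetical. Note that the screen has to be ANDed with the inverted mask.

```c
#include <stdint.h>

#define ALPHA_BIT 0x80808080UL  /* assumption: bit 7 of each pixel is the flag */

/* Expand the per-pixel flag bit into a full-byte mask for 4 packed pixels. */
static uint32_t alpha_mask4(uint32_t src)
{
    uint32_t bits = (src & ALPHA_BIT) >> 7;  /* 0x01 in each flagged byte  */
    return bits * 0xFFu;                     /* 0x01 -> 0xFF, 0x00 -> 0x00 */
}

/* Draw 4 source pixels onto 4 screen pixels, keeping the screen wherever
   the source pixel's flag bit is clear. */
static uint32_t blend4(uint32_t src, uint32_t screen)
{
    uint32_t mask = alpha_mask4(src);
    return (src & mask) | (screen & ~mask);
}
```

If the multiply really is a worry, (bits << 8) - bits yields the same mask with a shift and a subtract.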

I don't think he cares as much about 386, but it doesn't hurt to tell
him anyways. In particular, it's fairly easy (even before CPUID) to
detect cpu at runtime (see Eric's CPULEVEL tool) ... or at least let
the user manually enable it via cmdline, if that isn't feasible. So
the optimal solution, if you're diligent enough, is to optimize very
frequently used routines for both 8086 and 386 (or 686 or whatever).
Dynamic cpu dispatch via function pointers (or whatever you want to
call it).
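
A minimal sketch of that kind of dispatch in C, just to show the shape. Everything here is hypothetical: cpu_level stands in for whatever the detection code or a cmdline switch reports (0 = 8086, 3 = 386, and so on), and fill_by4 is only a stand-in for a real 386-optimized routine.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*fill_fn)(uint8_t *dst, uint8_t value, size_t n);

/* Baseline routine: one byte at a time, fine for any cpu. */
static void fill_bytes(uint8_t *dst, uint8_t value, size_t n)
{
    while (n--)
        *dst++ = value;
}

/* Stand-in for a 386 routine: four bytes per iteration, then the tail. */
static void fill_by4(uint8_t *dst, uint8_t value, size_t n)
{
    for (; n >= 4; n -= 4, dst += 4) {
        dst[0] = value; dst[1] = value; dst[2] = value; dst[3] = value;
    }
    fill_bytes(dst, value, n);
}

/* Pick a routine once at startup; callers then go through the pointer. */
static fill_fn select_fill(int cpu_level)
{
    return cpu_level >= 3 ? fill_by4 : fill_bytes;
}
```

The hot loop then calls through the pointer and never re-checks the cpu type.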

_______________________________________________
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel
