Hi,

On Mon, Jul 23, 2018 at 5:58 AM, Eric Auer <e.a...@jpberlin.de> wrote:
>
> Hi! I am not sure whether I understand your method, so
> maybe you can explain it in more detail. Is the alpha
> mask 1 byte per pixel, either 00 or ff per pixel? The
> multiplication is costly.

Since when is MUL costly? Or is it only because you're doing it for
every pixel (i.e. thousands of times)? I know he's targeting 16-bit
machines for fun, but most people will use a 386 or newer cpu, where
MUL (etc.) has an "early out" algorithm, so it doesn't take nearly as
long as you'd think. It's still much faster than DIV, of course. The
usual workaround for MUL, going even further, is the faster (386) LEA
with scaled addressing, which does add/shift/mul in one instruction
(e.g. LEA EAX,[EAX+EAX*4] multiplies by 5). BTW, Bret mentions SHL,
which is indeed a teeny bit slower on many processors, so you may wish
to just use a simple ADD instead (e.g. ADD AX,AX in place of SHL AX,1).

* http://web.archive.org/web/20150920114055fw_/http://dflund.se/~john_e/gems/gem0009.html

But it's not worth counting cycles, even for ancient machines, until
you've finalized exactly what you're trying to do. Premature
optimization is usually a waste of time (but a bit of forethought
doesn't hurt). See Agner Fog's manuals (although for an actual
8088/8086, you might just want to email a guru like Jim Leonard).

> You can also use bit test
> and "set conditionally" (to 0 or 255) and "move
> conditionally" byte sized 386 operations, but then
> you are back to pixel wise processing. The good
> thing about conditional setting and moving is that
> you avoid conditional jumps which are always more
> time-consuming than a fixed calculation which can
> involve conditional setting and moving :-)

CMOVxx is 686 (Pentium Pro) or newer. SETxx is indeed 386 or newer, but
you can halfway fake it on 8086 / 16-bit cpus. I don't know of a
perfect example offhand, but even I've done it (barely). Basically,
you combine boolean results into one and only jump when absolutely
needed. Or else you build a mask, "or" onto it in certain cases, then
do your operation with that value (where false is a no-op). Something
like that, it's hard to explain.

BTW, jumps aren't really slow except on 8086, so newer processors
(e.g. 486) make it not worth worrying about (except maybe due to a
small cpu instruction cache, no branch prediction, a slow cpu clock,
or some other such problem).

Actually, I forgot that (barely documented) SALC is basically
SBB AL,AL, which is similar to (386) SETC AL, except that it yields
FFh or 00h where SETC yields 1 or 0. So you're basically extending a
flag result of some operation into a register, then using that mask
to do some further conditional bitwise operation.

* http://web.archive.org/web/20150920114042fw_/http://dflund.se/~john_e/gems/gem0013.html
* http://web.archive.org/web/20150920114042fw_/http://dflund.se/~john_e/gems/gem000f.html

I don't know if this explains it, but I gleaned this from some old
Usenet posting (it turns two values 0..15 in AH and AL into ASCII hex
digits):

    xchg ah,al          ; swap the two values
    cmp ah,10           ; CF=1 if AH < 10
    sbb bh,bh           ; BH = FFh if AH < 10, else 00h
    cmp al,10           ; CF=1 if AL < 10
    sbb bl,bl           ; BL = FFh if AL < 10, else 00h
    and bx,0707h        ; keep 7 only where the value is below 10
    add ax,'77'         ; 37h + 10 = 'A'
    sub ax,bx           ; 37h - 7 = 30h = '0' for values below 10

See what I mean? Here's another example I wrote myself (but it's a bit
sloppy/confusing):

    mov cx,'az'         ; check if lowercase alpha
    push cx
    call rangecheck
    ...
    ret
    int 20h
    ...
rangecheck:             ; in: (upper_limit shl 8) + lower_limit
                        ; on the stack; clobbers BX, BP, CH
    pop bp              ; return address
    pop bx              ; mov bx,[sp+2]
    push bp
.check:
    ; int3
    cmp al,bl
    sbb ch,ch           ; CH = FFh if AL < lower limit
    inc bh
    cmp al,bh
    cmc
    sbb bl,bl           ; BL = FFh if AL > upper limit
    or bl,ch
    cmp bl,1            ; set CF if BL == 0
    cmc                 ; return NC if AL within valid range
    ret

Of course, even CALL/RET is slow on the 8086, but newer cpus make that
a non-issue. Again, I don't know if he really truly cares about every
single old cpu. I only pretend to care (for fun, completeness, etc.)
because I don't even have any 8086s or similar old cpus. (But I do
heavily prefer backwards compatible software!) Even my old 486 is
disconnected, probably broken. But it doesn't hurt to be careful and
try to be compatible in software anyway (in theory).

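For anyone who reads C more easily than assembly, here is a rough sketch of the SBB mask idea behind the two snippets above. It is only a model, not the original code: the names below_mask, hex_digit and in_range are made up for illustration, and a C compiler is free to turn any of this back into jumps.

```c
#include <stdint.h>

/* Models "CMP x,limit / SBB r,r": 0xFF if x < limit, else 0x00. */
uint8_t below_mask(uint8_t x, uint8_t limit)
{
    return (uint8_t)-(x < limit);
}

/* Value 0..15 to ASCII hex digit, using the same arithmetic as the
 * Usenet snippet: add '7' (37h), then subtract 7 only when below 10. */
char hex_digit(uint8_t nibble)
{
    return (char)(nibble + '7' - (below_mask(nibble, 10) & 7));
}

/* Returns 1 when lo <= c <= hi. Two masks are OR-ed together and only
 * the combined result is tested, as in the rangecheck example. */
int in_range(uint8_t c, uint8_t lo, uint8_t hi)
{
    uint8_t too_low  = below_mask(c, lo);      /* FF if c < lo */
    uint8_t too_high = (uint8_t)-(c > hi);     /* FF if c > hi */
    return (uint8_t)(too_low | too_high) == 0;
}
```

With these, hex_digit(10) gives 'A' and in_range('m', 'a', 'z') gives 1. Unlike the assembly version, in_range has no "upper limit + 1" step, so an upper limit of 255 does not wrap around.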
>> With 4 pixels loaded in a 32-bit register:
>>
>> AND the input pixels with the alpha mask
>> SHR this result so that the bit is in position 0
>> Multiply so that this bit is expanded to a full byte of 1s
>> AND the input and screen with this mask
>> OR the modified input onto the screen

I don't think he cares as much about the 386, but it doesn't hurt to
tell him anyway. In particular, it's fairly easy (even before CPUID)
to detect the cpu at runtime (see Eric's CPULEVEL tool) ... or, if
that isn't feasible, at least let the user enable it manually via
cmdline.

So the optimal solution, if you're diligent enough, is to optimize
very frequently used routines for both 8086 and 386 (or 686 or
whatever), then pick one at startup. Dynamic cpu dispatch via function
pointers (or whatever you want to call it).
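Here is a rough C sketch of the five quoted steps combined with dispatch through a function pointer. Several things in it are my assumptions, not anything stated in the thread: bit 7 of each pixel byte is the alpha bit, a set bit means "opaque" (so the screen is ANDed with the inverted mask), and all the names are hypothetical. The has_32bit argument stands in for whatever a CPULEVEL-style check or a command-line switch would report.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALPHA_BITS 0x80808080u  /* assumed: bit 7 of every pixel byte */

/* Fallback path: one pixel per step. */
void blit_bytewise(uint8_t *screen, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* AND, SHR, multiply: 0xFF if the alpha bit is set, else 0x00 */
        uint8_t mask = (uint8_t)(((src[i] & 0x80u) >> 7) * 0xFFu);
        screen[i] = (uint8_t)((src[i] & mask) | (screen[i] & (uint8_t)~mask));
    }
}

/* Four pixels in one 32-bit value. Each byte becomes 0x01 or 0x00
 * after the shift, so multiplying by 0xFF never carries between bytes. */
uint32_t blend4(uint32_t screen, uint32_t src)
{
    uint32_t mask = ((src & ALPHA_BITS) >> 7) * 0xFFu;
    return (src & mask) | (screen & ~mask);
}

/* 32-bit path: four pixels per step, leftovers go to the fallback. */
void blit_packed(uint8_t *screen, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t s, d;
        memcpy(&s, src + i, 4);
        memcpy(&d, screen + i, 4);
        d = blend4(d, s);
        memcpy(screen + i, &d, 4);
    }
    blit_bytewise(screen + i, src + i, n - i);
}

typedef void (*blit_fn)(uint8_t *screen, const uint8_t *src, size_t n);

/* Choose the routine once at startup, then call through the pointer. */
blit_fn select_blit(int has_32bit)
{
    return has_32bit ? blit_packed : blit_bytewise;
}
```

A caller would do something like `blit_fn blit = select_blit(cpu_is_386);` once and then `blit(screen, sprite, count);` in the drawing loop, so the check stays out of the per-pixel path. Both routines produce the same bytes; only the amount of work per step differs.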