Hi Julian,

I added bzip2 applet to busybox. See http://busybox.net/
for more info about busybox.

In the process of adapting it, I found a few places where I can
optimize it a bit. However, bigger optimizations came simply
from declaring functions static, as recent gcc versions
are using register parameter passing in this case.

As a result, bzip2 in busybox is significantly faster than bzip2
built from your latest source tarball. To be more exact:
when compiled with gcc 4.2.1, and run on Athlon 64 1800 MHz
with 512K L2 cache, stock bzip2 is 26.4% slower than bbox bzip2.
(time to compress gcc-4.2.1.tar is 126.4% compared to bbox).

Small optimization opportunities I noticed and implemented
in compress.c (sorry, indentation and formatting a bit changed,
but shouldn't be hard to find these places):

                        /*
                         * Find the coding table which is best for this group,
                         * and record its identity in the selector table.
                         */
                        /*bc = 999999999;*/
                        /*bt = -1;*/
                        bc = cost[0];
                        bt = 0;
                        for (t = 1 /*0*/; t < nGroups; t++) {
                                if (cost[t] < bc) {
                                        bc = cost[t];
                                        bt = t;
                                }
                        }

And here:
"Transmit the mapping table" part is heavily modified,
temporary vector eliminated (now we use bitmap instead, fits into one register),
two loops eliminated, many one-bit bsW() calls coalesced into 16-bit ones.
Object code is much smaller:

        /*--- Transmit the mapping table. ---*/
        {
                /* bbox: optimized a bit more than in bzip2 */
                int inUse16 = 0;
                for (i = 0; i < 16; i++) {
                        if (sizeof(long) <= 4) {
                                inUse16 = inUse16*2 +
                                        ((*(uint32_t*)&(s->inUse[i * 16 + 0])
                                        | *(uint32_t*)&(s->inUse[i * 16 + 4])
                                        | *(uint32_t*)&(s->inUse[i * 16 + 8])
                                        | *(uint32_t*)&(s->inUse[i * 16 + 12])) 
!= 0);
                        } else { /* Our CPU can do better */
                                inUse16 = inUse16*2 +
                                        ((*(uint64_t*)&(s->inUse[i * 16 + 0])
                                        | *(uint64_t*)&(s->inUse[i * 16 + 8])) 
!= 0);
                        }
                }

                nBytes = s->numZ;
                bsW(s, 16, inUse16);

                inUse16 <<= (sizeof(int)*8 - 16); /* move 15th bit into sign 
bit */
                for (i = 0; i < 16; i++) {
                        if (inUse16 < 0) {
                                unsigned v16 = 0;
                                for (j = 0; j < 16; j++)
                                        v16 = v16*2 + s->inUse[i * 16 + j];
                                bsW(s, 16, v16);
                        }
                        inUse16 <<= 1;
                }
        }

Again, I do not think that observed speedup is due to these optimizations
only, they are too small.

Please note that gcc 4.2.1 disregards inline here:

static
__inline__
void bsW(...)

You need __attribute__(( always_inline )) if you want to force it.

If you want to run busybox's bzip2, do this:

svn co svn://busybox.net/trunk/busybox
cd busybox
make allnoconfig
make menuconfig
(select bzip2 in the menu, save and exit)
make

You will have busybox binary built. Rename it to bzip2.

bzip2 source is in archival/bzip2.c and archival/bz/*

Thanks for your excellent code,

Best regards, Denis.
--
vda
_______________________________________________
busybox mailing list
[email protected]
http://busybox.net/cgi-bin/mailman/listinfo/busybox

Reply via email to