Re: bunzip2 fails to decompress pbzip2-compressed files

Denys Vlasenko Sun, 31 Oct 2010 18:06:16 -0700

On Sunday 31 October 2010 23:17, Rob Landley wrote:
> On Thursday 28 October 2010 16:10:35 Denys Vlasenko wrote:
> > On Friday 22 October 2010 20:07, Dan Fandrich wrote:
> > > pbzip2 is a parallel bzip2 compressor that uses multiple threads while
> > > compressing to linearly speed bzip2 compression by the number of cores
> > > available (see http://compression.ca/pbzip2/). The files it produces
> > > are compatible with traditional bzip2, but have slightly different output
> > > because of the way the independently-compressed blocks are concatenated.
> > >
> > > Unfortunately, this seems to prevent Busybox's bunzip2 from decompressing
> > > them fully. Busybox decompresses the first block, then silently stops
> > > without indicating any sort of error.  It can be reproduced like this:
> > >
> > > $ cp /lib/libc-2.9.so bigfile   #need a file larger than 900K
> > > $ pbzip2 -9 bigfile
> > > $ bzcat bigfile.bz2 | wc -c
> > > 1331404
> > > $ busybox bzcat bigfile.bz2 | wc -c
> > > 900000
> > > $ busybox bunzip2 bigfile.bz2
> > > $ echo $?
> > > 0
> > > $ stat -c %s bigfile
> > > 900000
> 
> So pbzip2 is producing nonstandard files that work by coincidence on the old 
> implementation, and we need to change to accomodate this.  Ok.


I was surprised too when one day users filed
a bug "busybox is unable to read concatenated gz files"...

I googled it and discovered that it is sort-of standard feature, not a fluke,
that cz (and bz2) files can be just cat'ed together.

It's just not a well known fact.

> >-#define RETVAL_SHORT_WRITE              (-4)
> >+//#define RETVAL_SHORT_WRITE              (-4)
> 
> Are you saying that if the disk fills up or similar, not being able to detect 
> this is an improvement?  Or are you saying the wrapper functions will die() 
> for us, which they didn't used to do?

No, it's just me adding bugz :D  Fixing:

-//#define RETVAL_SHORT_WRITE              (-4)
+#define RETVAL_SHORT_WRITE              (-4)
 #define RETVAL_DATA_ERROR               (-5)
 #define RETVAL_OUT_OF_MEMORY            (-6)
 #define RETVAL_OBSOLETE_INPUT           (-7)
@@ -745,6 +745,7 @@ unpack_bz2_stream(int src_fd, int dst_fd)
                                        break;
                                if (i != full_write(dst_fd, outbuf, i)) {
                                        bb_error_msg("short write");
+                                       i = RETVAL_SHORT_WRITE;
                                        goto release_mem;


> >-int FAST_FUNC start_bunzip(bunzip_data **bdp, int in_fd, const unsigned
> > char *inbuf, -                                              int len)
> >+int FAST_FUNC start_bunzip(bunzip_data **bdp, int in_fd,
> >+            const void *inbuf, int len)
> 
> inbuf is an unsigned char array.  Honest and truly it is.  Specifically, if 
> you 
> don't say "unsigned" you introduce bugs on different platforms based on 
> whether 
> or not "char" defaults to unsigned, which is why I was careful to get it 
> right 
> and be consistent about it.

In parameter, having [const] void* allows _callers_ to avoid casting the 
pointer.
The same patch removes two such casts.


> What's all this FAST_FUNC nonsense anyway?  If the compiler can't optimize, 
> fix 
> the compiler.  This is DEEPLY NOT OUR PROBLEM.

Eh... i386 stack-passing thing isn't easy to fix "globally".
This is the best I managed to come up with.


> 
> >     /* Init the CRC32 table (big endian) */
> >     crc32_filltable(bd->crc32Table, 1);
> >@@ -652,37 +654,59 @@ IF_DESKTOP(long long) int FAST_FUNC
> > unpack_bz2_stream(int src_fd, int dst_fd)
> > {
> >     IF_DESKTOP(long long total_written = 0;)
> >+            bunzip_data *bd;
> >     char *outbuf;
> 
> Why does the indentation change here?

This is wrong. Fixing.


> I still viscerally cringe when I see while (1) instead of for(;;).  I'm aware 
> that modern optimizers take it out, but when there is a way to state exactly 
> what you want the code to do and you choose to instead say something you 
> _don't_ want the code to actually look like, I don't understand why.  Oh well.

Will it make you feel better if I will use for(;;) for this idiom?


> >+            i = start_bunzip(&bd, src_fd, outbuf + 2, len);
> >+
> >+            if (i == 0) {
> >+                    while (1) { /* "Produce some output bytes" loop */
> >+                            i = read_bunzip(bd, outbuf, IOBUF_SIZE);
> >+                            if (i <= 0)
> >+                                    break;
> >+                            if (i != full_write(dst_fd, outbuf, i)) {
> >+                                    bb_error_msg("short write");
> >+                                    goto release_mem;
> >+                            }
> >+                            IF_DESKTOP(total_written += i;)
> 
> If you say:
> 
>   if (DESKTOP) total_written += i;
> 
> The result will always be syntax checked.  (Local variables that are never 
> used get optimized out.)

"Never used" part is easy to get wrong. One single forgotten IF_DESKTOP()
around "total_written += i;" and the variable is suddenly isn't "never used"
anymore.

I want compiler to check me. Therefore *I do not declare* total_written,
and compiler will scream at every use.


> IF_DESKTOP(xxx) is more concise way of doing an #ifdef.  I originally came up 
> with it to chop out varargs in functions without needing multi-line #ifdef 
> blocks for each one.  As with all #ifdefs, it introduces the possibility that 
> different configurations will build break.  It's also yet another non-obvious 
> non-C construct that we have to EXPLAIN to people.  Not explaining why doing 
> it that way is smaller or faster or more portable, but explaining what the 
> heck it does in the first place and how to touch the code without breaking 
> the 
> increasingly brittle build.
> 
> Why on earth are there archival/bbunzip_test*.sh files outside of the test 
> suite directory?

There was a time when I did not use testsuite. These are vestigial bits
of my ad-hoc tests. I need to move them to testsuite.


> Why are there half a dozen archival/lzo*.c files in a directory that used to 
> just have things like "rpm.c", "tar.c" with all the actual compression code 
> living in libunarchive?

archival/lzo*.c and archival/bz/* are not _un_archivers. But yes, they
can be moved to libunarchive. Come to think about it, this is
a good idea.


> I'm sorry, I can't look at this anymore.  It's not fun.  I'm going to go 
> write 
> a shell script version of zapchroot rather than trying to make busybox umount 
> -a do it, and just stop bothering you all.  Go have fun writing fractal code 
> nobody else can understand.  Micro-optimize until it shatters.  Outdo the FSF.
> 
> I was in it for the simple, and there is no more simple here.  It's not a 
> question of doing cleanup, it's a question of the project's goals clearly no 
> longer valuing "simple" or "clean" even slightly.  The code gets uglier every 
> release.  As far as I can tell the only reason you're still writing in C is 
> because hand-coded assembly isn't portable.

Which brings us to a meta-question:

Obviously, different people are good at (slightly) different things
(optimize for -O2 versus optimize for -Os etc);
and also they like different styles (while(1) versus for(;;),
// versus /* */, 4-char tabs versus 8-char ones, placement of {},
levels of optimizations ("how ugly are we ready to make our code
to squeeze out that last microsecond/byte"), etc etc etc.

How to make them cooperate even though they do not 100% agree on
"how to do things"?

"Everyone goes to his own corner and play only with his toys" is
a solution, yes, but it does not scale: you can't reimplement
everything, not alone, not in one lifetime. If you do want
to reimplement coreutils, util-linux, etc, + libc + maybe even bits
of kernel, you have to work with other people.

You don't have to work with really nasty guys (let's not point
at well-known examples here), but OTOH you can't expect people
will magically code in a way which is EXACTLY matching
your views of proper balance between "simple", "clean", "fast",
"small", "compatible" etc without (at a minimum) receiving
your comments. How would they know they strayed away from
"one true path"?


-- 
vda
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox

Re: bunzip2 fails to decompress pbzip2-compressed files

Reply via email to