On Sunday 31 October 2010 23:17, Rob Landley wrote:
> On Thursday 28 October 2010 16:10:35 Denys Vlasenko wrote:
> > On Friday 22 October 2010 20:07, Dan Fandrich wrote:
> > > pbzip2 is a parallel bzip2 compressor that uses multiple threads while
> > > compressing to linearly speed bzip2 compression by the number of cores
> > > available (see http://compression.ca/pbzip2/). The files it produces
> > > are compatible with traditional bzip2, but have slightly different output
> > > because of the way the independently-compressed blocks are concatenated.
> > >
> > > Unfortunately, this seems to prevent Busybox's bunzip2 from decompressing
> > > them fully. Busybox decompresses the first block, then silently stops
> > > without indicating any sort of error. It can be reproduced like this:
> > >
> > > $ cp /lib/libc-2.9.so bigfile #need a file larger than 900K
> > > $ pbzip2 -9 bigfile
> > > $ bzcat bigfile.bz2 | wc -c
> > > 1331404
> > > $ busybox bzcat bigfile.bz2 | wc -c
> > > 900000
> > > $ busybox bunzip2 bigfile.bz2
> > > $ echo $?
> > > 0
> > > $ stat -c %s bigfile
> > > 900000
>
> So pbzip2 is producing nonstandard files that work by coincidence on the old
> implementation, and we need to change to accomodate this. Ok.
I was surprised too when one day users filed
a bug "busybox is unable to read concatenated gz files"...
I googled it and discovered that it is sort-of standard feature, not a fluke,
that cz (and bz2) files can be just cat'ed together.
It's just not a well known fact.
> >-#define RETVAL_SHORT_WRITE (-4)
> >+//#define RETVAL_SHORT_WRITE (-4)
>
> Are you saying that if the disk fills up or similar, not being able to detect
> this is an improvement? Or are you saying the wrapper functions will die()
> for us, which they didn't used to do?
No, it's just me adding bugz :D Fixing:
-//#define RETVAL_SHORT_WRITE (-4)
+#define RETVAL_SHORT_WRITE (-4)
#define RETVAL_DATA_ERROR (-5)
#define RETVAL_OUT_OF_MEMORY (-6)
#define RETVAL_OBSOLETE_INPUT (-7)
@@ -745,6 +745,7 @@ unpack_bz2_stream(int src_fd, int dst_fd)
break;
if (i != full_write(dst_fd, outbuf, i)) {
bb_error_msg("short write");
+ i = RETVAL_SHORT_WRITE;
goto release_mem;
> >-int FAST_FUNC start_bunzip(bunzip_data **bdp, int in_fd, const unsigned
> > char *inbuf, - int len)
> >+int FAST_FUNC start_bunzip(bunzip_data **bdp, int in_fd,
> >+ const void *inbuf, int len)
>
> inbuf is an unsigned char array. Honest and truly it is. Specifically, if
> you
> don't say "unsigned" you introduce bugs on different platforms based on
> whether
> or not "char" defaults to unsigned, which is why I was careful to get it
> right
> and be consistent about it.
In parameter, having [const] void* allows _callers_ to avoid casting the
pointer.
The same patch removes two such casts.
> What's all this FAST_FUNC nonsense anyway? If the compiler can't optimize,
> fix
> the compiler. This is DEEPLY NOT OUR PROBLEM.
Eh... i386 stack-passing thing isn't easy to fix "globally".
This is the best I managed to come up with.
>
> > /* Init the CRC32 table (big endian) */
> > crc32_filltable(bd->crc32Table, 1);
> >@@ -652,37 +654,59 @@ IF_DESKTOP(long long) int FAST_FUNC
> > unpack_bz2_stream(int src_fd, int dst_fd)
> > {
> > IF_DESKTOP(long long total_written = 0;)
> >+ bunzip_data *bd;
> > char *outbuf;
>
> Why does the indentation change here?
This is wrong. Fixing.
> I still viscerally cringe when I see while (1) instead of for(;;). I'm aware
> that modern optimizers take it out, but when there is a way to state exactly
> what you want the code to do and you choose to instead say something you
> _don't_ want the code to actually look like, I don't understand why. Oh well.
Will it make you feel better if I will use for(;;) for this idiom?
> >+ i = start_bunzip(&bd, src_fd, outbuf + 2, len);
> >+
> >+ if (i == 0) {
> >+ while (1) { /* "Produce some output bytes" loop */
> >+ i = read_bunzip(bd, outbuf, IOBUF_SIZE);
> >+ if (i <= 0)
> >+ break;
> >+ if (i != full_write(dst_fd, outbuf, i)) {
> >+ bb_error_msg("short write");
> >+ goto release_mem;
> >+ }
> >+ IF_DESKTOP(total_written += i;)
>
> If you say:
>
> if (DESKTOP) total_written += i;
>
> The result will always be syntax checked. (Local variables that are never
> used get optimized out.)
"Never used" part is easy to get wrong. One single forgotten IF_DESKTOP()
around "total_written += i;" and the variable is suddenly isn't "never used"
anymore.
I want compiler to check me. Therefore *I do not declare* total_written,
and compiler will scream at every use.
> IF_DESKTOP(xxx) is more concise way of doing an #ifdef. I originally came up
> with it to chop out varargs in functions without needing multi-line #ifdef
> blocks for each one. As with all #ifdefs, it introduces the possibility that
> different configurations will build break. It's also yet another non-obvious
> non-C construct that we have to EXPLAIN to people. Not explaining why doing
> it that way is smaller or faster or more portable, but explaining what the
> heck it does in the first place and how to touch the code without breaking
> the
> increasingly brittle build.
>
> Why on earth are there archival/bbunzip_test*.sh files outside of the test
> suite directory?
There was a time when I did not use testsuite. These are vestigial bits
of my ad-hoc tests. I need to move them to testsuite.
> Why are there half a dozen archival/lzo*.c files in a directory that used to
> just have things like "rpm.c", "tar.c" with all the actual compression code
> living in libunarchive?
archival/lzo*.c and archival/bz/* are not _un_archivers. But yes, they
can be moved to libunarchive. Come to think about it, this is
a good idea.
> I'm sorry, I can't look at this anymore. It's not fun. I'm going to go
> write
> a shell script version of zapchroot rather than trying to make busybox umount
> -a do it, and just stop bothering you all. Go have fun writing fractal code
> nobody else can understand. Micro-optimize until it shatters. Outdo the FSF.
>
> I was in it for the simple, and there is no more simple here. It's not a
> question of doing cleanup, it's a question of the project's goals clearly no
> longer valuing "simple" or "clean" even slightly. The code gets uglier every
> release. As far as I can tell the only reason you're still writing in C is
> because hand-coded assembly isn't portable.
Which brings us to a meta-question:
Obviously, different people are good at (slightly) different things
(optimize for -O2 versus optimize for -Os etc);
and also they like different styles (while(1) versus for(;;),
// versus /* */, 4-char tabs versus 8-char ones, placement of {},
levels of optimizations ("how ugly are we ready to make our code
to squeeze out that last microsecond/byte"), etc etc etc.
How to make them cooperate even though they do not 100% agree on
"how to do things"?
"Everyone goes to his own corner and play only with his toys" is
a solution, yes, but it does not scale: you can't reimplement
everything, not alone, not in one lifetime. If you do want
to reimplement coreutils, util-linux, etc, + libc + maybe even bits
of kernel, you have to work with other people.
You don't have to work with really nasty guys (let's not point
at well-known examples here), but OTOH you can't expect people
will magically code in a way which is EXACTLY matching
your views of proper balance between "simple", "clean", "fast",
"small", "compatible" etc without (at a minimum) receiving
your comments. How would they know they strayed away from
"one true path"?
--
vda
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox