On 03/18/2010 03:11 PM, Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> Reading the file header (e.g. the first 512 bytes) and then matching
>> against archive signatures is, I think, a very nice touch. (I was only
>> thinking of matching by file name.) There is a mild complication - you
>> can't close and reopen the archive, so you need to pass those 512
>> bytes to the archiver along with the rest of the stream. This is
>> because the stream may not be rewindable, as is the case with pipes.
> The reasons for reading the file to determine the archive type are:
> 1. Files sometimes lose their extensions when being transferred around.
> I sometimes have this problem when downloading files from the internet -
> Windows will store the file without an extension.
> 2. Sometimes I have to remove the extension when sending a file via
> email, as stupid email readers block certain messages based on the
> file attachment's extension.
> 3. People don't always put the right extension on the file.
> 4. Passing an archive of one type to a reader for another type causes
> the reader to crash (yes, I know, readers should be more robust than
> that, but reality is reality).
Makes sense.
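To make the sniffing idea concrete, here is a rough sketch - in Python purely for illustration, since the library under discussion would be written in D. The signature table, the `sniff` function, and the `ChainedReader` class are all invented names; the point is detecting the archive type from the first bytes and then handing those consumed bytes back in front of the rest of a non-seekable stream, exactly because a pipe cannot be rewound:

```python
import io

# Magic-number prefixes for a few common archive formats (illustrative,
# not exhaustive). Note that tar's "ustar" magic sits at offset 257,
# which is one reason to sniff a fixed-size header (e.g. 512 bytes)
# rather than just the first few bytes.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
}

class ChainedReader(io.RawIOBase):
    """Serves the already-consumed header bytes first, then the rest of
    the underlying stream - a stand-in for 'un-reading' on a pipe."""
    def __init__(self, head, tail):
        self.head, self.pos, self.tail = head, 0, tail
    def readable(self):
        return True
    def read(self, n=-1):
        if self.pos < len(self.head):
            chunk = self.head[self.pos:] if n < 0 else self.head[self.pos:self.pos + n]
            self.pos += len(chunk)
            return chunk
        return self.tail.read(n)

def sniff(stream, header_size=512):
    """Read a header from a possibly non-seekable stream, classify it by
    signature, and return (kind, replayable_stream) where the replayable
    stream yields the header bytes again followed by the remainder."""
    header = stream.read(header_size)
    kind = "unknown"
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            kind = name
            break
    if kind == "unknown" and header[257:262] == b"ustar":
        kind = "tar"
    return kind, ChainedReader(header, stream)
```

The archiver selected by `kind` then reads from the returned stream as if nothing had been consumed, which is the "pass those 512 bytes to the archiver along with the rest of the stream" requirement above.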
> Is it really necessary to support streaming archives?
It is not necessary, only vital.
> The reason I ask
> is we can nicely separate building/reading archives from file I/O. The
> archives can be entirely done in memory. Perhaps if an archive is being
> streamed, the program can simply accumulate it all in memory, then call
> the archive library functions.
This is completely nonscalable! 90% of all my archive manipulation
involves streaming, and I wouldn't dream of loading most of those files
into RAM. They are huge!
I paste from a script I'm working on right now:
if [[ ! -f $D/sentences.num.gz ]]; then
    echo '# Text to numeric...'
    ./txt2num.d $D/voc.txt \
        < <(pv $D/sentences.txt.gz | gunzip) \
        > >(gzip >$D/sentences.num.tmp.gz)
    mv $D/sentences.num.tmp.gz $D/sentences.num.gz
fi
That takes a good amount of time to run because the .gz involved is
2,180,367,456 bytes _after_ compression. Note how compression is
involved both ways - the input is gunzipped on read and the output
gzipped on write.
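That gunzip-process-gzip shape is easy to express in constant memory in any language. A minimal sketch in Python (function and file names invented), which never materializes more than one line of the potentially multi-gigabyte decompressed data:

```python
import gzip

def transform_stream(src_path, dst_path, transform):
    """Decompress src_path, apply `transform` to each line, and write
    the result gzip-compressed to dst_path - streaming end to end, in
    constant memory, mirroring the gunzip | txt2num | gzip pipeline."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:            # reads one line at a time
            dst.write(transform(line))
```

Whatever the language, the key property is the same as in the shell pipeline: memory use is bounded by the record size, not by the file size.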
It would be great if we all went to the utmost possible lengths to
distance ourselves from such nonscalable thinking. It's the root reason
why the wc sample program on digitalmars.com is _inappropriate_ and
_damaging_ to the reputation of the language, and also the reason why
the hash table implementation performs so poorly on large data - i.e.,
exactly when it matters. It's the kind of thinking that stems from "But
I don't have _one_ file larger than 1GB anywhere on my hard drive!",
which you have repeatedly offered as if it were a solid argument. Well,
if you don't have one, you'd better get some.
Nobody's going to give us a cookie for processing 50KB files 10 times
faster than Perl or Python. Where it does matter is large data, and I'd
be in a much better mood if I didn't feel my beard growing while waiting
next to a program that uses hashes to build a large index file.
Andrei