On 03/18/2010 03:11 PM, Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> Reading the file header (e.g. the first 512 bytes) and then matching
>> against archive signatures is, I think, a very nice touch. (I was only
>> thinking of matching by file name.) There is a mild complication - you
>> can't close and reopen the archive, so you need to pass those 512
>> bytes to the archiver along with the rest of the stream. This is
>> because the stream may not be rewindable, as is the case with pipes.
> The reasons for reading the file to determine the archive type are:
> 1. Files sometimes lose their extensions when being transferred around.
> I sometimes have this problem when downloading files from the internet -
> Windows will store the file without an extension.
> 2. Sometimes I have to remove the extension when sending a file via
> email, as stupid email readers block certain messages based on the
> file attachment's extension.
> 3. People don't always put the right extension on the file.
> 4. Passing an archive of one type to a reader for another type causes
> the reader to crash (yes, I know, readers should be more robust than
> that, but reality is reality).
Makes sense.
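To make the sniffing idea concrete, here is a rough sketch - in Python purely for illustration, since the library under discussion would be written in D. The signature table, the `sniff` function, and the `ChainedReader` class are all invented names; the point is detecting the archive type from the first bytes and then handing those consumed bytes back in front of the rest of a non-seekable stream, exactly because a pipe cannot be rewound:

```python
import io

# Magic-number prefixes for a few common archive formats (illustrative,
# not exhaustive). Note that tar's "ustar" magic sits at offset 257,
# which is one reason to sniff a fixed-size header (e.g. 512 bytes)
# rather than just the first few bytes.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
}

class ChainedReader(io.RawIOBase):
    """Serves the already-consumed header bytes first, then the rest of
    the underlying stream - a stand-in for 'un-reading' on a pipe."""
    def __init__(self, head, tail):
        self.head, self.pos, self.tail = head, 0, tail
    def readable(self):
        return True
    def read(self, n=-1):
        if self.pos < len(self.head):
            chunk = self.head[self.pos:] if n < 0 else self.head[self.pos:self.pos + n]
            self.pos += len(chunk)
            return chunk
        return self.tail.read(n)

def sniff(stream, header_size=512):
    """Read a header from a possibly non-seekable stream, classify it by
    signature, and return (kind, replayable_stream) where the replayable
    stream yields the header bytes again followed by the remainder."""
    header = stream.read(header_size)
    kind = "unknown"
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            kind = name
            break
    if kind == "unknown" and header[257:262] == b"ustar":
        kind = "tar"
    return kind, ChainedReader(header, stream)
```

The archiver selected by `kind` then reads from the returned stream as if nothing had been consumed, which is the "pass those 512 bytes to the archiver along with the rest of the stream" requirement above.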
> Is it really necessary to support streaming archives?
It is not necessary, only vital.
> The reason I ask
> is we can nicely separate building/reading archives from file I/O. The
> archives can be entirely done in memory. Perhaps if an archive is being
> streamed, the program can simply accumulate it all in memory, then call
> the archive library functions.
This is completely nonscalable! 90% of all my archive manipulation
involves streaming, and I wouldn't dream of loading most of those files
into RAM. They are huge!
I paste from a script I'm working on right now:
if [[ ! -f $D/sentences.num.gz ]]; then
    echo '# Text to numeric...'
    ./txt2num.d $D/voc.txt \
        < <(pv $D/sentences.txt.gz | gunzip) \
        > >(gzip >$D/sentences.num.tmp.gz)
    mv $D/sentences.num.tmp.gz $D/sentences.num.gz
fi
That takes a good amount of time to run because the .gz involved is
2,180,367,456 bytes _after_ compression. Note how compression is
involved both ways - the input is gunzipped on read and the output
gzipped on write.
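That gunzip-process-gzip shape is easy to express in constant memory in any language. A minimal sketch in Python (function and file names invented), which never materializes more than one line of the potentially multi-gigabyte decompressed data:

```python
import gzip

def transform_stream(src_path, dst_path, transform):
    """Decompress src_path, apply `transform` to each line, and write
    the result gzip-compressed to dst_path - streaming end to end, in
    constant memory, mirroring the gunzip | txt2num | gzip pipeline."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:            # reads one line at a time
            dst.write(transform(line))
```

Whatever the language, the key property is the same as in the shell pipeline: memory use is bounded by the record size, not by the file size.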
It would be great if we all went to the utmost possible lengths to
distance ourselves from such nonscalable thinking. It's the root reason
why the wc sample program on digitalmars.com is _inappropriate_ and
_damaging_ to the reputation of the language, and also the reason why
the hash table implementation performs so poorly on large data - i.e.,
exactly when it matters. It's the kind of thinking that stems from "But
I don't have _one_ file larger than 1GB anywhere on my hard drive!",
which you have repeatedly offered as if it were a solid argument. Well,
if you don't have one, you'd better get some.
Nobody's going to give us a cookie for processing 50KB files 10 times
faster than Perl or Python. Where it does matter is large data, and I'd
be in a much better mood if I didn't feel my beard growing while waiting
next to a program that uses hashes to build a large index file.
Andrei