Paul Eggert [EMAIL PROTECTED] wrote:
> Jim Meyering [EMAIL PROTECTED] writes:
>> I'm probably going to change the documentation so that
>> people will be less likely to depend on being able to run
>> a separate program.  To be precise, I'd like to document
>> that the only valid values of GNUSORT_COMPRESSOR are the
>> empty string, gzip and bzip2 [*].
> This sounds extreme, particularly since gzip and bzip2 are
> not the best algorithms for 'sort' compression, where you
> want a fast compressor.  Better choices right now would
> include lzop <http://www.lzop.org/> and maybe
> QuickLZ <http://www.quicklz.com/>.
I see that streaming support in QuickLZ is available only in the very
latest beta. That makes me think it's not quite ready for use in
GNU sort. Maybe lzop is more mature. I haven't looked yet.
> The fast-compressor field is moving fairly rapidly.
> (I've heard some rumors from some of my commercial friends.)
> QuickLZ, a new algorithm, is at the top of the
> maximumcompression list right now for fast compressors; see
> <http://www.maximumcompression.com/data/summary_mf3.php>.
> I would not be surprised to see a new champ next year.
>> Then we will have the liberty to remove the exec calls and use library
>> code instead, thus making the code a little more efficient -- but mainly,
>> more robust.
> It's not clear to me that it'll be more efficient for the
> soon-to-be common case of multicore chips, since 'sort' and
> the compressor can run in parallel.
No. I proposed to remove only the *exec*, not the _fork_. The idea
is to have each child run library code to (de)compress, rather than to
exec an external program. Well, I do want it to revert to sequential
whenever fork fails, but otherwise it would still take advantage of
_some_ parallelism: 2-way when compressing, and up to 17-way (NMERGE + 1)
when merging. Of course, this parallelism currently kicks in only
with a data set large enough to require temporary files. Now that
dual-core systems are common, a nice project would be to make GNU sort
take advantage of that when performing day-to-day (in-memory) sorts.
We'll have to measure.
> I agree about the robustness, but that should be up to the user.
I want to keep the default code paths as simple and reliable as possible.
That means no exec, and a fall-back to running library code sequentially
upon fork failure -- both for compression and decompression. While this
new feature is a very nice addition, we have to be careful to ensure
that it does not compromise sort's reliability, even under duress.
Here's my reasoning: when compressing with an arbitrary program, sort
may do a *lot* of work up front (there is no decompression at all in the
beginning), yet fail only near the end when it discovers that prog
doesn't accept -d or can't fork. The result: the time spent sorting
that large data set is wasted. With judicious support for at least one
built-in compression method, this error path is completely eliminated.
This is yet another way in which sort differs from tar. If an invalid
compression program is specified to tar, tar fails right away -- it
doesn't first waste hours of compute time. In fact, given an invalid
compressor program, sort may well work repeatedly, potentially over a span
of days or longer, before encountering an input large enough to trigger
the need to use temporary files, where it could hit a decompression
failure. Avoiding this sort of delayed failure is my main motivation
for defining away the problem. And as I'm sure you agree, I'd rather
not have sort test (fork/exec) the compression program to ensure that
it works with -d.
Another reason to avoid the exec is memory. When merging, there are
usually 16 separate gzip processes running in parallel. Sure, each is
pretty small, with an RSS of under 500K on my system, but with 16 of
them, it does add up.
So we need support for at least one built-in compression
library. Here are my selection criteria, in decreasing order
of importance:
  - robust
  - portable
  - reasonable run-time efficiency
  - reasonable compression performance
If some other compression library is a better fit, then we
should add support for it, and consider making it the default.
Similarly, if there are people interested in sorting huge data sets
for which general purpose compression isn't good enough, they can
provide profiling data showing how their domain-specific compressor
is essential. If this ever happens, it would be a good reason to
allow sort to exec an arbitrary program again.
> Perhaps we could put in something that says, "If the
> compressor is named 'gzip' we may optimize that," and
> similarly for 'lzop' and/or a few other compressor names.
> Or, more generally, we could have the convention that if the
> compressor name starts with "-" we will strip the "-" and
> then try to optimize the result if we can.  Something like
> that, anyway.
[*] If gzip and bzip2 are good enough for tar, why should sort make any
compromise (exec'ing some other program) in order