Re: sort behavior - Ubuntu problem?

2007-01-25 Thread The Wanderer

Kevin Scannell wrote:


> Wanderer, could you tell me what version of glibc you have?  Here's mine:
>
> ii  libc6-dev  2.4-1ubuntu12  GNU C Library: Development Libraries and Hea

Here's mine:

ii  libc6-dev  2.3.6.ds1-9    GNU C Library: Development Libraries and Heade


Apparently Ubuntu is more bleeding-edge than Debian unstable is, in this
respect.

--
  The Wanderer

Warning: Simply because I argue an issue does not mean I agree with any
side of it.

Secrecy is the beginning of tyranny.



Re: feature request: gzip/bzip support for sort

2007-01-25 Thread Jim Meyering
Paul Eggert [EMAIL PROTECTED] wrote:
> Jim Meyering [EMAIL PROTECTED] writes:
>> I'm probably going to change the documentation so that
>> people will be less likely to depend on being able to run
>> a separate program.  To be precise, I'd like to document
>> that the only valid values of GNUSORT_COMPRESSOR are the
>> empty string, "gzip" and "bzip2" [*].

> This sounds extreme, particularly since gzip and bzip2 are
> not the best algorithms for 'sort' compression, where you
> want a fast compressor.  Better choices right now would
> include lzop <http://www.lzop.org/> and maybe
> QuickLZ <http://www.quicklz.com/>.

I see that streaming support in QuickLZ is available only in the very
latest beta.  That makes me think it's not quite ready for use in
GNU sort.  Maybe lzop is more mature.  I haven't looked yet.

> The fast-compressor field is moving fairly rapidly.
> (I've heard some rumors from some of my commercial friends.)
> QuickLZ, a new algorithm, is at the top of the
> maximumcompression list right now for fast compressors; see
> http://www.maximumcompression.com/data/summary_mf3.php.
> I would not be surprised to see a new champ next year.

>> Then we will have the liberty to remove the exec calls and use library
>> code instead, thus making the code a little more efficient -- but mainly,
>> more robust.
>
> It's not clear to me that it'll be more efficient for the
> soon-to-be common case of multicore chips, since 'sort' and
> the compressor can run in parallel.

No.  I proposed to remove only the *exec*, not the *fork*.  The idea
is to have each child run library code to (de)compress, rather than to
exec an external program.  Well, I do want it to revert to sequential
whenever fork fails, but otherwise it would still take advantage of
_some_ parallelism: 2-way when compressing, and up to 17-way (NMERGE + 1)
when merging.  Of course, this parallelism currently kicks in only
with a data set large enough to require temporary files.  Now that
dual-core systems are common, a nice project would be to make GNU sort
take advantage of that when performing day-to-day (in-memory) sorts.
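A rough sketch of that plan, assuming zlib as the built-in compressor
(the helper names below are invented for illustration; this is not
sort's actual code):

  /* Compress a buffer to a gzip temp file in a forked child, falling
     back to sequential in-process compression when fork fails.  */
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <zlib.h>

  /* Write LEN bytes of BUF to PATH in gzip format; 0 on success.  */
  static int
  gzip_buffer_to_file (const char *path, const char *buf, size_t len)
  {
    gzFile out = gzopen (path, "wb");
    if (!out)
      return -1;
    int ok = gzwrite (out, buf, (unsigned int) len) == (int) len;
    return (gzclose (out) == Z_OK && ok) ? 0 : -1;
  }

  /* Compress BUF to PATH in a child when possible, so the parent can
     keep sorting.  Return the child's PID, 0 if the work was done
     sequentially after fork failed, or -1 on error.  */
  static pid_t
  compress_to_temp (const char *path, const char *buf, size_t len)
  {
    pid_t pid = fork ();
    if (pid == 0)   /* child: run library code -- no exec */
      _exit (gzip_buffer_to_file (path, buf, len) == 0 ? 0 : 1);
    if (pid < 0)    /* fork failed: revert to sequential */
      return gzip_buffer_to_file (path, buf, len) == 0 ? 0 : -1;
    return pid;     /* parent: waitpid (pid, ...) before using PATH */
  }

  int
  main (void)
  {
    static const char buf[] = "hello, sorted world\n";
    pid_t pid = compress_to_temp ("/tmp/sorttmp.gz", buf, sizeof buf - 1);
    if (pid > 0)
      {
        int status;
        waitpid (pid, &status, 0);
      }
    return pid < 0 ? 1 : 0;
  }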

> We'll have to measure.
> I agree about the robustness, but that should be up to the user.

I want to keep the default code paths as simple and reliable as possible.
That means no exec, and a fall-back to running library code sequentially
upon fork failure -- both for compression and decompression.  While this
new feature is a very nice addition, we have to be careful to ensure
that it does not compromise sort's reliability, even under duress.

Here's my reasoning: when compressing with an arbitrary program, sort
may do a *lot* of work up front (there is no decompression at all in the
beginning), yet fail only near the end when it discovers that "prog"
doesn't accept -d or can't fork.  The result: the time spent sorting
that large data set is wasted.  With judicious support for at least one
built-in compression method, this error path is completely eliminated.
This is yet another way in which sort differs from tar.  If an invalid
compression program is specified to tar, tar fails right away -- it
doesn't first waste hours of compute time.  In fact, given an invalid
compressor program, sort may well work repeatedly, potentially over a span
of days or longer, before encountering an input large enough to trigger
the need to use temporary files, where it could hit a decompression
failure.  Avoiding this sort of delayed failure is my main motivation
for defining away the problem.  And as I'm sure you agree, I'd rather
not have sort test (fork/exec) the compression program to ensure that
it works with -d.

Another reason to avoid the exec is memory.  When merging, there are
usually 16 separate gzip processes running in parallel.  Sure, each is
pretty small, with an RSS of under 500K on my system, but with 16 of
them, it does add up.

So we need support for at least one built-in compression
library.  Here are my selection criteria, in decreasing order
of importance:

  - robustness
  - portability
  - reasonable run-time efficiency
  - reasonable compression performance

If some other compression library is a better fit, then we
should add support for it, and consider making it the default.

Similarly, if there are people interested in sorting huge data sets
for which general purpose compression isn't good enough, they can
provide profiling data showing how their domain-specific compressor
is essential.  If this ever happens, it would be a good reason to
allow sort to exec an arbitrary program again.

> Perhaps we could put in something that says, "If the
> compressor is named 'gzip' we may optimize that," and
> similarly for 'lzop' and/or a few other compressor names.
> Or, more generally, we could have the convention that if the
> compressor name starts with "-", we will strip the "-" and
> then try to optimize the result if we can.  Something like
> that, anyway.
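A tiny sketch of how that convention might look (choose_compressor and
its behavior are invented here, purely for illustration):

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch of the proposed convention: a leading "-" on the
     compressor name lets sort substitute built-in library code when
     it recognizes the rest of the name.  Entirely illustrative.  */
  static void
  choose_compressor (const char *spec)
  {
    if (spec == NULL || *spec == '\0')
      puts ("no compression of temporary files");
    else if (*spec == '-')
      printf ("may use built-in code equivalent to %s\n", spec + 1);
    else
      printf ("must exec the external program %s\n", spec);
  }

  int
  main (void)
  {
    choose_compressor (getenv ("GNUSORT_COMPRESSOR"));
    return 0;
  }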

>> [*] If gzip and bzip2 are good enough for tar, why should sort make any
>> compromise (exec'ing some other program) in order 

stat() order performance issues

2007-01-25 Thread Phillip Susi
I have noticed that performing commands such as ls (even with -U) and
du in a Maildir with many thousands of small files takes ages to
complete.  I have investigated and believe this is due to the order in
which the files are stat()ed.  These utilities appear simply to stat()
the files in the order in which they are returned by readdir(), and
this causes a lot of random disk reads to fetch the inodes from disk
out of order.


My initial testing indicates that sorting the files into inode order and 
calling stat() on them in order is around an order of magnitude faster, 
so I would suggest that utilities be modified to behave this way.
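
A minimal standalone illustration of the idea (a sketch, not a patch
to ls or du; it assumes readdir()'s d_ino is usable, which holds on
typical Linux filesystems):

  /* stat() directory entries in inode order instead of readdir()
     order, turning random inode-table reads into mostly sequential
     ones.  Error handling is abbreviated.  */
  #include <dirent.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <sys/types.h>

  struct ent { ino_t ino; char *name; };

  static int
  by_inode (const void *a, const void *b)
  {
    ino_t ia = ((const struct ent *) a)->ino;
    ino_t ib = ((const struct ent *) b)->ino;
    return ia < ib ? -1 : ia > ib;
  }

  int
  main (int argc, char **argv)
  {
    const char *dir = argc > 1 ? argv[1] : ".";
    int dfd = open (dir, O_RDONLY | O_DIRECTORY);
    DIR *dp = dfd < 0 ? NULL : fdopendir (dfd);
    if (!dp)
      { perror (dir); return 1; }

    /* Collect names and inode numbers from readdir.  */
    struct ent *v = NULL;
    size_t n = 0, cap = 0;
    struct dirent *de;
    while ((de = readdir (dp)))
      {
        if (strcmp (de->d_name, ".") == 0 || strcmp (de->d_name, "..") == 0)
          continue;
        if (n == cap)
          v = realloc (v, (cap = cap ? 2 * cap : 64) * sizeof *v);
        v[n].ino = de->d_ino;
        v[n].name = strdup (de->d_name);
        n++;
      }

    /* Sort by inode number, then stat in that order.  */
    qsort (v, n, sizeof *v, by_inode);
    for (size_t i = 0; i < n; i++)
      {
        struct stat st;
        if (fstatat (dfd, v[i].name, &st, AT_SYMLINK_NOFOLLOW) != 0)
          perror (v[i].name);
        free (v[i].name);
      }
    free (v);
    closedir (dp);  /* also closes dfd */
    return 0;
  }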


Questions/comments?

Re: feature request: gzip/bzip support for sort

2007-01-25 Thread Jim Meyering
Dan Hipschman [EMAIL PROTECTED] wrote:
> On Wed, Jan 24, 2007 at 08:08:18AM +0100, Jim Meyering wrote:
>> I've checked in your changes, then changed NEWS a little:

> Great!  Thanks :-)

>> Additionally, I'm probably going to change the documentation so that
>> people will be less likely to depend on being able to run a separate
>> program.  To be precise, I'd like to document that the only valid values
>> of GNUSORT_COMPRESSOR are the empty string, "gzip" and "bzip2" [*].
>> Then we will have the liberty to remove the exec calls and use library
>> code instead, thus making the code a little more efficient -- but mainly,
>> more robust.

> Why not add a special value 'libz' and document it as follows:

We'll see.  I'm really inclined to disallow the exec option, unless
someone provides a good use case.  It's fun to add features, but so
far this one is not justified, given its downsides.

> By the way, I've got a little amendment to the patch.  I took a look at
> gnulib's findprog module, and it turns out find_in_path does an
> access X_OK check itself, so sort doesn't need to do it again.


> 2007-01-24  Dan Hipschman  [EMAIL PROTECTED]
>
>   * src/sort.c (create_temp): Remove superfluous access-X_OK
>   check.  find_in_path does this for us.

Thanks.  Applied.
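
For reference, a minimal sketch of the call in question (findprog.h is
a gnulib header, so this assumes a gnulib-enabled build tree rather
than system headers):

  #include <stdio.h>
  #include "findprog.h"

  int
  main (void)
  {
    /* find_in_path searches $PATH for an executable "gzip" and does
       the X_OK access check itself, so callers need not repeat it;
       it returns the program name unchanged when nothing is found.  */
    const char *prog = find_in_path ("gzip");
    printf ("%s\n", prog);
    return 0;
  }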

