Re: sort: new feature: use environment variable to set buffer size

2012-08-30 Thread Jim Meyering
Pádraig Brady wrote:
 Thanks for the detailed rationale, however
 the existing env variables are significant to more utils than sort(1).
 I.E. they're generally system level settings, rather than command level.
 Also sort -S is very portable, even though not standardised.
 solaris' sort(1) has -S and GNU sort is used on most other platforms,
 which has -S available since TEXTUTILS-2_0_10-58-gbf86c62

 Note also this thread on the selection of a default buffer size for pipes:
 http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887

 So currently I'd be 70:30 against adding such a variable.

Thanks for replying, Pádraig.  Those are good points.
Another is that we have a strong aversion to adding
support for new environment variables in coreutils programs
because it makes formerly-robust usage prone to malfunction or abuse.

Imagine a script that worked fine before, but when someone
ends up running sort with some unusual or extreme envvar
setting that now makes that shell script's use of sort fail.

With a new envvar as proposed, every sort-using script would
have to be retrofitted with code to keep an erroneous or malicious
envvar setting from impacting sort.

Just say No! to new envvars ;-)



sort: new feature: use environment variable to set buffer size

2012-08-29 Thread Assaf Gordon
Hello,

I'd like to suggest a new feature to sort: the ability to set the buffer size 
(-S/--buffer-size X) using an environment variable.

In summary:
 $ export SORT_BUFFER_SIZE=20G 
 $ someprogram | sort -k1,1  output.txt
 # sort will use 20G of RAM, as if --buffer-size 20G was specified.


The rational:
recent commits improved the guessed buffer size when sort is given an input 
file,
but these don't apply if sort is used as part of a pipe line, with a pipe as 
input, e.g.
  some | program | sort | other | programs  file 

(Tested with v8.19 on linux 2.6.32, sort consumes few MBs of RAM, even though 
many GBs are available).
This results in many small temporary files being created.

The script (which uses sort) is not under my direct control, but even if it was,
I don't want to hard-code the amount of memory used, to keep it portable to 
different servers.

AFAIK, there are four aspects of sort the affect performance:
1. number of threads:
changeable with --parallel=X and with environment variable OMP_NUM_THREADS.

2. temporary files location:
changeable with --temporary-directory=DIR and with environment variable 
TMPDIR.

3. memory usage:
changeable with --buffer-size=SIZE but not with environment variable.

4. compression program:
changeable with --compression-program=PROG but not with environment variable.
(but at the moment, I do not address this aspect).


With the attached patch, sort will read an environment variable named 
SORT_BUFFER_SIZE, and will treat it as if --buffer-size was specified (but 
only if --buffer-size wasn't used on the command line).

If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, 
help, docs, etc.).

Regards,
 -gordon
From db8f1c319d772c5b13df51894f279c3a7276416e Mon Sep 17 00:00:00 2001
From: A. Gordon gor...@cshl.edu
Date: Wed, 29 Aug 2012 16:42:31 -0400
Subject: [PATCH] sort: accept buffer size from environment variable.

---
 src/sort.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9dbfee1..1505a6d 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -4648,6 +4648,13 @@ main (int argc, char **argv)
   files = minus;
 }
 
+  if (sort_size == 0)
+{
+  char const *buffer_size = getenv (SORT_BUFFER_SIZE);
+  if (buffer_size)
+specify_sort_size(-1,'S',buffer_size);
+}
+
   /* Need to re-check that we meet the minimum requirement for memory
  usage with the final value for NMERGE. */
   if (0  sort_size)
-- 
1.7.9.1



Re: sort: new feature: use environment variable to set buffer size

2012-08-29 Thread Pádraig Brady
On 08/29/2012 09:50 PM, Assaf Gordon wrote:
 Hello,
 
 I'd like to suggest a new feature to sort: the ability to set the buffer size 
 (-S/--buffer-size X) using an environment variable.
 
 In summary:
  $ export SORT_BUFFER_SIZE=20G 
  $ someprogram | sort -k1,1  output.txt
  # sort will use 20G of RAM, as if --buffer-size 20G was specified.
 
 
 The rational:
 recent commits improved the guessed buffer size when sort is given an input 
 file,
 but these don't apply if sort is used as part of a pipe line, with a pipe as 
 input, e.g.
   some | program | sort | other | programs  file 
 
 (Tested with v8.19 on linux 2.6.32, sort consumes few MBs of RAM, even though 
 many GBs are available).
 This results in many small temporary files being created.
 
 The script (which uses sort) is not under my direct control, but even if it 
 was,
 I don't want to hard-code the amount of memory used, to keep it portable to 
 different servers.
 
 AFAIK, there are four aspects of sort the affect performance:
 1. number of threads:
 changeable with --parallel=X and with environment variable OMP_NUM_THREADS.
 
 2. temporary files location:
 changeable with --temporary-directory=DIR and with environment variable 
 TMPDIR.
 
 3. memory usage:
 changeable with --buffer-size=SIZE but not with environment variable.
 
 4. compression program:
 changeable with --compression-program=PROG but not with environment 
 variable.
 (but at the moment, I do not address this aspect).
 
 
 With the attached patch, sort will read an environment variable named 
 SORT_BUFFER_SIZE, and will treat it as if --buffer-size was specified 
 (but only if --buffer-size wasn't used on the command line).
 
 If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, 
 help, docs, etc.).
 
 Regards,
  -gordon

Thanks for the detailed rationale, however
the existing env variables are significant to more utils than sort(1).
I.E. they're generally system level settings, rather than command level.
Also sort -S is very portable, even though not standardised.
solaris' sort(1) has -S and GNU sort is used on most other platforms,
which has -S available since TEXTUTILS-2_0_10-58-gbf86c62

Note also this thread on the selection of a default buffer size for pipes:
http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887

So currently I'd be 70:30 against adding such a variable.

cheers,
Pádraig.