Re: sort: new feature: use environment variable to set buffer size
Pádraig Brady wrote: Thanks for the detailed rationale, however the existing env variables are significant to more utils than sort(1). I.E. they're generally system level settings, rather than command level. Also sort -S is very portable, even though not standardised. solaris' sort(1) has -S and GNU sort is used on most other platforms, which has -S available since TEXTUTILS-2_0_10-58-gbf86c62 Note also this thread on the selection of a default buffer size for pipes: http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887 So currently I'd be 70:30 against adding such a variable. Thanks for replying, Pádraig. Those are good points. Another is that we have a strong aversion to adding support for new environment variables in coreutils programs because it makes formerly-robust usage prone to malfunction or abuse. Imagine a script that worked fine before, but when someone ends up running sort with some unusual or extreme envvar setting that now makes that shell script's use of sort fail. With a new envvar as proposed, every sort-using script would have to be retrofitted with code to keep an erroneous or malicious envvar setting from impacting sort. Just say No! to new envvars ;-)
sort: new feature: use environment variable to set buffer size
Hello, I'd like to suggest a new feature to sort: the ability to set the buffer size (-S/--buffer-size X) using an environment variable. In summary: $ export SORT_BUFFER_SIZE=20G $ someprogram | sort -k1,1 output.txt # sort will use 20G of RAM, as if --buffer-size 20G was specified. The rational: recent commits improved the guessed buffer size when sort is given an input file, but these don't apply if sort is used as part of a pipe line, with a pipe as input, e.g. some | program | sort | other | programs file (Tested with v8.19 on linux 2.6.32, sort consumes few MBs of RAM, even though many GBs are available). This results in many small temporary files being created. The script (which uses sort) is not under my direct control, but even if it was, I don't want to hard-code the amount of memory used, to keep it portable to different servers. AFAIK, there are four aspects of sort the affect performance: 1. number of threads: changeable with --parallel=X and with environment variable OMP_NUM_THREADS. 2. temporary files location: changeable with --temporary-directory=DIR and with environment variable TMPDIR. 3. memory usage: changeable with --buffer-size=SIZE but not with environment variable. 4. compression program: changeable with --compression-program=PROG but not with environment variable. (but at the moment, I do not address this aspect). With the attached patch, sort will read an environment variable named SORT_BUFFER_SIZE, and will treat it as if --buffer-size was specified (but only if --buffer-size wasn't used on the command line). If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, help, docs, etc.). Regards, -gordon From db8f1c319d772c5b13df51894f279c3a7276416e Mon Sep 17 00:00:00 2001 From: A. Gordon gor...@cshl.edu Date: Wed, 29 Aug 2012 16:42:31 -0400 Subject: [PATCH] sort: accept buffer size from environment variable. --- src/sort.c |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/src/sort.c b/src/sort.c index 9dbfee1..1505a6d 100644 --- a/src/sort.c +++ b/src/sort.c @@ -4648,6 +4648,13 @@ main (int argc, char **argv) files = minus; } + if (sort_size == 0) +{ + char const *buffer_size = getenv (SORT_BUFFER_SIZE); + if (buffer_size) +specify_sort_size(-1,'S',buffer_size); +} + /* Need to re-check that we meet the minimum requirement for memory usage with the final value for NMERGE. */ if (0 sort_size) -- 1.7.9.1
Re: sort: new feature: use environment variable to set buffer size
On 08/29/2012 09:50 PM, Assaf Gordon wrote: Hello, I'd like to suggest a new feature to sort: the ability to set the buffer size (-S/--buffer-size X) using an environment variable. In summary: $ export SORT_BUFFER_SIZE=20G $ someprogram | sort -k1,1 output.txt # sort will use 20G of RAM, as if --buffer-size 20G was specified. The rational: recent commits improved the guessed buffer size when sort is given an input file, but these don't apply if sort is used as part of a pipe line, with a pipe as input, e.g. some | program | sort | other | programs file (Tested with v8.19 on linux 2.6.32, sort consumes few MBs of RAM, even though many GBs are available). This results in many small temporary files being created. The script (which uses sort) is not under my direct control, but even if it was, I don't want to hard-code the amount of memory used, to keep it portable to different servers. AFAIK, there are four aspects of sort the affect performance: 1. number of threads: changeable with --parallel=X and with environment variable OMP_NUM_THREADS. 2. temporary files location: changeable with --temporary-directory=DIR and with environment variable TMPDIR. 3. memory usage: changeable with --buffer-size=SIZE but not with environment variable. 4. compression program: changeable with --compression-program=PROG but not with environment variable. (but at the moment, I do not address this aspect). With the attached patch, sort will read an environment variable named SORT_BUFFER_SIZE, and will treat it as if --buffer-size was specified (but only if --buffer-size wasn't used on the command line). If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, help, docs, etc.). Regards, -gordon Thanks for the detailed rationale, however the existing env variables are significant to more utils than sort(1). I.E. they're generally system level settings, rather than command level. Also sort -S is very portable, even though not standardised. solaris' sort(1) has -S and GNU sort is used on most other platforms, which has -S available since TEXTUTILS-2_0_10-58-gbf86c62 Note also this thread on the selection of a default buffer size for pipes: http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887 So currently I'd be 70:30 against adding such a variable. cheers, Pádraig.