Re: [PATCH] ls: add --files0-from=FILE option

2021-04-26 Thread Carl Edquist via GNU coreutils General Discussion

On Wed, 21 Apr 2021, Carl Edquist wrote:


Thanks Pádraig for the thoughtful reply!

You bring up some good points, which for the sake of interesting
discussion i'd like to follow up on also.  (Maybe later this week...)



So to follow up - i don't have any action items here, just some details 
that i thought might make for interesting discussion.



On Tue, 20 Apr 2021, p...@draigbrady.com wrote:


 In this case, xargs batches *13* separate invocations of ls; so the
 overall sorting is completely lost.

 But with the new option:

  [linux]$ find -name '*.[ch]' -print0 | ls -lrSh --files0-from=-

 The sizes all scroll in order


One can also implement this functionality with the DSU pattern like:

  nlargest=10
  find . -printf '%s\t%p\0' |
  sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2 |
  xargs -r0 ls -lUd --color=auto --


I appreciate you taking the time to write out a full DSU example.

I was going to save this topic for a separate thread, but yeah one of my 
observations over the years is i would see people make (i would say 
oversimplified) comments that "you can do that with find", but they rarely 
go through the effort to show it (and thus demonstrate just how much more 
complicated the equivalent find-based solution is to type out).


(More in this vein a bit later.)



Arguably that's more scalable as the sort operation will not
be restricted by available RAM, and will use temp storage.


Agreed!  This is a good point to keep in mind in general for ls.

On the one hand, ls already has file metadata information available to it 
in raw format (eg, size, timestamps), so there isn't the overhead to 
convert relevant keys to sortable strings and write it all across pipes 
for IPC.


But on the other hand, as you are saying, for the things that find(1) can 
give sort(1) to sort (which is _most_ of the ls --sort modes), sort(1) can 
sort a bit more efficiently and can handle situations with very many file 
names and very limited system memory.


But by the same argument,

find . -mindepth 1 -maxdepth 1 ! -name .\* -printf '%f\0' |
sort -z | xargs -r0 ls $LS_OPTIONS -dU --

is more scalable than a bare 'ls' invocation, since there can be literally 
millions of entries in a single directory.


But 'ls' is certainly easier to type.

And if you want to sort on anything more interesting than the name itself, 
the DSU pipeline just gets more complex.


So at some point you can weigh scalability vs usability based on the 
situation.



(In the linux tree example above, the "ls -lrSh --files0-from=-" run with 
45648 input source files has a maxrss of ~16M, so for instance this use 
case is small enough for me not to worry about mem use.)



[By the same token, the sorting of shell globs in general (which in bash 
can be expanded to an arbitrary number of arguments), can be more scalably 
done outside of the shell (with find|sort) than in a shell glob itself. 
And while that is not a coreutils issue, the point is that despite this, 
(as with ls) shell globs can still be more convenient to use for sorting a 
set of files, perhaps in most cases, than an equivalent multi-tool 
pipeline.]



But doing the sorting in ls (rather than find|sort|cut|xargs) is not just 
easier to type -- it's also easier to _get it right_ without a lot of 
trial and error.


And for casual interactive use, that's kind of a feature, too.


For instance, in your DSU example, i spy a subtle bug in the undecorate 
step: looks like it should be "cut -zf2-" instead of "cut -zf2" ... 
because after all, tabs are legal in filenames, too.


The DSU in pixelbeat's "newest" script (mentioned later) is also not free 
from filename handling issues [1].


(Point being - even for experienced users, writing a replacement for ls's 
internal sorting with an equivalent DSU pipeline can be tricky & time 
consuming to get right, and easy to get subtly wrong.)



Also there are some other subtle differences that might be worth keeping 
in mind:


- To get the same sort order as ls, you actually need "sort -rzk1,1n" 
instead of just "sort -zk1,1n", since for tie-breaking ls sorts the 
keys and the names in opposite directions for ls -t and -S (whether or not 
you pass -r to ls).


(Likewise if you want the same order as ls -S without the -r, you need a 
slightly different "sort -zk1,1nr".)


- when you put 'ls' after a pipe (as in "... | ls -lrSh --files0-from=-") 
you typically get the alias (eg, slackware has ls='/bin/ls $LS_OPTIONS').
Meanwhile 'xargs ls' is whichever 'ls' is first in PATH, and does not 
include LS_OPTIONS.  Not a big deal, but, buyer beware.


- lastly (and perhaps the only thing you can't do anything about), when 
you use 'xargs ls' rather than a single ls invocation, you lose the 
aggregate width alignments for any ls formats with column output (in 
this case -l, but it's also true for -C and -x).




Also it's more robust as ls is the last step in the pipeline,
presenting the files to the user. If you wanted the 

Re: [PATCH] ls: add --files0-from=FILE option

2021-04-21 Thread Carl Edquist via GNU coreutils General Discussion
On Tue, 20 Apr 2021, p...@draigbrady.com wrote:

> One can also implement this functionality with the DSU pattern like:
>
>   nlargest=10
>   find . -printf '%s\t%p\0' |
>   sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2 |
>   xargs -r0 ls -lUd --color=auto --
>
> Arguably that's more scalable as the sort operation will not
> be restricted by available RAM, and will use temp storage.
>
> Also it's more robust as ls is the last step in the pipeline,
> presenting the files to the user. If you wanted the largest 10
> with ls --files0-from then file names containing a newline
> would mess up further processing by tail etc.
>
>>  Similarly, say you would like to view / scroll through your extensive mp3
>>  collection in chronological order (based on when you added the files to
>>  your collection).  You can do it now with the new option:
>>
>>   [music]$ find -name \*.mp3 -print0 | ls -lrth --files0-from=-
>
> I've used a https://www.pixelbeat.org/scripts/newest script
> for a long time with similar find|xargs technique as I described above.
>
> In saying all of the above, I do agree though that for consistency
> commands that need to process all arguments with global context,
> should have a --files0-from option.
> Currently that's du and wc for total counts, and sort(1) for sorting.
> Since ls has sorting functionality, it should have this option too.


Thanks Pádraig for the thoughtful reply!

You bring up some good points, which for the sake of interesting 
discussion i'd like to follow up on also.  (Maybe later this week...)

Have a good day!

Carl

Re: [PATCH] ls: add --files0-from=FILE option

2021-04-21 Thread Carl Edquist via GNU coreutils General Discussion

Hi Berny,

On Wed, 21 Apr 2021, Bernhard Voelker wrote:

shouldn't it use the 'argv-iter' gnulib module (like du.c and wc.c) 
instead of directly using the underlying ...



+#include "readtokens0.h"


I considered this, too!  :)

I think the short answer is that du and wc don't actually need to keep any 
information around from all the files they process except for the running 
totals, so "at scale" the argv-iter version for files0 processing allows 
for a constant (?) memory usage in du and wc regardless of the number of 
input items.


But for ls, even if using argv-iter would result in one less copy of the 
filenames, the scaled memory requirements are still the same for sorting 
(fileinfo for each entry has to be kept till the end).


[If anything, i would look more carefully into whether gobble_file in ls.c 
really needs to make a copy of each filename, rather than just using the 
command line arguments or names from files0 input directly.]



So i ended up following the pattern closer to what's in sort.c.  Note that 
argv-iter is not used in there either, perhaps for a similar reason (it's 
not going to bring the memory requirements down to O(1), the way it could 
for wc and du).



The one case ls might possibly find an improvement with argv-iter is for 
unsorted, non-column output (so, specifically -U1), where the names are 
only needed once.  But in that case there's no need to use the 
'--files0-from' option for global sort or summary info -- you could use 
'xargs -0 ls -U1 ...' instead for identical output.



It just didn't seem like there was a strong argument for it (same memory 
scaling regardless).  And the other frank truth is (if you look at wc.c 
and du.c for examples), it seemed like adding argv_iterator processing 
would significantly complicate the code.



Those were my thoughts anyway  :)

Have a good one,
Carl



Re: [PATCH] ls: add --files0-from=FILE option

2021-04-20 Thread Bernhard Voelker
On 4/20/21 2:57 PM, Pádraig Brady wrote:
> On 19/04/2021 18:08, Carl Edquist wrote:
> In saying all of the above, I do agree though that for consistency
> commands that need to process all arguments with global context,
> should have a --files0-from option.
> Currently that's du and wc for total counts, and sort(1) for sorting.
> Since ls has sorting functionality, it should have this option too.

I don't have a strong opinion about whether to add it or not,
but if we add it, then shouldn't it use the 'argv-iter' gnulib module
(like du.c and wc.c) instead of directly using the underlying ...

>> +#include "readtokens0.h"


?

Have a nice day,
Berny



Re: [PATCH] ls: add --files0-from=FILE option

2021-04-20 Thread Pádraig Brady

On 19/04/2021 18:08, Carl Edquist wrote:

Greetings Coreutils,

I'm submitting for your consideration here a patch to add the standard
'--files0-from=FILE' option to ls.

(To read NUL-terminated names from FILE rather than using command line
arguments.)


Motivation for adding this to ls is mainly to let ls sort arbitrarily many
items, though it is also necessary for getting the correct aggregate
column widths to align long format (-l) output across all file names.


As a real example, if you want to use ls to list (say, in long format with
human sizes) all the sources in the linux kernel tree according to size,
you might naively try one of

 [linux]$ find -name '*.[ch]' -exec ls -lrSh {} +
 [linux]$ find -name '*.[ch]' -print0 | xargs -0 ls -lrSh

but you'll see the sizes spiral over and over, finally ending somewhere in
the middle:

...
-rw-r--r-- 1 kx users  81K Apr 15 04:30 ./arch/arm/boot/dts/imx6sll-pinfunc.h
-rw-r--r-- 1 kx users  83K Apr 15 04:30 ./arch/arm/boot/dts/imx6dl-pinfunc.h
-rw-r--r-- 1 kx users  87K Apr 15 04:30 ./arch/arm/mach-imx/iomux-mx35.h
-rw-r--r-- 1 kx users 107K Apr 15 04:30 ./arch/arm/boot/dts/imx7d-pinfunc.h
-rw-r--r-- 1 kx users 143K Apr 15 04:30 ./arch/arm/boot/dts/imx6sx-pinfunc.h


In this case, xargs batches *13* separate invocations of ls; so the
overall sorting is completely lost.

But with the new option:

 [linux]$ find -name '*.[ch]' -print0 | ls -lrSh --files0-from=-

The sizes all scroll in order


One can also implement this functionality with the DSU pattern like:

  nlargest=10
  find . -printf '%s\t%p\0' |
  sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2 |
  xargs -r0 ls -lUd --color=auto --

Arguably that's more scalable as the sort operation will not
be restricted by available RAM, and will use temp storage.

Also it's more robust as ls is the last step in the pipeline,
presenting the files to the user. If you wanted the largest 10
with ls --files0-from then file names containing a newline
would mess up further processing by tail etc.


Similarly, say you would like to view / scroll through your extensive mp3
collection in chronological order (based on when you added the files to
your collection).  You can do it now with the new option:

 [music]$ find -name \*.mp3 -print0 | ls -lrth --files0-from=-


I've used a https://www.pixelbeat.org/scripts/newest script
for a long time with similar find|xargs technique as I described above.

In saying all of the above, I do agree though that for consistency
commands that need to process all arguments with global context,
should have a --files0-from option.
Currently that's du and wc for total counts, and sort(1) for sorting.
Since ls has sorting functionality, it should have this option too.

thanks!
Pádraig



Re: [PATCH] ls: add --files0-from=FILE option

2021-04-19 Thread Carl Edquist via GNU coreutils General Discussion

On Mon, 19 Apr 2021, Carl Edquist wrote:

I'm submitting for your consideration here a patch to add the standard 
'--files0-from=FILE' option to ls.


Oops! Somehow I ended up attaching an old version of the patch - this one 
includes some cleanups and help output.


Thanks,
CarlFrom 30fb711962bbd432aea4268baf93d7ba17aae4bc Mon Sep 17 00:00:00 2001
From: Carl Edquist 
Date: Fri, 10 May 2019 17:05:47 -0500
Subject: [PATCH] ls: add --files0-from=FILE option

Useful for things like

find [...] -type f -print0 | ls -lrSh --files0-from=-

where an attempt to do the same with 'find -print0 | xargs -0' or
'find -exec' would fail to fit all the file names onto a single command
line, and thus ls would not sort all the input items together, nor
align the columns from -l across all items.

* src/ls.c (files_from): New var for input filename.
(long_options): Add new long option.
(read_files0): Add helper function to consume files0 input.
(main): Add logic for files_from handling.
(decode_switches): Handle FILES0_FROM_OPTION.
* tests/ls/files0-from-option.sh: Excercise new option.
* tests/local.mk: Include new test.
* doc/coreutils.texi: Document --files0-from=FILE option.
* NEWS: Mention the new feature.
---
 NEWS   |  3 ++
 doc/coreutils.texi |  2 ++
 src/ls.c   | 69 +++---
 tests/local.mk |  1 +
 tests/ls/files0-from-option.sh | 40 
 5 files changed, 111 insertions(+), 4 deletions(-)
 create mode 100755 tests/ls/files0-from-option.sh

diff --git a/NEWS b/NEWS
index 090fbc7..7f7f86b 100644
--- a/NEWS
+++ b/NEWS
@@ -73,6 +73,9 @@ GNU coreutils NEWS-*- outline -*-
   ls now accepts the --sort=width option, to sort by file name width.
   This is useful to more compactly organize the default vertical column output.
 
+  ls now accepts the --files0-from=FILE option, where FILE contains a
+  list of NUL-terminated file names.
+
   nl --line-increment can now take a negative number to decrement the count.
 
 ** Improvements
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 3e3aedb..e008746 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -7467,6 +7467,8 @@ option has been specified (@option{--classify} (@option{-F}),
 @option{--dereference} (@option{-L}), or
 @option{--dereference-command-line} (@option{-H})).
 
+@filesZeroFromOption{ls,,sorted output (and aligned columns in -l mode)}
+
 @item --group-directories-first
 @opindex --group-directories-first
 Group all the directories before the files and then sort the
diff --git a/src/ls.c b/src/ls.c
index 4586b5e..0d0678b 100644
--- a/src/ls.c
+++ b/src/ls.c
@@ -102,6 +102,7 @@
 #include "mpsort.h"
 #include "obstack.h"
 #include "quote.h"
+#include "readtokens0.h"
 #include "smack.h"
 #include "stat-size.h"
 #include "stat-time.h"
@@ -356,6 +357,9 @@ static bool align_variable_outer_quotes;
 static void **sorted_file;
 static size_t sorted_file_alloc;
 
+/* input filename for --files0-from */
+static char *files_from;
+
 /* When true, in a color listing, color each symlink name according to the
type of file it points to.  Otherwise, color them according to the 'ln'
directive in LS_COLORS.  Dangling (orphan) symlinks are treated specially,
@@ -833,6 +837,7 @@ enum
   COLOR_OPTION,
   DEREFERENCE_COMMAND_LINE_SYMLINK_TO_DIR_OPTION,
   FILE_TYPE_INDICATOR_OPTION,
+  FILES0_FROM_OPTION,
   FORMAT_OPTION,
   FULL_TIME_OPTION,
   GROUP_DIRECTORIES_FIRST_OPTION,
@@ -892,6 +897,7 @@ static struct option const long_options[] =
   {"block-size", required_argument, NULL, BLOCK_SIZE_OPTION},
   {"context", no_argument, 0, 'Z'},
   {"author", no_argument, NULL, AUTHOR_OPTION},
+  {"files0-from", required_argument, NULL, FILES0_FROM_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -1626,11 +1632,38 @@ signal_restore (void)
   signal_setup (false);
 }
 
+static void
+read_files0(struct Tokens *tokp)
+{
+  FILE *stream;
+  if (STREQ (files_from, "-"))
+stream = stdin;
+  else
+{
+  stream = fopen (files_from, "r");
+  if (stream == NULL)
+die (EXIT_FAILURE, errno, _("cannot open %s for reading"),
+ quoteaf (files_from));
+}
+
+  readtokens0_init (tokp);
+
+  if (! readtokens0 (stream, tokp) || fclose (stream) != 0)
+die (LS_FAILURE, 0, _("cannot read file names from %s"),
+ quoteaf (files_from));
+
+  if (! tokp->n_tok)
+die (LS_FAILURE, 0, _("no input from %s"),
+ quoteaf (files_from));
+}
+
 int
 main (int argc, char **argv)
 {
   int i;
   struct pending *thispend;
+  struct Tokens tok;
+  char **files;
   int n_files;
 
   initialize_main (, );
@@ -1727,6 +1760,23 @@ main (int argc, char **argv)
   clear_files ();
 
   n_files = argc - i;
+  files = argv + i;
+
+  if (files_from)
+{
+  if (n_files > 0)
+{
+  error (0, 0, _("extra operand %s"), quoteaf