On Wed, 21 Apr 2021, Carl Edquist wrote:

Thanks Pádraig for the thoughtful reply!

You bring up some good points, which for the sake of interesting
discussion i'd like to follow up on also.  (Maybe later this week...)


So to follow up - i don't have any action items here, just some details that i thought might make for interesting discussion.


On Tue, 20 Apr 2021, p...@draigbrady.com wrote:

 In this case, xargs batches *13* separate invocations of ls; so the
 overall sorting is completely lost.

 But with the new option:

      [linux]$ find -name '*.[ch]' -print0 | ls -lrSh --files0-from=-

 The sizes all scroll in order

One can also implement this functionality with the DSU pattern like:

  nlargest=10
  find . -printf '%s\t%p\0' |
  sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2 |
  xargs -r0 ls -lUd --color=auto --

I appreciate you taking the time to write out a full DSU example.

I was going to save this topic for a separate thread, but one of my observations over the years is that people make (i would say oversimplified) comments that "you can do that with find", yet they rarely go to the effort of showing it (and thus demonstrating just how much more complicated the equivalent find-based solution is to type out).

(More in this vein a bit later.)


Arguably that's more scalable as the sort operation will not
be restricted by available RAM, and will use temp storage.

Agreed!  This is a good point to keep in mind in general for ls.

On the one hand, ls already has file metadata information available to it in raw format (eg, size, timestamps), so there isn't the overhead to convert relevant keys to sortable strings and write it all across pipes for IPC.

But on the other hand, as you are saying, for the things that find(1) can give sort(1) to sort (which is _most_ of the ls --sort modes), sort(1) can sort a bit more efficiently and can handle situations with very many file names and very limited system memory.

But by the same argument,

    find . -mindepth 1 -maxdepth 1 ! -name .\* -printf '%f\0' |
    sort -z | xargs -r0 ls $LS_OPTIONS -dU --

is more scalable than a bare 'ls' invocation, since there can be literally millions of entries in a single directory.

But 'ls' is certainly easier to type.

And if you want to sort on anything more interesting than the name itself, the DSU pipeline just gets more complex.

So at some point you can weigh scalability vs usability based on the situation.


(In the linux tree example above, the "ls -lrSh --files0-from=-" run with 45648 input source files has a maxrss of ~16M, so for instance this use case is small enough for me not to worry about mem use.)


[By the same token, the sorting of shell globs in general (which in bash can expand to an arbitrary number of arguments) can be done more scalably outside of the shell (with find|sort) than by the shell glob itself. And while that is not a coreutils issue, the point is that, as with ls, shell globs can still be more convenient for sorting a set of files, perhaps in most cases, than an equivalent multi-tool pipeline.]


But doing the sorting in ls (rather than find|sort|cut|xargs) is not just easier to type -- it's also easier to _get it right_ without a lot of trial and error.

And for casual interactive use, that's kind of a feature, too.


For instance, in your DSU example, i spy a subtle bug in the undecorate step: looks like it should be "cut -zf2-" instead of "cut -zf2" ... because after all, tabs are legal in filenames, too.

The DSU in pixelbeat's "newest" script (mentioned later) is also not free from filename handling issues [1].

(Point being - even for experienced users, writing a replacement for ls's internal sorting with an equivalent DSU pipeline can be tricky & time consuming to get right, and easy to get subtly wrong.)
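
To make the "cut -zf2" vs "cut -zf2-" difference concrete, here is a minimal sketch, assuming GNU cut (for -z). The record and the file name in it are made up for illustration; nothing here comes from the original pipeline beyond the two cut invocations.

```shell
#!/bin/sh
# Emit one NUL-terminated decorated record: size, tab, file name.
# The name "a<TAB>b.c" is hypothetical; it contains a tab of its own.
decorated() { printf '42\ta\tb.c\0'; }

# -f2 keeps only the second tab-separated field, so the name is cut short.
truncated=$(decorated | cut -z -f2 | tr '\0' '\n')

# -f2- keeps the second field and everything after it, tabs included.
intact=$(decorated | cut -z -f2- | tr '\0' '\n')

printf 'cut -f2  gives: %s\n' "$truncated"
printf 'cut -f2- gives: %s\n' "$intact"
```

With -f2 the name that reaches xargs is just "a", which is either a missing file or, worse, a different file.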


Also there are some other subtle differences that might be worth keeping in mind:

- To get the same sort order as ls, you actually need "sort -rzk1,1n" instead of just "sort -zk1,1n", since for tie-breaking ls sorts the keys and the names in opposite directions for ls -t and -S (whether or not you pass -r to ls).

(Likewise if you want the same order as ls -S without the -r, you need a slightly different "sort -zk1,1nr".)

- when you put 'ls' after a pipe (as in "... | ls -lrSh --files0-from=-") you typically get the alias (eg, slackware has ls='/bin/ls $LS_OPTIONS'). Meanwhile 'xargs ls' is whichever 'ls' is first in PATH, and does not include LS_OPTIONS. Not a big deal, but, buyer beware.

- lastly (and perhaps the only thing you can't do anything about), when you use 'xargs ls' rather than a single ls invocation, you lose the aggregate width alignments for any ls formats with column output (in this case -l, but it's also true for -C and -x).
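
The tie-breaking point in the first item above can be checked with a small sketch, assuming GNU sort and cut (for -z). The three records are made up, with "a" and "b" deliberately tied on size; LC_ALL=C is set only to keep the name comparison predictable.

```shell
#!/bin/sh
# Three hypothetical decorated records (size, tab, name); "a" and "b" tie.
records() { printf '5\ta\0'; printf '3\tc\0'; printf '5\tb\0'; }

# Print only the names, one per line, after sorting with the given options.
names() { records | LC_ALL=C sort -z "$@" | cut -z -f2- | tr '\0' '\n'; }

# Ascending size; ties fall back to ascending whole-line order.
plain=$(names -k1,1n)

# Ascending size (the key carries its own 'n', so it does not inherit the
# global -r), while the last-resort comparison for ties is reversed.
like_ls_rS=$(names -r -k1,1n)

# Descending size; ties fall back to ascending whole-line order.
like_ls_S=$(names -k1,1nr)

printf '%s\n' 'sort -k1,1n:' "$plain"
printf '%s\n' 'sort -r -k1,1n:' "$like_ls_rS"
printf '%s\n' 'sort -k1,1nr:' "$like_ls_S"
```

Only the second and third variants put the tied names in the order that ls -rS and ls -S, respectively, would.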


Also it's more robust as ls is the last step in the pipeline,
presenting the files to the user. If you wanted the largest 10
with ls --files0-from then file names containing a newline
would mess up further processing by tail etc.

So in the examples i gave, i actually had in mind to output the entire listing to the terminal, so it can be scrolled through for review.
(That is, without filtering, with say 'tail'.)

And the biggest or newest files are displayed prominently at the end of the listing, which is what is visible when control is returned to the user. From there i just copied off the final 5 entries.


I was actually very careful in my examples to avoid the question of doing any kind of processing of output from ls. But since you brought it up, for what it's worth i'll note a couple small thoughts:


- in my examples, i had in mind the default terminal output, which already prints '?' for nongraphic characters. So, if i wanted (as in your example) just the last 10 items, but as a *post* processing step, i would first have to add the ls -q option to retain the '?' replacement for non-tty output.

At that point, piping to 'tail' (without -z) would not be messed up at all by file names containing newlines, since they'd all be replaced with '?'s. So in that sense, robustness is actually _not_ affected.


- nevertheless, i do expect putting the 'tail' step before ls to be a bit more efficient, since it avoids ls printing the long listing for files that won't make it into the final tail output. ('ls -lrShq | tail' prints the long-format info even for lines that will be discarded.)
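
The effect of -q on line-based post-processing can be seen with a small sketch, assuming GNU ls and a scratch directory from mktemp. The file names are made up, and QUOTING_STYLE is unset only so the environment cannot change the default quoting.

```shell
#!/bin/sh
# Scratch directory with two files; one (hypothetical) name has a newline.
dir=$(mktemp -d)
touch "$dir/plain" "$dir/$(printf 'new\nline')"

# Non-tty output without -q: the embedded newline is written as-is,
# so the two names occupy three lines.
raw_lines=$(unset QUOTING_STYLE; ls "$dir" | wc -l)

# With -q the newline is replaced by '?', so there is one line per file.
q_lines=$(unset QUOTING_STYLE; ls -q "$dir" | wc -l)

printf 'ls    | wc -l: %d\n' "$raw_lines"
printf 'ls -q | wc -l: %d\n' "$q_lines"
rm -rf "$dir"
```

The '?' output is of course only good for display; the original name cannot be recovered from it.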


Similarly, say you would like to view / scroll through your extensive mp3
collection in chronological order (based on when you added the files to
your collection).  You can do it now with the new option:

    [music]$ find -name \*.mp3 -print0 | ls -lrth --files0-from=-

I've used a https://www.pixelbeat.org/scripts/newest script
for a long time with similar find|xargs technique as I described above.

Fun!

Although it looks like this script, despite using 'xargs -0', is actually a good example of handling paths with newlines *incorrectly*:

[1] https://github.com/pixelb/scripts/blob/e337b59/scripts/newest#L68-L70

if [ ! -p /proc/self/fd/1 ]; then
   tr '\n' '\0' |
   xargs -r0 ls -lUd --color=auto --

*wince*  :)
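
A minimal sketch of why the tr step defeats 'xargs -0', assuming GNU find and tr. Here "find -print" merely stands in for whatever newline-separated producer feeds the tr step (that is an assumption, not a reading of the script), and the file name is made up.

```shell
#!/bin/sh
# Scratch directory holding one file whose (hypothetical) name has a newline.
dir=$(mktemp -d)
touch "$dir/$(printf 'one\ntwo')"

# Count the NUL terminators, i.e. how many names xargs -0 would receive.
count_records() { tr -cd '\0' | wc -c; }

# Newline-separated names converted afterwards: the one name becomes two.
via_tr=$(find "$dir" -type f -print | tr '\n' '\0' | count_records)

# NUL-terminated from the start: the name stays in one piece.
via_print0=$(find "$dir" -type f -print0 | count_records)

printf 'tr conversion: %d name(s)\n' "$via_tr"
printf 'find -print0:  %d name(s)\n' "$via_print0"
rm -rf "$dir"
```

Once the names have passed through a newline-delimited stage, no later conversion to NULs can tell a separator from a newline inside a name.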



In saying all of the above, I do agree though that for consistency
commands that need to process all arguments with global context,
should have a --files0-from option.
Currently that's du and wc for total counts, and sort(1) for sorting.
Since ls has sorting functionality, it should have this option too.

Yeah, it's not that ls is so desperately in need of this option; but for completeness, it has sometimes felt like a bit of a missing feature.


Thanks again for your consideration!

Carl
