GĂ©raud Meyer wrote:
> Bob Proulx wrote:
> >   head datafile | xargs -r du
>
> First the command line length is limited, which in certain cases (when
> the number of files is gigantic) can make this command fail.

The xargs command is specifically designed to understand the limitations
of command line length.  It will not fail even when there are a huge
number of arguments.  The limitation of ARG_MAX is almost the entire
reason that the xargs command exists.
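
For example (a quick sketch, assuming GNU xargs and seq), feeding far
more input than fits on one command line makes xargs run the command
several times rather than fail:

  seq 200000 | xargs echo | wc -l

If that prints more than 1 then xargs silently split the arguments
across multiple echo invocations.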

> It is the case if someone (like me) creates lists of all the files,
> not just of some directories. N.B. The -n option of xargs cannot be
> used to overcome this limitation if one would like du to produce a
> grand total.

True.  If there are multiple invocations of du then all of them would
need to be summed up into a total.  If a total is needed then using
--files0-from would be better.
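
For example, a sketch of producing one grand total directly, with no
intermediate file (using the fact that --files0-from=- reads the names
from standard input):

  find . -type f -print0 | du -c --files0-from=- | tail -n1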

> Second by default xargs considers any blank character a separator and
> interprets certain characters, so that it must be used with still
> another option: `-d \n'.

True again.  But the --files0-from option is available.

> >   find . -type f -exec du {} +
>
> Thanks but the example you give has problems similar to the one above:
> 1) The command line may not be arbitrarily long.

Incorrect.  Please see the documentation for "{} +" in the find
manual.  This handles arbitrarily large numbers of arguments.

> 2) You cannot easily filter the list after find.

I do not understand why you think it cannot be filtered after this
point.  I see no problem doing so.

However if there are multiple invocations of du then grand totals
would need an additional summation step.  This is the same as the
above using xargs.  Again it is better to use --files0-from.
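
If one did want to sum the pieces, a sketch of that extra summation
step (in du's default block-size units) with awk:

  find . -type f -exec du {} + | awk '{ sum += $1 } END { print sum }'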

> And it has an additional major drawback:
> 3) Each time you want to run du on the list, you have to recreate the
> list (which can take a lot of time).

Yes.  But if you want to cache the list output from find then using
null-terminated strings is again the best way to handle the filename
data.  And null-terminated strings may be fed directly into du using
the --files0-from option.

  find . -type f -print0 > cached-list
  du --files0-from=cached-list

> I never said that it is impossible to use du together with other
> programs. I said that the absence of the --files-from option makes it
> harder by bounding the user to using non default options of almost all
> the other programs, or to converting the list-file used (e.g. with ` tr
> $'\n' $'\0' ') before using it with du. See the example below.

Because it is always easy to convert from one form to the other, I am
not convinced that lacking the less general-purpose option is a
hardship.

This:

  tr '\012' '\000' | du --files0-from=-

is equivalent to this:
  
  du --files-from=-

Therefore I don't think this is enough to warrant an additional option,
because one is a complete superset of the other.  Also, because the
newline-terminated form is not general purpose, I don't think we should
make it easy to use.  That would only encourage people to use it, and
using it fails to handle general filenames.

However if patches were made available the upstream maintainer may see
things differently and incorporate the additional feature.

> What I mean is that it makes sense and a coherent coreutils package to
> also implement --files-from. Furthermore, as --files0-from is already
> implemented, it should not be much work to add the almost identical
> --files-from option.

I disagree, simply because whitespace in filenames causes problems with
the non-null-terminated option.  Therefore to handle arbitrary
filenames the null-terminated option really needs to be used.  The
non-null version is only a half solution.  It would have its own set
of bug reports about it not being sufficient.  If the null terminated
one is sufficient and general purpose then it is the better one to use
all of the time.
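
A quick illustration of the half solution (run in a scratch directory,
using the same bash $'...' quoting that appears elsewhere in this
thread): one file whose name contains a newline becomes two bogus
entries in a newline-delimited list.

  touch $'one\ntwo'         # a single, legal filename
  find . -type f | wc -l    # reports 2 lines for this one file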

> Example:
> $ find <dir> -<options> >files.list
>     # create a NL-separated list of files

  find <dir> -<options> -print0 > files0.list

> $ wc -l files.list
>     # wc could not be used to count NUL-separated items
>     # (another more complex program would be needed)

But wc -l cannot be used to count arbitrary filenames, specifically
those that include a newline.  So the above can't work in general.  If
you really don't care about newlines in filenames then the count may
be recovered using xargs.

  xargs -0n1 < files0.list | wc -l
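
Or, if an exact count is needed even with newlines in names, counting
the NUL terminators directly works, since each entry ends with exactly
one NUL (assuming GNU tr):

  tr -cd '\0' < files0.list | wc -c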

> $ sed -i -e <expr> files.list
>     # sed is designed for "regular" lines (NL-terminated)

  xargs -r0 sed -i -e <expr> < files0.list

> $ emacs files.list
>     # editing a NUL-separated list in emacs would be inconvenient

Emacs of course handles binary data in files fine.  But again, if
newlines in filenames are not a problem the file can be converted
back to a newline-delimited file.

  xargs -0n1 < files0.list > file.list

But information may be lost in that process.

> $ tr $'\n' $'\0' <files.list >files0.list
>     # an additional command that I would like not to type

Then I know you won't like the xargs conversion command either.  :-)

For those reading this later, the following is a more standard way of
writing the above bash-specific character sequences.

  tr '\012' '\000'

But again, that is an inexact conversion, since filenames may
themselves contain newlines.  It is not a general purpose way to
handle the files.

> $ du -csb --from-files0 files0.list | tail -n1
>     # note the additional tail command that also could be avoided
>     # if there were an option to only display the grand total

That is a separate issue.  And possibly a completely separate wishlist
item!  :-)

> $ IFS=$'\n' for i in `cat files.list`; do archive "$i"; done
>     # if the list is not too long
>     # does not work with NUL-separated lists (or I do not know how)

Oh that hurts to read!  For one thing there is an entirely useless cat
process in there: $(<files.list) replaces `cat files.list`.  Plus the
shell will not have a problem with too many arguments in this case,
since ARG_MAX is a limit only when dealing with exec(2).

But as long as there are bash-isms such as $'\n' being introduced there
is no reason not to use the bash read command with the bash-specific -d
option.  This will read null-terminated filenames.  (The IFS= and -r
below keep read from trimming whitespace or mangling backslashes in
the names.)

  while IFS= read -r -d '' i; do archive "$i"; done < files0.list

Or some prefer:

  while IFS= read -r -d $'\0' i; do archive "$i"; done < files0.list

> $ xargs -n1 -d \\n --arg-file=files.list archive
>     # if the list is very long
>     # here files0.list could be used

  xargs -0n1 --arg-file=files0.list archive

> P.S. The possibility for the --files0-from option to use the standard
> input should be added to the documentation.

Agreed.

> The info page for du mentions the fact that the --files0-from option is
> useful "when the list of file names is so long that it may exceed a
> command line length limitation."

Yes but the next part is:

     In such cases, running `du' via `xargs' is undesirable
     because it splits the list into pieces and makes `du' print a
     total for each sublist rather than for the entire list.  One way
     to produce a list of null-byte-terminated file names is with GNU
     `find', using its `-print0' predicate.

Bob
