James

> wc does not print the correct totals when given a large number of files:
> 
> example ("How big is the linux C-source?"):
> # find /usr/src/linux/ -type f -name '*.c' | xargs wc | grep total
>  403654 1320103 10885558 total
>  619345 2148481 17926183 total
>  291311  948158 7473011 total
>   91822  300088 2475385 total
> 
> The sum of these columns seems to be the correct total.

This is not a bug, as you are not using xargs in a manner that can be
expected to do what you want it to do.  But xargs is doing exactly
what it should be doing.

> Intuitively wc should not print more than one total, and I don't
> think xargs calls wc more than once (or xargs is broken!).

Unfortunately I have to report that your intuition about this is
incorrect.  xargs might conceivably call a command as many times as it
has input lines, though that would be silly.  But since it may call
the command more than once, your scripts should always be able to
handle that case for fully robust operation.  (I would probably also
use the zero-terminated file names extension, find -print0 with
xargs -0, if possible.  But that is another thread of discussion.)
Normally on traditional UNIX systems xargs buffers up LINE_MAX bytes
of input data and hands them to the command for execution.  GNU xargs
says its "default is as large as possible, up to 20k characters",
which is usually much bigger and generally results in better
performance through fewer invocations.
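The splitting is easy to see on a small scale.  Here is a sketch that
uses the -n option (limit the number of arguments per invocation) to
force xargs to run echo several times:

```shell
# Force xargs to pass at most 2 arguments per invocation; with five
# input words, echo runs three separate times, one output line each.
printf '%s\n' one two three four five | xargs -n 2 echo
# one two
# three four
# five
```

In real use xargs splits on total command-line size rather than a
fixed argument count, but the effect on the invoked command is the
same.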

Regardless, xargs will execute the command as many times as needed to
drain the input source.  Note that the number of executions is
affected by the size of the data, not by the number of items, which
means the count can vary significantly depending on the length of the
file names involved.

> (Is it the find/xargs combo that makes it fail?). 

Yes.

> My workaround is to do:
> 
> find /usr/src/linux/ -type f -name '*.c' | cat | wc

I think your test cases got mixed up here when you cut and pasted them
into the report.  This would just count the size of the names of the
files and not the contents.  You should definitely get different
results.  I am assuming you meant

 find /usr/src/linux/ -type f -name '*.c' | xargs cat | wc

This looks like a good solution to me.  I would be using something
like that myself.
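If any of the file names contain whitespace, the zero-terminated
extension mentioned earlier makes the same pipeline robust.  A sketch,
reusing the path from your example:

```shell
# NUL-terminated names survive spaces and newlines in file names,
# which would otherwise be mangled at the find/xargs boundary.
find /usr/src/linux/ -type f -name '*.c' -print0 | xargs -0 cat | wc
```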

> but it still looks like a bug in wc.
> Looks like wc behaves the same on Solaris

Why does it still look like a bug?  If two completely independent
implementations behave the same way, then the odds are pretty good
that that behavior is the right one.
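If you want wc's per-file counts as well as a single grand total, one
option is to sum the per-invocation "total" lines yourself.  A sketch
(it assumes no file is literally named "total" and no names contain
whitespace):

```shell
# Pass per-file lines through unchanged; accumulate each batch's
# "total" line and print one grand total at the end.
find /usr/src/linux/ -type f -name '*.c' | xargs wc \
  | awk '$4 == "total" {l += $1; w += $2; c += $3; next}
         {print}
         END {print l, w, c, "grand total"}'
```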

Here is another way to get a hint about what is happening.  Change the
command to use echo so that you can see exactly what commands are
getting executed.

  find /usr/src/linux -type f -name '*.c' | xargs echo wc

And you will probably want to save the output to a file since I expect
it to be large.  You should find that xargs is running wc multiple
times, four times in your particular example, with a huge number of
arguments each time.  Looking at it that way you can see that wc is
doing exactly what it is told to do.
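A quick variation on the same echo trick counts the invocations for
you (the path is again from your example):

```shell
# Each output line of `xargs echo wc` corresponds to one invocation
# xargs would have made, so counting the lines counts the executions.
find /usr/src/linux -type f -name '*.c' | xargs echo wc | wc -l
```

In your case this should print 4.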

>  - is this some obscure
> buffering problem? At least this should be documented (found nothing
> from man or info).

Hopefully by now I have convinced you that everything is working as it
should and is documented properly.  The man page for xargs says:

 DESCRIPTION
      This manual page documents the GNU version of xargs.  xargs reads
      arguments from the standard input, delimited by blanks (which can be
      protected with double or single quotes or a backslash) or newlines,
      and executes the command (default is /bin/echo) one or more times with
      any initial-arguments followed by arguments read from standard input.
      Blank lines on the standard input are ignored.

The part "executes the command [...] one or more times" is pretty
explicit.  :-)

Hope this helps

Bob
