bug#37093: wc runs 100% cpu when in pipeline or tee >(wc)

2019-08-20 Thread Assaf Gordon

tag 37093 notabug
close 37093
stop

Hello,

On 2019-08-19 10:44 p.m., Edward Huff wrote:

In the demo below, dd uses 0.665s to write 1GiB of zeros.
sha256sum uses 4.285s to calculate the sha256 of 1GiB of zeros.
wc uses 32.160s to count 1GiB of zeros.


[...]


baseline results:
$ dd if=/dev/zero count=$((1024*1024)) bs=1024 | tee >(sha256sum>&2) | wc
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.5007 s, 33.0 MB/s
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
   0   0 1073741824
$


First,
Try to avoid UTF8 locales (i.e., force a C/POSIX locale with LC_ALL=C)
which makes 'wc' much faster.

On my computer:

With UTF8 locale:

  $ dd if=/dev/zero count=$((1024*1024)) bs=1024 \
| tee >(sha256sum>&2) | time --portability wc
  1048576+0 records in
  1048576+0 records out
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 46.5928 s, 23.0 MB/s
  49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
0   0 1073741824
  real 46.59
  user 46.37
  sys 0.19

With C locale:

  $ dd if=/dev/zero count=$((1024*1024)) bs=1024 \
   | tee >(sha256sum>&2) | LC_ALL=C time --portability wc
  1048576+0 records in
  1048576+0 records out
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.60285 s, 125 MB/s
  49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
0   0 1073741824
  real 8.60
  user 5.22
  sys 0.26


Second,
The "word counting" feature in 'wc' is the main cpu-hog.
If you avoid that (i.e. counting only lines, or only characters),
'wc' is even faster (and it automatically ignores UTF8 issues):

  $ dd if=/dev/zero count=$((1024*1024)) bs=1024 \
   | tee >(sha256sum>&2) \
   | \time --portability wc -c
  1048576+0 records in
  1048576+0 records out
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.59429 s, 141 MB/s
  49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
  1073741824

  real 7.59
  user 0.10
  sys 0.71

Notice that the "real time" wasn't changed much (from 8.6s to 7.59s), 
but the actual work performed by 'wc' (measured in "user time") is down

drastically.


Third,
If you are comfortable with compiling Coreutils from source,
you can build it using optimized hashing function from OpenSSL, like so:

 ./configure --with-openssl
 make

Then, "sha256sum" will be faster (about 2x fast on my computer).

If you don't want to re-compile it, consider using "openssl" directly
to calculate the checksum, like so:

  dd if=/dev/zero count=1K bs=1M | tee >(openssl sha256>&2) | wc -c


Fourth,
To save few more microseconds, consider using dd with larger block size 
(bs=) and fewer blocks (count=), e.g.:


   $ time dd if=/dev/zero of=/dev/null count=1M bs=1K
   1048576+0 records in
   1048576+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.865853 s, 1.2 GB/s

   real 0m0.868s
   user 0m0.288s
   sys  0m0.579s

   $ time dd if=/dev/zero of=/dev/null count=1K bs=1M
   1024+0 records in
   1024+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0998688 s, 10.8 GB/s

   real 0m0.102s
   user 0m0.000s
   sys  0m0.102s

This won't reduce the total time by much, but will result in
fewer sys-calls, and less CPU kernel time (at least by a tiny bit).
The effect is more noticeable when reading or writing to a physical disk.



Lastly,
If you use GNU time instead of the shell's built-in 'time' function,
you can specify custom output format,
and easily show the timing of each program in the pipeline.
Example:

$ FMT="\n=== CMD: %C ===\nreal %e\tuser %U\tsys %S\n"
$ \time -f "$FMT" dd if=/dev/zero count=1M bs=1K \
 | \time -f "$FMT" tee >(\time -f "$FMT" sha256sum>&2) \
 | \time -f "$FMT" wc -c
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.77339 s, 138 MB/s

=== CMD: dd if=/dev/zero count=1048576 bs=1024 ===
real 7.77   user 0.36   sys 1.65


=== CMD: tee /dev/fd/63 ===
real 7.77   user 0.10   sys 1.30

49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -

=== CMD: sha256sum ===
real 7.77   user 7.47   sys 0.27

1073741824

=== CMD: wc -c ===
real 7.77   user 0.05   sys 0.76


As such, I'm closing this as "not a bug",
but discussion can continue by replying to this thread.

regards,
 - assaf






bug#37093: wc runs 100% cpu when in pipeline or tee >(wc)

2019-08-20 Thread Bernhard Voelker
On 8/20/19 6:44 AM, Edward Huff wrote:
> In the demo below, dd uses 0.665s to write 1GiB of zeros.
> sha256sum uses 4.285s to calculate the sha256 of 1GiB of zeros.
> wc uses 32.160s to count 1GiB of zeros.
> 
> Linux localhost 5.2.8-200.fc30.x86_64 #1 SMP Sat Aug 10 13:21:39 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux
> coreutils-8.31-2.fc30.x86_64
> dd (coreutils) 8.31
> wc (GNU coreutils) 8.31
> sha256sum (GNU coreutils) 8.31
> 
> baseline results:
> $ dd if=/dev/zero count=$((1024*1024)) bs=1024 | tee >(sha256sum>&2) | wc
> 1048576+0 records in
> 1048576+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.5007 s, 33.0 MB/s
> 49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
>   0   0 1073741824
> $
> 
> Demo script:
> $ cat wc-at-100pct
> #!/bin/bash
> set -xv
> rm pipe* tee*
> {
>{
>   time dd if=/dev/zero count=$((1024*1024)) bs=1024
>} 2>>pipedd
> } | {
>tee >(
>   {
>  time sha256sum
>   } >teesha256 2>&1
>)
> } | {
>{
>   time wc
>} > pipewc 2>&1
> }
> 
> {
>{
>   time dd if=/dev/zero count=$((1024*1024)) bs=1024
>} 2>>pipedd
> } | {
>tee >(
>   {
>  time wc
>   } >teewc 2>&1
>)
> } | {
>{
>   time sha256sum
>} > pipesha256sum 2>&1
> }
> 
> head pipe* tee*
> $
> 
> Results:
> ./wc-at-100pct
> rm pipe* tee*
> + rm pipedd pipesha256sum pipewc teesha256 teewc
> {
>{
>   time dd if=/dev/zero count=$((1024*1024)) bs=1024
>} 2>>pipedd
> } | {
>tee >(
>   {
>  time sha256sum
>   } >teesha256 2>&1
>)
> } | {
>{
>   time wc
>} > pipewc 2>&1
> }
> + tee /dev/fd/63
> 
> {
>{
>   time dd if=/dev/zero count=$((1024*1024)) bs=1024
>} 2>>pipedd
> } | {
>tee >(
>   {
>  time wc
>   } >teewc 2>&1
>)
> } | {
>{
>   time sha256sum
>} > pipesha256sum 2>&1
> }
> + tee /dev/fd/63
> 
> head pipe* tee*
> + head pipedd pipesha256sum pipewc teesha256 teewc
> ==> pipedd <==
> + dd if=/dev/zero count=1048576 bs=1024
> 1048576+0 records in
> 1048576+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.5495 s, 33.0 MB/s
> 
> real 0m32.550s
> user 0m0.665s
> sys 0m1.503s
> + dd if=/dev/zero count=1048576 bs=1024
> 1048576+0 records in
> 
> ==> pipesha256sum <==
> + sha256sum
> 49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
> 
> real 0m31.457s
> user 0m4.285s
> sys 0m0.562s
> 
> ==> pipewc <==
> + wc
>   0   0 1073741824
> 
> real 0m32.555s
> user 0m32.160s
> sys 0m0.247s
> 
> ==> teesha256 <==
> ++ sha256sum
> 49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
> 
> real 0m32.553s
> user 0m4.333s
> sys 0m0.704s
> 
> ==> teewc <==
> ++ wc
>   0   0 1073741824
> 
> real 0m31.456s
> user 0m31.121s
> sys 0m0.221s

I'm not sure what your report is trying to demonstrate exactly.

Here, your script just takes ~5-6 seconds for each pass on a ~5 year-old
"Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz".

Furthermore, replacing 'time' by 'env time -v' shows that wc(1) is taking
less CPU than sha256sum - exactly as one would expect:

  $ grep CPU * | column -t
  pipedd: Percent  of  CPU  this  job  got:  27%
  pipedd: Percent  of  CPU  this  job  got:  29%
  pipesha256sum:  Percent  of  CPU  this  job  got:  99%
  pipewc: Percent  of  CPU  this  job  got:  65%
  teesha256:  Percent  of  CPU  this  job  got:  99%
  teewc:  Percent  of  CPU  this  job  got:  66%

> coreutils-8.31-2.fc30.x86_64

It looks like you are using the Fedora package of coreutils ... which
may have some additional patches compared to upstream GNU coreutils.
Did you try the upstream version as well?

Have a nice day,
Berny





bug#37093: wc runs 100% cpu when in pipeline or tee >(wc)

2019-08-19 Thread Edward Huff
In the demo below, dd uses 0.665s to write 1GiB of zeros.
sha256sum uses 4.285s to calculate the sha256 of 1GiB of zeros.
wc uses 32.160s to count 1GiB of zeros.

Linux localhost 5.2.8-200.fc30.x86_64 #1 SMP Sat Aug 10 13:21:39 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux
coreutils-8.31-2.fc30.x86_64
dd (coreutils) 8.31
wc (GNU coreutils) 8.31
sha256sum (GNU coreutils) 8.31

baseline results:
$ dd if=/dev/zero count=$((1024*1024)) bs=1024 | tee >(sha256sum>&2) | wc
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.5007 s, 33.0 MB/s
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
  0   0 1073741824
$

Demo script:
$ cat wc-at-100pct
#!/bin/bash
set -xv
rm pipe* tee*
{
   {
  time dd if=/dev/zero count=$((1024*1024)) bs=1024
   } 2>>pipedd
} | {
   tee >(
  {
 time sha256sum
  } >teesha256 2>&1
   )
} | {
   {
  time wc
   } > pipewc 2>&1
}

{
   {
  time dd if=/dev/zero count=$((1024*1024)) bs=1024
   } 2>>pipedd
} | {
   tee >(
  {
 time wc
  } >teewc 2>&1
   )
} | {
   {
  time sha256sum
   } > pipesha256sum 2>&1
}

head pipe* tee*
$

Results:
./wc-at-100pct
rm pipe* tee*
+ rm pipedd pipesha256sum pipewc teesha256 teewc
{
   {
  time dd if=/dev/zero count=$((1024*1024)) bs=1024
   } 2>>pipedd
} | {
   tee >(
  {
 time sha256sum
  } >teesha256 2>&1
   )
} | {
   {
  time wc
   } > pipewc 2>&1
}
+ tee /dev/fd/63

{
   {
  time dd if=/dev/zero count=$((1024*1024)) bs=1024
   } 2>>pipedd
} | {
   tee >(
  {
 time wc
  } >teewc 2>&1
   )
} | {
   {
  time sha256sum
   } > pipesha256sum 2>&1
}
+ tee /dev/fd/63

head pipe* tee*
+ head pipedd pipesha256sum pipewc teesha256 teewc
==> pipedd <==
+ dd if=/dev/zero count=1048576 bs=1024
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.5495 s, 33.0 MB/s

real 0m32.550s
user 0m0.665s
sys 0m1.503s
+ dd if=/dev/zero count=1048576 bs=1024
1048576+0 records in

==> pipesha256sum <==
+ sha256sum
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -

real 0m31.457s
user 0m4.285s
sys 0m0.562s

==> pipewc <==
+ wc
  0   0 1073741824

real 0m32.555s
user 0m32.160s
sys 0m0.247s

==> teesha256 <==
++ sha256sum
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -

real 0m32.553s
user 0m4.333s
sys 0m0.704s

==> teewc <==
++ wc
  0   0 1073741824

real 0m31.456s
user 0m31.121s
sys 0m0.221s