I am on Mac, not on Linux. On Linux, I can confirm that `wc -m` is much faster than `wcm.py`.
Here is the output on Mac.

$ seq 1000000 > num.txt
$ time wc -m < num.txt
6888896

real	0m2.751s
user	0m2.622s
sys	0m0.042s
$ time ./wcm.py < num.txt
6888896

real	0m1.401s
user	0m1.234s
sys	0m0.051s
$ cat wcm.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
l = 0
for line in sys.stdin:
	l += len(line.decode('utf-8'))
print l

On Sun, May 13, 2018 at 2:18 AM, Assaf Gordon <assafgor...@gmail.com> wrote:
> Hello,
>
> On 12/05/18 07:55 PM, Peng Yu wrote:
>>
>> The following example shows that `wc -m` is even slower than the
>> equivalent Python code. Can this performance bug be fixed?
>
> I'm unable to reproduce the performance issue,
> and suspect other issues are at play.
>
> First:
>>
>> import sys
>> l = 0
>> for line in sys.stdin:
>> 	l += len(line.rstrip('\n').decode('utf-8'))
>> print l
>
> This code is not identical to "wc -m" - it does not count the newlines
> as characters. Example:
>
> $ seq 10 | wc -m
> 21
> $ seq 10 | ./wcm.py
> 11
>
>> $ time ./wcm.py < 1.txt
>> 6786930
>> $ time wc -m < 1.txt
>> 6796930
>
> The fact that you are getting the exact same results indicates that your
> input file (1.txt) does not have newlines at all:
>
> $ seq 10 | tr -d '\n' | ./wcm.py
> 11
> $ seq 10 | tr -d '\n' | wc -m
> 11
>
> Second:
> I suspect the OS's file caching plays a big role in the skewed results.
> It would be better to clear the cache and then time it:
>
> $ seq 1000000 | tr -d '\n' > 1.txt
> $ ls -lhog 1.txt
> -rw-r--r-- 1 5.7M May 13 00:05 1.txt
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time wc -m < 1.txt
> 5888896
>
> real	0m0.136s
> user	0m0.104s
> sys	0m0.004s
>
> versus:
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time ./wcm.py < 1.txt
> 5888896
>
> real	0m0.215s
> user	0m0.040s
> sys	0m0.012s
>
> In my measurements python is twice as slow (for input with no newlines).
> But the file is so small (5.7MB) that measurements can vary a lot.
>
> Third:
> If the file does have newlines (as is more common in typical text
> files), then python becomes almost an order of magnitude slower:
>
> $ seq 1000000 > 2.txt
> $ ls -lhog 2.txt
> -rw-r--r-- 1 6.6M May 13 00:08 2.txt
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time wc -m < 2.txt
> 6888896
>
> real	0m0.158s
> user	0m0.132s
> sys	0m0.000s
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time ./wcm.py < 2.txt
> 5888896
>
> real	0m1.260s
> user	0m1.104s
> sys	0m0.016s
>
> Fourth:
> Unless you are certain your input files are valid,
> using python2 + utf8 is very fragile. Example:
>
> $ printf '\xEEabc\n' | ./wcm.py
> Traceback (most recent call last):
>   File "./wcm.py", line 5, in <module>
>     l += len(line.rstrip('\n').decode('utf-8'))
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xee in position 0:
> invalid continuation byte
>
> While 'wc -m' will continue and not crash:
>
> $ printf '\xEEabc\n' | wc -m
> 4
>
> I hope this resolves the issue.
> If you still think this is a bug, please provide more details
> and a reproducible example.
>
> regards,
>  - assaf

-- 
Regards,
Peng
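[Editor's note: a sketch, not part of the thread.] The line-by-line loop in wcm.py can be rewritten for Python 3 so that it counts characters the same way `wc -m` does on valid UTF-8 input — newlines included — by decoding fixed-size chunks with an incremental decoder; the function name and chunk size below are illustrative, not from the thread:

```python
#!/usr/bin/env python3
# Hypothetical Python 3 variant of wcm.py: counts all characters,
# newlines included, like `wc -m` does for valid UTF-8 input.
import codecs
import sys

def count_chars(stream, chunk_size=1 << 16):
    """Count decoded characters in a binary stream, chunk by chunk.

    The incremental decoder buffers multi-byte sequences that are
    split across chunk boundaries, so any chunk size is safe.
    """
    decoder = codecs.getincrementaldecoder('utf-8')()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(decoder.decode(chunk))
    total += len(decoder.decode(b'', final=True))  # flush buffered tail bytes
    return total

if __name__ == '__main__':
    print(count_chars(sys.stdin.buffer))
```

Reading large chunks from the unbuffered byte stream avoids both the per-line iteration overhead and the implicit text-mode decoding of `sys.stdin`, which is usually faster than the line-by-line loop in the thread.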
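On the fourth point, a minimal sketch (again an editor's addition, not from the thread) of how the Python side can at least avoid crashing on invalid bytes, by decoding with errors='replace'. Note the count can still differ from `wc -m`: the substitution yields one U+FFFD per invalid sequence, whereas `wc -m` above reported 4 characters for the 5-byte input:

```python
#!/usr/bin/env python3
# Hypothetical tolerant counter: invalid UTF-8 sequences become U+FFFD
# instead of raising UnicodeDecodeError, so the script never crashes.
import sys

def count_chars_lenient(data):
    """Count characters in raw bytes, substituting invalid sequences."""
    return len(data.decode('utf-8', errors='replace'))

if __name__ == '__main__':
    print(count_chars_lenient(sys.stdin.buffer.read()))
```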