I am on Mac, not on Linux. On Linux, I can confirm that `wc -m` is much faster than `wcm.py`.
Here is the output on Mac.

$ seq 1000000 > num.txt
$ time wc -m < num.txt
6888896

real	0m2.751s
user	0m2.622s
sys	0m0.042s
$ time ./wcm.py < num.txt
6888896

real	0m1.401s
user	0m1.234s
sys	0m0.051s
$ cat wcm.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
l = 0
for line in sys.stdin:
	l += len(line.decode('utf-8'))
print l

On Sun, May 13, 2018 at 2:18 AM, Assaf Gordon <assafgor...@gmail.com> wrote:
> Hello,
>
> On 12/05/18 07:55 PM, Peng Yu wrote:
>>
>> The following example shows that `wc -m` is even slower than the
>> equivalent Python code. Can this performance bug be fixed?
>
> I'm unable to reproduce the performance issue,
> and suspect other issues are at play.
>
> First:
>>
>> import sys
>> l = 0
>> for line in sys.stdin:
>> 	l += len(line.rstrip('\n').decode('utf-8'))
>> print l
>
> This code is not identical to "wc -m" - it does not count the newlines
> as characters. Example:
>
> $ seq 10 | wc -m
> 21
> $ seq 10 | ./wcm.py
> 11
>
>> $ time ./wcm.py < 1.txt
>> 6786930
>> $ time wc -m < 1.txt
>> 6796930
>
> The fact that you are getting the exact same results indicates that your
> input file (1.txt) does not have newlines at all:
>
> $ seq 10 | tr -d '\n' | ./wcm.py
> 11
> $ seq 10 | tr -d '\n' | wc -m
> 11
>
> Second:
> I suspect the OS's file caching plays a big role in the skewed results.
> It would be better to clear the cache and then time it:
>
> $ seq 1000000 | tr -d '\n' > 1.txt
> $ ls -lhog 1.txt
> -rw-r--r-- 1 5.7M May 13 00:05 1.txt
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time wc -m < 1.txt
> 5888896
>
> real	0m0.136s
> user	0m0.104s
> sys	0m0.004s
>
> versus:
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time ./wcm.py < 1.txt
> 5888896
>
> real	0m0.215s
> user	0m0.040s
> sys	0m0.012s
>
> In my measurements python is twice as slow (for input with no newlines).
> But the file is so small (5.7MB) that measurements can vary a lot.
>
> Third:
> If the file does have newlines (as is more common in typical text
> files), then python becomes almost an order of magnitude slower:
>
> $ seq 1000000 > 2.txt
> $ ls -lhog 2.txt
> -rw-r--r-- 1 6.6M May 13 00:08 2.txt
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time wc -m < 2.txt
> 6888896
>
> real	0m0.158s
> user	0m0.132s
> sys	0m0.000s
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time ./wcm.py < 2.txt
> 5888896
>
> real	0m1.260s
> user	0m1.104s
> sys	0m0.016s
>
> Fourth:
> Unless you are certain your input files are valid,
> using python2 + utf8 is very fragile. Example:
>
> $ printf '\xEEabc\n' | ./wcm.py
> Traceback (most recent call last):
>   File "./wcm.py", line 5, in <module>
>     l += len(line.rstrip('\n').decode('utf-8'))
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xee in position 0:
> invalid continuation byte
>
> While 'wc -m' will continue and not crash:
>
> $ printf '\xEEabc\n' | wc -m
> 4
>
> I hope this resolves the issue.
> If you still think this is a bug, please provide more details
> and a reproducible example.
>
> regards,
>  - assaf

-- 
Regards,
Peng
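[Editor's note: a sketch, not part of the thread.] The line-by-line loop in wcm.py can be rewritten for Python 3 so that it counts characters the same way `wc -m` does on valid UTF-8 input — newlines included — by decoding fixed-size chunks with an incremental decoder; the function name and chunk size below are illustrative, not from the thread:

```python
#!/usr/bin/env python3
# Hypothetical Python 3 variant of wcm.py: counts all characters,
# newlines included, like `wc -m` does for valid UTF-8 input.
import codecs
import sys

def count_chars(stream, chunk_size=1 << 16):
    """Count decoded characters in a binary stream, chunk by chunk.

    The incremental decoder buffers multi-byte sequences that are
    split across chunk boundaries, so any chunk size is safe.
    """
    decoder = codecs.getincrementaldecoder('utf-8')()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(decoder.decode(chunk))
    total += len(decoder.decode(b'', final=True))  # flush buffered tail bytes
    return total

if __name__ == '__main__':
    print(count_chars(sys.stdin.buffer))
```

Reading large chunks from the unbuffered byte stream avoids both the per-line iteration overhead and the implicit text-mode decoding of `sys.stdin`, which is usually faster than the line-by-line loop in the thread.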
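On the fourth point, a minimal sketch (again an editor's addition, not from the thread) of how the Python side can at least avoid crashing on invalid bytes, by decoding with errors='replace'. Note the count can still differ from `wc -m`: the substitution yields one U+FFFD per invalid sequence, whereas `wc -m` above reported 4 characters for the 5-byte input:

```python
#!/usr/bin/env python3
# Hypothetical tolerant counter: invalid UTF-8 sequences become U+FFFD
# instead of raising UnicodeDecodeError, so the script never crashes.
import sys

def count_chars_lenient(data):
    """Count characters in raw bytes, substituting invalid sequences."""
    return len(data.decode('utf-8', errors='replace'))

if __name__ == '__main__':
    print(count_chars_lenient(sys.stdin.buffer.read()))
```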