Re: performance bug of `wc -m` on glibc systems

Bruno Haible Sun, 20 May 2018 13:44:11 -0700

Pádraig Brady wrote:
> $ yes áááááááááááááááááááá | head -n100000 > mbc.txt
> $ yes 12345678901234567890 | head -n100000 > num.txt
> 
> ===== Before ====
> 
> $ time src/wc -Lm < mbc.txt
> 2100000      20
> real    0m0.186s
> 
> $ time src/wc -m < mbc.txt
> 2100000
> real    0m0.186s


Profiling this on a glibc system, using the valgrind + kcachegrind
combination [1], I obtain the attached treemap.
$ valgrind --tool=callgrind src/wc -m < mbc.txt
$ kcachegrind callgrind.out.10379   # switch to callee map

My reading of this output is:

  * The transformation of the byte strings to a character stream
    takes more than 50% of the CPU time. I find this extraordinary
    slow, given that what this parsing does is essentially to combine
    2 adjacent bytes in a word, according to an algorithmic formula
    (no table lookup).

    What each step does:
    - rpl_mbtowc: Implements the gnulib workaround against
        "In the C or POSIX locales, this function can return @code{(size_t) -1}
         and set @code{errno} to @code{EILSEQ}"
    - mbrtowc: Adapt the invocation conventions of mbrtowc to the invocation
      conventions of __gconv_transform_*.
    - __gconv_transform_utf8_internal: This is the transformation UTF-8 -> 
UCS-4.

    It is slow because the __gconv_transform_* functions are optimized to do
    long runs of characters all at once (i.e. for iconv() and mbstowcs). 
However,
    what 'wc' and the coreutils in general do is to convert one character, check
    status and update application logic, then convert another character, check
    status and update application logic again, etc.

    There would be two ways to make 'wc' more performant:

      - Change the 'wc' program to transform buffers at once. The problem here
        is that it comes with degradation of error-handling logic (ability
        to precisely diagnose errors and continue appropriately) and complicates
        the code, making it less maintainable.

      - Change glibc to add an optimized 'mbrtwoc' function for each encoding
        (or at least for the most frequently used encodings: the built-in and
        8-bit ones).
        I estimate that this optimization would make 60% of the mbrtwoc times
        go away, thus speeding up 'wc' by 30%.

  * The other operations (print, space, and width, for each character)
    are reasonably optimized in glibc.

Bruno

[1] http://www.pixelbeat.org/programming/profiling/

callgrind.out.10379.png
Description: application/kcachegrind

Re: performance bug of `wc -m` on glibc systems

Reply via email to