On 23/03/2026 07:55, Chris Down wrote:
When sort is invoked with an explicit field separator via `-t SEP`,
begfield() and limfield() scan for the separator to locate field
boundaries. Right now the implementation uses a loop that iterates over
bytes one by one, which is not ideal: we scan past many bytes of
non-separator data a single byte at a time.

Let's replace each of these loops with memchr(). On glibc systems,
memchr() uses SIMD to scan 16 bytes per step (NEON on aarch64) or 32
bytes per step (AVX2 on x86_64) rather than 1 byte at a time, so any
field longer than a handful of bytes stands to benefit significantly.

Using the following input data:

    awk 'BEGIN {
        srand(42)
        for (i = 1; i <= 500000; i++)
            printf "%*d,%*d,%d\n", 4+int(rand()*9), 0, 4+int(rand()*9), 0, int(rand()*10000)
    }' > short_csv_500k

    awk 'BEGIN {
        for (i = 1; i <= 500000; i++)
            printf "%100d,%100d,%d\n", 0, 0, int(rand()*10000)
    }' > wide_csv_500k

One can benchmark with:

    hyperfine --warmup 10 --runs 50 \
        "LC_ALL=C sort_before -t, -k3,3n short_csv_500k > /dev/null" \
        "LC_ALL=C sort_after -t, -k3,3n short_csv_500k > /dev/null"

    hyperfine --warmup 10 --runs 50 \
        "LC_ALL=C sort_before -t, -k3,3n wide_csv_500k > /dev/null" \
        "LC_ALL=C sort_after -t, -k3,3n wide_csv_500k > /dev/null"

    hyperfine --warmup 10 --runs 50 \
        "LC_ALL=C sort_before wide_csv_500k > /dev/null" \
        "LC_ALL=C sort_after wide_csv_500k > /dev/null"

The results on i9-14900HX x86_64 with -O2:

    sort -t, -k3,3n (500K lines, 4-12 byte short fields):
        Before: 123.1 ms
        After:  108.1 ms  (-12.2%)

    sort -t, -k3,3n (500K lines, 100 byte wide fields):
        Before: 243.5 ms
        After:  165.9 ms  (-31.9%)

    sort (default, no -k, 500K lines):
        Before: 141.6 ms
        After:  141.8 ms  (unchanged)

And on M1 Pro aarch64 with -O2:

    sort -t, -k3,3n (500K lines, 4-12 byte short fields):
        Before: 98.0 ms
        After:  92.3 ms  (-5.8%)

    sort -t, -k3,3n (500K lines, 100 byte wide fields):
        Before: 240.8 ms
        After:  183.0 ms  (-24.0%)

    sort (default, no -k, 500K lines):
        Before: 145.6 ms
        After:  145.6 ms  (unchanged)

Looking at profiling, the improvement is larger on x86_64 in these runs
because glibc's memchr uses AVX2 to scan 32 bytes per step, versus 16
bytes per step with NEON on aarch64.
---
 src/sort.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index e20ed2c8f..aaab47b50 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1725,8 +1725,8 @@ begfield (struct line const *line, struct keyfield const *key)
   if (tab != TAB_DEFAULT)
     while (ptr < lim && sword--)
       {
-        while (ptr < lim && *ptr != tab)
-          ++ptr;
+        char *sep = memchr (ptr, tab, (size_t) (lim - ptr));
+        ptr = sep ? sep : lim;
         if (ptr < lim)
           ++ptr;
       }
@@ -1778,8 +1778,8 @@ limfield (struct line const *line, struct keyfield const *key)
   if (tab != TAB_DEFAULT)
     while (ptr < lim && eword--)
       {
-        while (ptr < lim && *ptr != tab)
-          ++ptr;
+        char *sep = memchr (ptr, tab, (size_t) (lim - ptr));
+        ptr = sep ? sep : lim;
         if (ptr < lim && (eword || echar))
           ++ptr;
       }

base-commit: 26b5e3d3cfe2361b1aa42ff4d449395124802a8f
Very nice! BTW I'm in the process of updating/optimizing cut(1) and
also noticed how fast memchr() is. It's only slower for <= 2
characters. Since sort(1) doesn't currently handle multi-byte -t, your
patch is fine. When we do handle multi-byte delimiters (like I've
implemented in the cut update), we'll have to adjust this.

Interestingly, I notice memmem() does _not_ have platform-specific
optimizations in glibc (apart from s390). It does use memchr() for
len==1, but for len>=2 it's slower than, say, strstr(), which does have
x86-specific optimizations. So much so that I implemented memmem using
strstr between (unlikely) NULs, and it was considerably faster.

Anyway, I pushed your change.

cheers,
Padraig.
