On 23/03/2026 07:55, Chris Down wrote:
When sort is invoked with an explicit field separator via `-t SEP`,
begfield() and limfield() scan for the separator to locate field
boundaries. Right now the implementation uses a loop that iterates over
bytes one by one, which is not ideal: we scan past many bytes of
non-separator data a single byte at a time.

Let's replace each of these loops with memchr(). On glibc systems,
memchr() uses SIMD to scan 16 bytes per step (NEON on aarch64) or 32
bytes per step (AVX2 on x86_64) rather than 1 byte at a time, so any
field longer than a handful of bytes stands to benefit significantly.

Using the following input data:

    awk 'BEGIN {
        srand(42)
        for (i = 1; i <= 500000; i++)
            printf "%*d,%*d,%d\n", 4+int(rand()*9), 0, 4+int(rand()*9), 0, int(rand()*10000)
    }' > short_csv_500k

    awk 'BEGIN {
        for (i = 1; i <= 500000; i++)
            printf "%100d,%100d,%d\n", 0, 0, int(rand()*10000)
    }' > wide_csv_500k

One can benchmark with:

    hyperfine --warmup 10 --runs 50 \
        "LC_ALL=C sort_before -t, -k3,3n short_csv_500k > /dev/null" \
        "LC_ALL=C sort_after -t, -k3,3n short_csv_500k > /dev/null"

    hyperfine --warmup 10 --runs 50 \
        "LC_ALL=C sort_before -t, -k3,3n wide_csv_500k > /dev/null" \
        "LC_ALL=C sort_after -t, -k3,3n wide_csv_500k > /dev/null"

    hyperfine --warmup 10 --runs 50 \
        "LC_ALL=C sort_before wide_csv_500k > /dev/null" \
        "LC_ALL=C sort_after wide_csv_500k > /dev/null"

The results on i9-14900HX x86_64 with -O2:

    sort -t, -k3,3n (500K lines, 4-12 byte short fields):
        Before: 123.1 ms
        After:  108.1 ms  (-12.2%)

    sort -t, -k3,3n (500K lines, 100 byte wide fields):
        Before: 243.5 ms
        After:  165.9 ms  (-31.9%)

    sort (default, no -k, 500K lines):
        Before: 141.6 ms
        After:  141.8 ms  (unchanged)

And on M1 Pro aarch64 with -O2:

    sort -t, -k3,3n (500K lines, 4-12 byte short fields):
        Before: 98.0 ms
        After:  92.3 ms  (-5.8%)

    sort -t, -k3,3n (500K lines, 100 byte wide fields):
        Before: 240.8 ms
        After:  183.0 ms  (-24.0%)

    sort (default, no -k, 500K lines):
        Before: 145.6 ms
        After:  145.6 ms  (unchanged)

Looking at profiling, the improvement is larger on x86_64 in these runs
because glibc's memchr uses AVX2 to scan 32 bytes per step, versus 16
bytes per step with NEON on aarch64.
---
 src/sort.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index e20ed2c8f..aaab47b50 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1725,8 +1725,8 @@ begfield (struct line const *line, struct keyfield const *key)
   if (tab != TAB_DEFAULT)
     while (ptr < lim && sword--)
       {
-        while (ptr < lim && *ptr != tab)
-          ++ptr;
+        char *sep = memchr (ptr, tab, (size_t) (lim - ptr));
+        ptr = sep ? sep : lim;
         if (ptr < lim)
           ++ptr;
       }
@@ -1778,8 +1778,8 @@ limfield (struct line const *line, struct keyfield const *key)
   if (tab != TAB_DEFAULT)
     while (ptr < lim && eword--)
       {
-        while (ptr < lim && *ptr != tab)
-          ++ptr;
+        char *sep = memchr (ptr, tab, (size_t) (lim - ptr));
+        ptr = sep ? sep : lim;
         if (ptr < lim && (eword || echar))
           ++ptr;
       }

base-commit: 26b5e3d3cfe2361b1aa42ff4d449395124802a8f
Very nice! BTW I'm in the process of updating/optimizing cut(1) and
also noticed how fast memchr() is. It's only slower for <= 2
characters. Since sort(1) doesn't currently handle multi-byte -t, your
patch is fine. When we do handle multi-byte delimiters (like I've
implemented in the cut update), we'll have to adjust this.

Interestingly, I notice memmem() does _not_ have platform-specific
optimizations in glibc (apart from s390). It does use memchr() for
len==1, but for len>=2 it's slower than, say, strstr(), which does have
x86-specific optimizations. So much so that I implemented memmem using
strstr between (unlikely) NULs, and it was considerably faster.

Anyway, I pushed your change.

cheers,
Padraig.
