Re: Unexpected behavior of sort -hu

Pascal Stumpf Wed, 26 Mar 2025 12:39:43 -0700

On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote:
> Hi Pascal,
> 
> Pascal Stumpf wrote on Tue, Mar 18, 2025 at 04:59:40PM +0100:
> > On Sat, 15 Mar 2025 20:00:42 +0100, Ingo Schwarze wrote:
> >> Van Dung Ha wrote on Wed, Mar 12, 2025 at 02:25:02PM +0100:
> 
> >>> Recently, I was parsing a few IP addresses from the snort logs
> >>> to populate a pf table and encountered something counter-intuitive.
> >>> Here is an example  source list.
> >>> [...]
> >>> 77.224.14.2
> >>> 77.224.14.21
> >>> 77.224.14.18
> >>> 77.224.14.21
> 
> >> The easiest way i found to do what you want is
> >> 
> >>   $ tr . s < test.in | sort -nuts -k1 -k2 -k3 -k4 | tr s .
> >> 
> >> nuts in the Karakorum!
> 
> > sort -V to the rescue!
> 
> Thank you fort pointing that out!
> 
> Looking at the description of -V in our sort(1) manual revealed that
> that description is completely useless:
> 
>  * The pseudo syntax PREFIX VERSION SUFFIX is highly misleading
>    because the PREFIX is optional, and additional infixes are
>    accepted in the middle of the version numbers.
>  * Leading zeros are not ignored.  For example, if you sort
>    the strings "1" and "01", the order is reversed, sorting "01" before "1".
>  * The phrase "If an input string does not match the pattern"
>    is utterly confusing because no pattern is mentioned at all.
>  * It is unclear what "the byte compare function" is supposed to be,
>    and looking at the code, i conclude that such a function likely
>    does not even exist.
>  * The description doesn't say at all how the comparison works.
> 
> So basically, i wrote a new description from scratch, see the patch below.
> 
> I think it is likely that my new description is not entirely accurate.
> Very probably, there are some edge cases that might not agree with my
> description, or where my description at least leaves ambiguity.
> 
> I think that problem is hard to avoid for two reasons.
> 
> First, our implementation makes huge numbers of choices where it is hard
> to say which of these are intentional features and which are accidental
> implementation details.  Given that the authors of the code chose to not
> document at all what the intended behaviour of the code is, neither
> under- nor over-documenting feels easy to avoid.
> 
> Even more importantly, i judge the code quality of sort(1)
> as positively atrocious.  Because i have done some documentation
> work in libcrypto, i'm quite used to documenting low-quality code,
> but i must say the code in /usr/src/usr.bin/sort does not feel better
> than typical OpenSSL quality, i.e. it is significantly below OpenBSD
> quality standards:
> 
>  - The architecture of the code lacks logical structure.
>    Instead of structuring the code in a way that reflects the
>    purposes and algorithms and makes reasoning about and auditing
>    the code easy for human beings, there are interminable cascades
>    of special cases and interminable chains of wrappers, hiding
>    the places where the actual work is being done.
> 
>  - The code does not use POSIX libc APIs but instead uses idiosyncratic
>    wrappers throughout, even for the most fundamental <string.h>
>    and <ctype.h> stuff, making it very hard to understand on the
>    micro just like on the macro level.
> 
> Thouhts?  OKs?
>   Ingo


Much better.  Thanks for improving this part as well.  Just a couple of
nits.

> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.66 sort.1
> --- sort.1    15 Mar 2025 18:20:04 -0000      1.66
> +++ sort.1    26 Mar 2025 12:49:11 -0000
> @@ -196,14 +196,31 @@
>  .It Fl r , Fl Fl reverse
>  Sort in reverse order.
>  .It Fl V , Fl Fl version-sort
> -Sort version numbers.
> -The input lines are treated as file names in form
> -PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
> -"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
> -The files are compared by their prefixes and versions (leading
> -zeros are ignored in version numbers, see example below).
> -If an input string does not match the pattern, then it is compared
> -using the byte compare function.
> +This option is intended to sort strings that contain version numbers,
> +but it can be used for other purposes as well, for example to sort
> +IPv4 addresses in dotted quad notation.
> +.Pp
> +When comparing two strings, both strings are split into substrings
> +such that the first and every odd-numbered substring
> +consists of non-digit characters only,

s/consists/consist/

> +while every even-numbered substring consists of digits only.
> +These substrings are compared in turn from left to right
> +until a difference is found.
> +The first substring can be empty; all others cannot.
> +.Pp
> +Non-digit substrings are compared alphabetically, with upper case
> +letters sorting before lower case letters, letters sorting before
> +non-letters, and non-letters sorting in
> +.Xr ascii 7
> +order.

Hmm.  This is wrong as soon as you step foot into Unicode.  I don't
think it hurts to be a bit more vague here.

> +Substrings consisting of digits are compared as integer numbers.
> +.Pp
> +At the end of each string, zero or more suffixes that start with a dot,
> +consist only of letters, digits, and tilde characters, and do not
> +start with a digit are ignored, equivalent to the regular expression
> +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
> +This is intended for ignoring filename suffixes such as
> +.Dq .tar.bz2 .

Maybe .tgz for consistency with the example below (and since we don't
have bzip2(1) in base)?

>  .Pp
>  For example:
>  .Bd -literal -offset indent

Maybe clarify here that the 'odd-numbered substring' is simply a dot in
the typical 'version sort' case.

Re: Unexpected behavior of sort -hu

Reply via email to