Re: Unexpected behavior of sort -hu

Pascal Stumpf Fri, 04 Apr 2025 12:28:23 -0700

Hi Ingo,

On Thu, 27 Mar 2025 13:55:29 +0100, Ingo Schwarze wrote:
> Hello Pascal,
> 
> Pascal Stumpf wrote on Wed, Mar 26, 2025 at 08:39:15PM +0100:
> > On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote:
> 
> >> +When comparing two strings, both strings are split into substrings
> >> +such that the first and every odd-numbered substring
> >> +consists of non-digit characters only,
> 
> > s/consists/consist/
> 
> I applied this correction before committing.
> 
> I did not use Pascal's later suggestion of "each consist" because
> i tend to agree with Jason's final conclusion that "consist is fine".
> 
> I intended the wording "the first and every odd-numbered" to signal
> 1-based numbering, but now i worry that indication is not unambiguous
> because the wording fails to call the first one "odd-numbered".


I don't really see an issue here.  The first string being odd-numbered
is self-explanatory.  I think the current text is fine, or you might go
with jmc@'s suggestion.

> The following wording tweak would resolve both issues, both making
> 1-based numbering explicit and avoiding the singular/plural quibble:
> 
>   such that every odd-numbered substring including the first one
>   consists of non-digit characters only,
> 
> >> +while every even-numbered substring consists of digits only.
> >> +These substrings are compared in turn from left to right
> >> +until a difference is found.
> >> +The first substring can be empty; all others cannot.
> >> +.Pp
> >> +Non-digit substrings are compared alphabetically, with upper case
> >> +letters sorting before lower case letters, letters sorting before
> >> +non-letters, and non-letters sorting in
> >> +.Xr ascii 7
> >> +order.
> 
> > Hmm.  This is wrong as soon as you set foot in Unicode.  I don't
> > think it hurts to be a bit more vague here.
> 
> I don't think it's realistic or even a desirable goal to ever
> implement LC_COLLATE support in our libc.  The whole concept, even
> though standardized in POSIX, is nothing but an instance of horrifically
> complicated overengineering.  I talked to bapt@ about it during EuroBSDCon
> in Beograd (shortly after he had implemented that nightmare for FreeBSD)
> and he kept swearing about it like a trooper.  Given that FreeBSD is not
> really known for keeping stuff simple or shunning excessive complication,
> his rage was quite telling.
> 
> That said, we are talking about this call chain here:
> 
>   versioncoll [coll.c]
>   vcmp [vsort.c]
>   cmpversions [vsort.c]
>   cmp_chars [vsort.c]
> 
> Unlike much of the other code in our sort(1), which contains unused
> rigging for wchar_t handling in many places, none of this call chain
> contains anything to handle Unicode, not even disabled dummy code.
> Even if you enabled wchar_t support in our sort, ignoring my
> screaming, none of this call chain would do any Unicode handling;
> it would continue to do what i described, explicitly using its own,
> hand-rolled re-implementation of single-byte isalpha(3).
> 
> So short of saying something like
> 
>   It is unspecified how the non-digit substrings are compared.
> 
> i can't think of a way to make this less specific, and i have no
> idea what the intended behaviour of -V would be in the presence
> of LC_COLLATE support.

I fully agree that implementing LC_COLLATE is probably not
desirable.

I probably should have explained myself a little better.  The problem
with your explanation is that the terms "upper case" and "lower case"
letters are too broad and are not limited to ASCII.  A Greek upper case
alpha is an upper case letter, and is certainly not sorted before a
lower case ASCII 'a', even if LC_COLLATE were implemented (I think).

So I would avoid using these classifications entirely.
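
To make the point concrete, here is a minimal Python illustration (not
code from sort(1)).  It assumes the strings are compared byte by byte in
their UTF-8 encoding, which is how a single-byte implementation would see
them; the variable names are made up for this example.

```python
# Illustration only: compare a Greek capital alpha with ASCII 'a'.
greek_upper_alpha = "\u0391"   # GREEK CAPITAL LETTER ALPHA
ascii_lower_a = "a"

# Unicode classifies the Greek letter as upper case ...
is_upper = greek_upper_alpha.isupper()

# ... yet its UTF-8 encoding (0xCE 0x91) sorts after 'a' (0x61)
# when the two strings are compared byte by byte.
sorts_after = greek_upper_alpha.encode("utf-8") > ascii_lower_a.encode("utf-8")

print(is_upper, sorts_after)
```

Both values are true, so "upper case letters sort before lower case
letters" only holds if "letters" is read as ASCII letters.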

> Do you have an idea of what we might say to achieve a reasonable
> level of vagueness?

The first paragraph of DESCRIPTION uses the word 'lexicographically' to
describe the default comparison mode, perhaps intentionally not going
into the details anywhere in the page.

I propose something like this:


Index: sort.1
===================================================================
RCS file: /home/cvs/src/usr.bin/sort/sort.1,v
diff -u -p -r1.67 sort.1
--- sort.1      27 Mar 2025 11:43:58 -0000      1.67
+++ sort.1      27 Mar 2025 18:27:38 -0000
@@ -208,12 +208,8 @@ These substrings are compared in turn fr
 until a difference is found.
 The first substring can be empty; all others cannot.
 .Pp
-Non-digit substrings are compared alphabetically, with upper case
-letters sorting before lower case letters, letters sorting before
-non-letters, and non-letters sorting in
-.Xr ascii 7
-order.
-Substrings consisting of digits are compared as integer numbers.
+Substrings consisting of digits are compared as integer numbers, while
+all other substrings are compared lexicographically.
 .Pp
 At the end of each string, zero or more suffixes that start with a dot,
 consist only of letters, digits, and tilde characters, and do not



> >> +Substrings consisting of digits are compared as integer numbers.
> >> +.Pp
> >> +At the end of each string, zero or more suffixes that start with a dot,
> >> +consist only of letters, digits, and tilde characters, and do not
> >> +start with a digit are ignored, equivalent to the regular expression
> >> +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
> >> +This is intended for ignoring filename suffixes such as
> >> +.Dq .tar.bz2 .
> 
> > Maybe .tgz for consistency with the example below
> 
> I slightly prefer demonstrating here that the suffix can contain digits,
> in particular since the presence of digits in file name extensions can
> result in confusion when people apply the suffix rule and the rule
> about digit/non-digit splitting in the wrong order.

Ah, very good point.  OK.
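
A minimal Python sketch of that ordering issue, assuming the regular
expression from the manual text is applied at the end of the string
before any digit/non-digit splitting.  The helper name strip_suffixes is
hypothetical and this is not the sort(1) implementation.

```python
import re

# The regular expression from the manual text, anchored at the end
# of the string ("\e." in mdoc source renders as a backslash and a dot).
SUFFIXES = re.compile(r"(?:\.(?:[A-Za-z~][A-Za-z0-9~]*)?)*$")

def strip_suffixes(name):
    """Return name without the trailing suffixes matched by SUFFIXES."""
    match = SUFFIXES.search(name)   # always matches, possibly the empty string
    return name[:match.start()]

print(strip_suffixes("sort-1.022.tgz"))
print(strip_suffixes("foo-1.0.tar.bz2"))
print(strip_suffixes("foo-1.2.3"))
```

This prints "sort-1.022", "foo-1.0" and "foo-1.2.3".  The ".bz2" suffix
is dropped as a whole; had the digit/non-digit split been applied first,
the "2" in "bz2" would have become a digit substring of its own.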

> Besides, when you have multiple examples, i don't consider it a goal
> to have all examples demonstrate the same aspects.  To the contrary,
> having the examples cover as many different aspects as possible
> feels preferable.
> 
> > (and since we don't have bzip2(1) in base)?
> 
> I don't think that's a problem.  The base system is certainly
> equipped to handle strings containing the substring "bz2", and even
> to store files with a .bz2 file name extension.
> 
> Besides, i doubt anyone uses OpenBSD without using ports, and use
> of bzip2(1) is widespread in ports, so mentioning it in an example
> does not feel exotic at all.
> 
> >>  .Pp
> >>  For example:
> >>  .Bd -literal -offset indent
> 
> > Maybe clarify here that the 'odd-numbered substring' is simply a dot in
> > the typical 'version sort' case.
> 
> Like in the patch below?
> 
> It feels slightly wordy, any idea how to bring the point across more
> concisely?
> 
> Yours,
>   Ingo
> 
> 
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.67 sort.1
> --- sort.1    27 Mar 2025 11:43:58 -0000      1.67
> +++ sort.1    27 Mar 2025 12:46:22 -0000
> @@ -201,8 +201,8 @@
>  IPv4 addresses in dotted quad notation.
>  .Pp
>  When comparing two strings, both strings are split into substrings
> -such that the first and every odd-numbered substring
> -consist of non-digit characters only,
> +such that every odd-numbered substring including the first one
> +consists of non-digit characters only,
>  while every even-numbered substring consists of digits only.
>  These substrings are compared in turn from left to right
>  until a difference is found.
> @@ -222,7 +222,11 @@
>  This is intended for ignoring filename suffixes such as
>  .Dq .tar.bz2 .
>  .Pp
> -For example:
> +In the following example, the first substring is
> +.Qq sort\-
> +and the other odd-numbered substrings are
> +.Qq \&.
> +each:
>  .Bd -literal -offset indent
>  $ ls sort* | sort -V
>  sort-1.022.tgz

That's OK with me.


-Pascal
