Hi Ingo,

On Thu, 27 Mar 2025 13:55:29 +0100, Ingo Schwarze wrote:
> Hello Pascal,
>
> Pascal Stumpf wrote on Wed, Mar 26, 2025 at 08:39:15PM +0100:
> > On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote:
>
> >> +When comparing two strings, both strings are split into substrings
> >> +such that the first and every odd-numbered substring
> >> +consists of non-digit characters only,
>
> > s/consists/consist/
>
> I applied this correction before committing.
>
> I did not use Pascal's later suggestion of "each consist" because
> i tend to agree with Jason's final conclusion that "consist is fine".
>
> I intended the wording "the first and every odd-numbered" to signal
> 1-based numbering, but now i worry that indication is not unambiguous
> because the wording fails to call the first one "odd-numbered".

I don't really see an issue here. The first substring being odd-numbered
is self-explanatory. I think the current text is fine, or you might go
with jmc@'s suggestion.

> The following wording tweak would resolve both issues, both making
> 1-based numbering explicit and avoiding the singular/plural quibble:
>
>   such that every odd-numbered substring including the first one
>   consists of non-digit characters only,
>
> >> +while every even-numbered substring consists of digits only.
> >> +These substrings are compared in turn from left to right
> >> +until a difference is found.
> >> +The first substring can be empty; all others cannot.
> >> +.Pp
> >> +Non-digit substrings are compared alphabetically, with upper case
> >> +letters sorting before lower case letters, letters sorting before
> >> +non-letters, and non-letters sorting in
> >> +.Xr ascii 7
> >> +order.
>
> > Hmm. This is wrong as soon as you step foot into Unicode. I don't
> > think it hurts to be a bit more vague here.
>
> I don't think it's realistic or even a desirable goal to ever
> implement LC_COLLATE support in our libc. The whole concept, even
> though standardized in POSIX, is nothing but an instance of horrifically
> complicated overengineering. I talked to bapt@ about it during EuroBSDCon
> in Beograd (shortly after he had implemented that nightmare for FreeBSD)
> and he kept swearing about it like a trooper. Given that FreeBSD is not
> really known for keeping stuff simple or shunning excessive complication,
> his rage was quite telling.
>
> That said, we are talking about this call chain here:
>
>   versioncoll [coll.c]
>   vcmp [vsort.c]
>   cmpversions [vsort.c]
>   cmp_chars [vsort.c]
>
> Unlike much of the other code in our sort(1), which contains unused
> rigging for wchar_t handling in many places, none of this call chain
> contains anything to handle Unicode, not even disabled dummy code.
> Even if you would enable wchar_t support in our sort, ignoring my
> screaming, none of this code chain would do any Unicode handling,
> it would continue to do what i described, explicitly using its own,
> hand-rolled re-implementation of single-byte isalpha(3).
>
> So short of saying something like
>
>   It is unspecified how the non-digit substrings are compared.
>
> i can't think of a way to make this less specific, and i have no
> idea what the intended behaviour of -V would be in the presence
> of LC_COLLATE support.

I absolutely agree that implementing LC_COLLATE would probably not be
desirable. I probably should have explained myself a little better.
The problem with your explanation is that the terms "upper case letters"
and "lower case letters" are too broad and are not limited to ASCII. A
Greek upper case alpha is an upper case letter, and is certainly not
sorted before a lower case ASCII 'a', even if LC_COLLATE were
implemented (I think). So I would avoid using these classifications
entirely.

> Do you have an idea of what we might say to achieve a reasonable
> level of vagueness?

The first paragraph of DESCRIPTION uses the word 'lexicographically' to
describe the default comparison mode, perhaps intentionally not going
into the details anywhere in the page. I propose something like this:

Index: sort.1
===================================================================
RCS file: /home/cvs/src/usr.bin/sort/sort.1,v
diff -u -p -r1.67 sort.1
--- sort.1 27 Mar 2025 11:43:58 -0000 1.67
+++ sort.1 27 Mar 2025 18:27:38 -0000
@@ -208,12 +208,8 @@ These substrings are compared in turn fr
 until a difference is found.
 The first substring can be empty; all others cannot.
 .Pp
-Non-digit substrings are compared alphabetically, with upper case
-letters sorting before lower case letters, letters sorting before
-non-letters, and non-letters sorting in
-.Xr ascii 7
-order.
-Substrings consisting of digits are compared as integer numbers.
+Substrings consisting of digits are compared as integer numbers, while
+all other substrings are compared lexicographically.
 .Pp
 At the end of each string, zero or more suffixes that start with a dot,
 consist only of letters, digits, and tilde characters, and do not

> >> +Substrings consisting of digits are compared as integer numbers.
> >> +.Pp
> >> +At the end of each string, zero or more suffixes that start with a dot,
> >> +consist only of letters, digits, and tilde characters, and do not
> >> +start with a digit are ignored, equivalent to the regular expression
> >> +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
> >> +This is intended for ignoring filename suffixes such as
> >> +.Dq .tar.bz2 .
>
> > Maybe .tgz for consistency with the example below
>
> I slightly prefer demonstrating here that the suffix can contain digits,
> in particular since the presence of digits in file name extensions can
> result in confusion when people apply the suffix rule and the rule
> about digit/non-digit splitting in the wrong order.

Ah, very good point. OK.

> Besides, when you have multiple examples, i don't consider it a goal
> to have all examples demonstrate the same aspects. To the contrary,
> having the examples cover as many different aspects as possible
> feels preferable.
>
> > (and since we don't have bzip2(1) in base)?
>
> I don't think that's a problem. The base system is certainly
> equipped to handle strings containing the substring "bz2", and even
> to store files with a .bz2 file name extension.
>
> Besides, i doubt anyone uses OpenBSD without using ports, and use
> of bzip2(1) is widespread in ports, so mentioning it in an example
> does not feel exotic at all.
>
> >> .Pp
> >> For example:
> >> .Bd -literal -offset indent
>
> > Maybe clarify here that the 'odd-numbered substring' is simply a dot in
> > the typical 'version sort' case.
>
> Like in the patch below?
>
> It feels slightly wordy, any idea how to bring the point across more
> concisely?
>
> Yours,
>   Ingo
>
>
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.67 sort.1
> --- sort.1 27 Mar 2025 11:43:58 -0000 1.67
> +++ sort.1 27 Mar 2025 12:46:22 -0000
> @@ -201,8 +201,8 @@
>  IPv4 addresses in dotted quad notation.
>  .Pp
>  When comparing two strings, both strings are split into substrings
> -such that the first and every odd-numbered substring
> -consist of non-digit characters only,
> +such that every odd-numbered substring including the first one
> +consists of non-digit characters only,
>  while every even-numbered substring consists of digits only.
>  These substrings are compared in turn from left to right
>  until a difference is found.
> @@ -222,7 +222,11 @@
>  This is intended for ignoring filename suffixes such as
>  .Dq .tar.bz2 .
>  .Pp
> -For example:
> +In the following example, the first substring is
> +.Qq sort\-
> +and the other odd-numbered substrings are
> +.Qq \&.
> +each:
>  .Bd -literal -offset indent
>  $ ls sort* | sort -V
>  sort-1.022.tgz

That's OK with me.

-Pascal
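
P.S. For checking the wording against the example, the suffix rule and
the digit/non-digit splitting can be sketched in a few lines of Python.
This is only a reading of the manual text quoted above, not a
transcription of vsort.c. The names SUFFIX and split_version are
hypothetical, and the pattern is the one from the manual with mdoc's
"\e" written as a plain backslash and anchored at the end of the string.

```python
import re

# Suffix rule as quoted in the manual text, anchored at end of string.
# Assumption: "\e" in the mdoc source stands for a literal backslash.
SUFFIX = re.compile(r"(\.([A-Za-z~][A-Za-z0-9~]*)?)*$")


def split_version(s):
    """Return the substrings of s, 1-based: odd = non-digits, even = digits.

    Hypothetical helper: suffixes are stripped first, then the rest is
    split into alternating runs of non-digits and ASCII digits.
    """
    # The leftmost position where the suffix pattern reaches the end of
    # the string marks where the ignorable suffixes begin.
    stem = s[:SUFFIX.search(s).start()]
    parts = re.findall(r"[0-9]+|[^0-9]+", stem)
    # The first substring is always a non-digit one, so it may be empty.
    if not parts or parts[0][0] in "0123456789":
        parts.insert(0, "")
    return parts


if __name__ == "__main__":
    for name in ("sort-1.022.tgz", "sort-1.23.tar.bz2", "1.2"):
        print(name, split_version(name))
```

Under this reading, sort-1.022.tgz splits into "sort-", "1", ".", "022",
which agrees with the substrings named in the patch, and applying the
suffix rule before the splitting keeps the "2" of ".bz2" out of the
comparison.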