On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote: > Hi Pascal, > > Pascal Stumpf wrote on Tue, Mar 18, 2025 at 04:59:40PM +0100: > > On Sat, 15 Mar 2025 20:00:42 +0100, Ingo Schwarze wrote: > >> Van Dung Ha wrote on Wed, Mar 12, 2025 at 02:25:02PM +0100: > > >>> Recently, I was parsing a few IP addresses from the snort logs > >>> to populate a pf table and encountered something counter-intuitive. > >>> Here is an example source list. > >>> [...] > >>> 77.224.14.2 > >>> 77.224.14.21 > >>> 77.224.14.18 > >>> 77.224.14.21 > > >> The easiest way i found to do what you want is > >> > >> $ tr . s < test.in | sort -nuts -k1 -k2 -k3 -k4 | tr s . > >> > >> nuts in the Karakorum! > > > sort -V to the rescue! > > Thank you fort pointing that out! > > Looking at the description of -V in our sort(1) manual revealed that > that description is completely useless: > > * The pseudo syntax PREFIX VERSION SUFFIX is highly misleading > because the PREFIX is optional, and additional infixes are > accepted in the middle of the version numbers. > * Leading zeros are not ignored. For example, if you sort > the strings "1" and "01", the order is reversed, sorting "01" before "1". > * The phrase "If an input string does not match the pattern" > is utterly confusing because no pattern is mentioned at all. > * It is unclear what "the byte compare function" is supposed to be, > and looking at the code, i conclude that such a function likely > does not even exist. > * The description doesn't say at all how the comparison works. > > So basically, i wrote a new description from scratch, see the patch below. > > I think it is likely that my new description is not entirely accurate. > Very probably, there are some edge cases that might not agree with my > description, or where my description at least leaves ambiguity. > > I think that problem is hard to avoid for two reasons. > > First, our implementation makes huge numbers of choices where it is hard > to say which of these are intentional features and which are accidental > implementation details. Given that the authors of the code chose to not > document at all what the intended behaviour of the code is, neither > under- nor over-documenting feels easy to avoid. > > Even more importantly, i judge the code quality of sort(1) > as positively atrocious. Because i have done some documentation > work in libcrypto, i'm quite used to documenting low-quality code, > but i must say the code in /usr/src/usr.bin/sort does not feel better > than typical OpenSSL quality, i.e. it is significantly below OpenBSD > quality standards: > > - The architecture of the code lacks logical structure. > Instead of structuring the code in a way that reflects the > purposes and algorithms and makes reasoning about and auditing > the code easy for human beings, there are interminable cascades > of special cases and interminable chains of wrappers, hiding > the places where the actual work is being done. > > - The code does not use POSIX libc APIs but instead uses idiosyncratic > wrappers throughout, even for the most fundamental <string.h> > and <ctype.h> stuff, making it very hard to understand on the > micro just like on the macro level. > > Thouhts? OKs? > Ingo
Much better. Thanks for improving this part as well. Just a couple of nits. > Index: sort.1 > =================================================================== > RCS file: /cvs/src/usr.bin/sort/sort.1,v > diff -u -r1.66 sort.1 > --- sort.1 15 Mar 2025 18:20:04 -0000 1.66 > +++ sort.1 26 Mar 2025 12:49:11 -0000 > @@ -196,14 +196,31 @@ > .It Fl r , Fl Fl reverse > Sort in reverse order. > .It Fl V , Fl Fl version-sort > -Sort version numbers. > -The input lines are treated as file names in form > -PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression > -"(\.([A-Za-z~][A-Za-z0-9~]*)?)*". > -The files are compared by their prefixes and versions (leading > -zeros are ignored in version numbers, see example below). > -If an input string does not match the pattern, then it is compared > -using the byte compare function. > +This option is intended to sort strings that contain version numbers, > +but it can be used for other purposes as well, for example to sort > +IPv4 addresses in dotted quad notation. > +.Pp > +When comparing two strings, both strings are split into substrings > +such that the first and every odd-numbered substring > +consists of non-digit characters only, s/consists/consist/ > +while every even-numbered substring consists of digits only. > +These substrings are compared in turn from left to right > +until a difference is found. > +The first substring can be empty; all others cannot. > +.Pp > +Non-digit substrings are compared alphabetically, with upper case > +letters sorting before lower case letters, letters sorting before > +non-letters, and non-letters sorting in > +.Xr ascii 7 > +order. Hmm. This is wrong as soon as you step foot into Unicode. I don't think it hurts to be a bit more vague here. > +Substrings consisting of digits are compared as integer numbers. > +.Pp > +At the end of each string, zero or more suffixes that start with a dot, > +consist only of letters, digits, and tilde characters, and do not > +start with a digit are ignored, equivalent to the regular expression > +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*". > +This is intended for ignoring filename suffixes such as > +.Dq .tar.bz2 . Maybe .tgz for consistency with the example below (and since we don't have bzip2(1) in base)? > .Pp > For example: > .Bd -literal -offset indent Maybe clarify here that the 'odd-numbered substring' is simply a dot in the typical 'version sort' case.