On Sun, Mar 30, 2025 at 04:02:04PM +0200, Ingo Schwarze wrote:
> Hello Pascal,
>
> Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:
>
> > [...]
> > I probably should have explained myself a little better. The problem
> > with your explanation is that the terms "upper case" and "lower case"
> > letters are too broad and are not limited to ASCII. A Greek upper case
> > alpha is an upper case letter, and is certainly not sorted before a
> > lower case ASCII 'a', even if LC_COLLATE were implemented (I think).
>
> I don't see a problem here. Our sort(1) manual page already says:
>
>   STANDARDS
>     The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
>     specification, except that it ignores the user's locale(1) and always
>     assumes LC_ALL=C.
>
> So it's clear that we are talking about ASCII characters only and not
> about Greek letters.
>
> > So I would avoid using these classifications entirely.
>
> That would be possible with option 2 below.
>
>
> > On Thu, 27 Mar 2025 13:55:29 +0100, Ingo Schwarze wrote:
>
> >> Do you have an idea of what we might say to achieve a reasonable
> >> level of vagueness?
>
> > The first paragraph of DESCRIPTION uses the word 'lexicographically' to
> > describe the default comparison mode,
>
> That default sorting order is selected by get_sort_func() in coll.c
> as wstrcoll(), which defers to bwscoll() in bwstring.c, which compares
> by memcmp(3):
>
>   $ printf "|\na\n_\nA\n=\n" | sort | perl -ne 'chomp;print'
>   =A_a|
>
> That behaviour is definitely right because POSIX says in
> https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html
>
>   Comparisons ... shall be performed using the collating sequence
>   of the current locale.
>
> and
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html#tag_07_03_02_06
> requires the collating order of the POSIX locale to follow ASCII.
>
> Calling the collating order of the current locale "lexicographically"
> is maybe OK, too. Maybe. Or maybe it is slightly confusing because
> POSIX uses the term "lexicographic" only for tsort(1), ctags(1) and cflow(1)
> but not in relation to anything involving locales.
>
> Using the same term for -V seems problematic to me because -V does *not*
> use the same order, *not* the collating sequence of the POSIX locale:
>
>   $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
>   Aa=_|
>
> Arguably, that is even more lexicographic than the POSIX collating sequence.
> What a mess. Either way, using the same word for two different orderings
> is not good.
>
> > perhaps intentionally not going into the details anywhere in the page.
>
> I doubt that whoever wrote our sort(1) manual - or the associated code,
> for that matter - did anything out of wisdom or restraint. The much
> more likely explanation seems to be thoughtlessness and sloppiness.
>
> I think we should improve the initial paragraph of the DESCRIPTION
> to avoid the term "lexicographically". It is vague and confusing
> in so far as POSIX does not define it. Introducing the proper term
> "collation sequence" would be over the top given that the concepts
> involved are very complicated and we deliberately do not support
> any of them.
>
> I think from the user perspective, it is most helpful to clearly state
> what our implementation actually does - ascii(7) ordering. In particular
> since that coincides with what POSIX requires as the default.
> We should not be vague given that POSIX requires specific behaviour.
>
> While here, let's also fix the first sentence: talking about
> sorting "by lines" only to talk about sorting "by keys" right afterwards
> is confusing. I guess what is meant is sorting "the lines". Also,
> the "and" is dubious; sorting text and binary files together isn't
> really such a great idea. Let's better regard all the files as either
> text files *or* binary files.
>
> If we put this in (OK?), after that, i see three options for -V:
>
>  1. Leave the -V text as is; it is accurate and easy to understand.
>  2. Say something like
>       in ascii(7) order, except that all letters are sorted before all
>       other characters
>  3. Say something like
>       for non-digits, the sorting order is unspecified
>
> I'd be fine with both 1. and 2. and i like 3. less.
> Saying "lexicographically" seems even worse to me than 3. because it feels
> misleading. It sounds as if it would say something of substance, but it's
> unclear what that is, and however you define "lexicographically", it's
> likely not what -V does. For example, it certainly does not match
> how we use the term "lexicographically" in strcoll(3) or strcmp(3).
>
> Yours,
>   Ingo
>
hi. comments inline:

>
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.68 sort.1
> --- sort.1 28 Mar 2025 14:35:50 -0000 1.68
> +++ sort.1 30 Mar 2025 13:19:15 -0000
> @@ -50,7 +50,7 @@
>  .Sh DESCRIPTION
>  The
>  .Nm
> -utility sorts text and binary files by lines.
> +utility sorts the lines of text or binary files.

i think the two lines are equivalent in meaning. but perhaps your
version is simpler/clearer. i suppose the question is does sort(1) sort
lines or files (or, at least, how do we want to represent that action)?

>  A line is a record separated from the subsequent record by a
>  newline (default) or NUL
>  .Ql \e0
> @@ -61,12 +61,12 @@
>  .Pc .
>  A record can contain any printable or unprintable characters.
>  Comparisons are based on one or more sort keys extracted from
> -each line of input, and are performed lexicographically,
> -according to the specified command-line options
> -that can tune the actual sorting behavior.
> -By default, if keys are not given,
> +each line according to the specified command line options.
> +By default,
>  .Nm
> -uses entire lines for comparison.
> +uses entire lines for comparison and sorts in
> +.Xr ascii 7
> +order.
>  .Pp
>  If no
>  .Ar file

i find this second hunk much easier to read and clearer.

i'm ok with your changes.
jmc
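
For readers following the thread, the two orderings quoted above can be
modelled in a few lines of Python. This is only an illustrative sketch,
not the sort(1) code: ascii_order() assumes a plain byte comparison in
the spirit of memcmp(3), and letters_first_order() is a hypothetical
model of the option 2 wording ("in ascii(7) order, except that all
letters are sorted before all other characters"). It says nothing about
how -V treats digits or version-like strings, and the function names
are invented for this example.

```python
# Illustration only: models the two orderings discussed in this thread
# for ASCII input. It is not the sort(1) implementation.


def is_ascii_letter(ch):
    """True for A-Z and a-z only; non-ASCII letters are deliberately excluded."""
    return ("A" <= ch <= "Z") or ("a" <= ch <= "z")


def ascii_order(lines):
    """Sort by raw byte values, as a memcmp(3)-style comparison would."""
    return sorted(lines, key=lambda s: s.encode("ascii"))


def letters_first_order(lines):
    """Hypothetical model of option 2: ASCII letters sort before all other
    characters; within each group, ascii(7) order applies."""
    return sorted(
        lines,
        key=lambda s: [(not is_ascii_letter(c), ord(c)) for c in s],
    )


# Same five one-character lines as in the printf examples quoted above.
sample = ["|", "a", "_", "A", "="]

print("".join(ascii_order(sample)))          # =A_a|
print("".join(letters_first_order(sample)))  # Aa=_|
```

For the five sample lines, the first call reproduces the default output
quoted above and the second reproduces the quoted -V output. Agreement
on this one sample does not show that the model matches -V in general.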