Re: Unexpected behavior of sort -hu

Jason McIntyre Sun, 30 Mar 2025 09:25:24 -0700

On Sun, Mar 30, 2025 at 04:02:04PM +0200, Ingo Schwarze wrote:
> Hello Pascal,
> 
> Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:
> 
> [...]
> > I probably should have explained myself a little better.  The problem
> > with your explanation is that the terms "upper case" and "lower case"
> > letters are too broad and are not limited to ASCII.  A Greek upper case
> > alpha is an upper case letter, and is certainly not sorted before a
> > lower case ASCII 'a', even if LC_COLLATE were implemented (I think).
> 
> I don't see a problem here.  Our sort(1) manual page already says:
> 
>   STANDARDS
>      The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
>      specification, except that it ignores the user's locale(1) and always
>      assumes LC_ALL=C.
> 
> So it's clear that we are talking about ASCII characters only and not
> about Greek letters.
> 
> > So I would avoid using these classifications entirely.
> 
> That would be possible with option 2 below.
> 
> 
> > On Thu, 27 Mar 2025 13:55:29 +0100, Ingo Schwarze wrote:
> 
> >> Do you have an idea of what we might say to achieve a reasonable
> >> level of vagueness?
> 
> > The first paragraph of DESCRIPTION uses the word 'lexicographically' to
> > describe the default comparison mode,
> 
> That default sorting order is selected by get_sort_func() in coll.c
> as wstrcoll(), which defers to bwscoll() in bwstring.c, which compares
> by memcmp(3):
> 
>    $ printf "|\na\n_\nA\n=\n" | sort | perl -ne 'chomp;print'
>   =A_a|
> 
> That behaviour is definitely right because POSIX says in
> https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html
> 
>   Comparisons ... shall be performed using the collating sequence
>   of the current locale.
> 
> and
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html#tag_07_03_02_06
> requires the collating order of the POSIX locale to follow ASCII.
> 
> Calling the collating order of the current locale "lexicographically"
> is maybe OK, too.  Maybe.  Or maybe it is slightly confusing because
> POSIX uses the term "lexicographic" only for tsort(1), ctags(1) and cflow(1)
> but not in relation to anything involving locales.
> 
> Using the same term for -V seems problematic to me because -V does *not*
> use the same order, *not* the collating sequence of the POSIX locale:
> 
>    $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
>   Aa=_|
> 
> Arguably, that is even more lexicographic than the POSIX collating sequence.
> What a mess.  Either way, using the same word for two different orderings
> is not good.
> 
> > perhaps intentionally not going into the details anywhere in the page.
> 
> I doubt that whoever wrote our sort(1) manual - or the associated code,
> for that matter - did anything out of wisdom or restraint.  The much
> more likely explanation seems to be thoughtlessness and sloppiness.
> 
> I think we should improve the initial paragraph of the DESCRIPTION
> to avoid the term "lexicographically".  It is vague and confusing
> in so far as POSIX does not define it.  Introducing the proper term
> "collation sequence" would be over the top given that the concepts
> involved are very complicated and we deliberately do not support
> any of them.
> 
> I think from the user perspective, it is most helpful to clearly state
> what our implementation actually does - ascii(7) ordering.  In particular
> since that coincides with what POSIX requires as the default.
> We should not be vague given that POSIX requires specific behaviour.
> 
> While here, let's also fix the first sentence: talking about
> sorting "by lines" only to talk about sorting "by keys" right afterwards
> is confusing.  I guess what is meant is sorting "the lines".  Also,
> the "and" is dubious; sorting text and binary files together isn't
> really such a great idea.  Let's better regard all the files as either
> text files *or* binary files.
> 
> If we put this in (OK?), after that, i see three options for -V:
> 
>  1. Leave the -V text as is; it is accurate and easy to understand.
>  2. Say something like
>     in ascii(7) order, except that all letters are sorted before all
>     other characters
>  3. Say something like
>     for non-digits, the sorting order is unspecified
> 
> I'd be fine with both 1. and 2. and i like 3. less.
> Saying "lexicographically" seems even worse to me than 3. because it feels
> misleading.  It sounds as if it would say something of substance, but it's
> unclear what that is, and however you define "lexicographically", it's
> likely not what -V does.  For example, it certainly does not match
> how we use the term "lexicographically" in strcoll(3) or strcmp(3).
> 
> Yours,
>   Ingo
>


hi. comments inline:

> 
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.68 sort.1
> --- sort.1    28 Mar 2025 14:35:50 -0000      1.68
> +++ sort.1    30 Mar 2025 13:19:15 -0000
> @@ -50,7 +50,7 @@
>  .Sh DESCRIPTION
>  The
>  .Nm
> -utility sorts text and binary files by lines.
> +utility sorts the lines of text or binary files.

i think the two lines are equivalent in meaning. but perhaps your
version is simpler/clearer. i suppose the question is does sort(1) sort
lines or files (or, at least, how do we want to represent that action)?

>  A line is a record separated from the subsequent record by a
>  newline (default) or NUL
>  .Ql \e0
> @@ -61,12 +61,12 @@
>  .Pc .
>  A record can contain any printable or unprintable characters.
>  Comparisons are based on one or more sort keys extracted from
> -each line of input, and are performed lexicographically,
> -according to the specified command-line options
> -that can tune the actual sorting behavior.
> -By default, if keys are not given,
> +each line according to the specified command line options.
> +By default,
>  .Nm
> -uses entire lines for comparison.
> +uses entire lines for comparison and sorts in
> +.Xr ascii 7
> +order.
>  .Pp
>  If no
>  .Ar file

i find this second hunk much easier to read and clearer.
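
for anyone reading along, a rough sketch of what "ascii(7) order" means
for the sample input used earlier in the thread, next to a model of the
"letters before all other characters" wording from option 2. LC_ALL=C is
set explicitly here as an assumption, so that sort implementations which
honour the locale compare byte-wise as well. The awk tagging step and the
variable names are illustrative only; they imitate the ordering described
above and are not how -V is implemented.

```shell
# Sample input from the thread, one character per line.
input=$(printf '|\na\n_\nA\n=\n')

# Byte-wise (ascii(7)) order. LC_ALL=C is an explicit assumption so that
# locale-aware sort implementations behave the same way.
ascii_order=$(printf '%s\n' "$input" | LC_ALL=C sort | tr -d '\n')

# Model of the "letters before all other characters" wording (option 2):
# tag lines starting with an ASCII letter with 0 and all others with 1,
# sort the tagged lines byte-wise, then strip the tag again.
# This only imitates the described ordering; it is not the -V code path.
letters_first=$(printf '%s\n' "$input" |
    LC_ALL=C awk '{ tag = ($0 ~ /^[A-Za-z]/) ? 0 : 1; print tag "\t" $0 }' |
    LC_ALL=C sort | cut -f 2- | tr -d '\n')

printf '%s\n%s\n' "$ascii_order" "$letters_first"
```

The first line printed is =A_a| and the second is Aa=_|, matching the two
orderings quoted above for plain sort and for sort -V.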

i'm ok with your changes.

jmc
