Re: Unexpected behavior of sort -hu

Jason McIntyre Thu, 13 Mar 2025 10:06:57 -0700

hi. reads fine to me, ingo. ok.
jmc


On 12 March 2025 19:33:35 GMT, Ingo Schwarze <schwa...@usta.de> wrote:
>Hi Stuart, hi Mark, hi Jason,
>
>Stuart Henderson wrote on Wed, Mar 12, 2025 at 03:11:27PM +0000:
>> On 2025/03/12 14:54, Mark Kettenis wrote:
>
>>> Well, that makes some sort of sense if you interpret the strings as
>>> floating point numbers and ignore everything after as garbage.
>
>> GNU's implementation of sort behaves exactly the same with -h and -n,
>> their manual says "output only the first of an equal run".
>> 
>> posix says "suppress all but one in each set of lines having equal
>> keys", and their definition of -n fits into that: 
>> 
>>     Restrict the sort key to an initial numeric string, consisting
>>     of optional <blank> characters, optional <hyphen-minus> character,
>>     and zero or more digits with an optional radix character and
>>     thousands separators (as defined in the current locale), which
>>     shall be sorted by arithmetic value. An empty digit string shall
>>     be treated as zero. Leading zeros and signs on zeros shall not
>>     affect ordering.
>> 
>>     https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html
>> 
>> I think our docs could be improved,
>
>In general, the quality of our sort(1) manual does not feel good
>to me.  Parts of it look wordy, other parts vague.
>
>See below for a patch to improve some of the aspects related to the
>present report.  I do not claim this patch fixes all problems in the
>vicinity, but i fear rabbit holes and prefer incremental progress.
>
>> but the -n behaviour seems valid and, importantly, matches the common
>> other implementation and does not seem to violate posix.
>> 
>> -h is of course an extension, but matching -n seems right.
>
>I agree with all of that.
>
>One aspect i still don't understand is the interaction of -n with "-t.",
>for example why "sort -n -t. -k1 -k2 -k3 -k4 < test.in" doesn't
>work on the input provided by the OP (maybe parsing "." as a decimal
>point takes precedence over the "-t." making it a field separator?
>I'm not sure).  I'm not sure how the standard expects field splitting
>and number parsing to be related to each other.  But one thing at a time,
>so here comes my diff:
>
>Rationale:
>The main point is that for all the numeric sort options, we need to say
>explicitly what the key is, because the key is what the description
>of the -u option refers to.
>
>In the order of the patch, the detailed rationale is:
> 1. "implies a stable sort (see below)" is just wrong.
>    If anything, -s is above -u, not below - but saying that would
>    be useless, it's better to just point to -s directly.
> 2. Fix -g in a similar way as -n (see below).
> 3. "handles general floating points" sounds logically wrong.
>    The text isn't talking about multiple points, but multiple numbers.
> 4. Fix -h in a similar way as -n (see below).
> 5. Fix the cross reference to df(1).
> 6. Say what the key is.
> 7. Add the missing indefinite article "an optional minus sign".
> 8. Avoid needlessly turning the postpositive participle "including"
>    into a parenthetic remark.
> 9. Add the missing indefinite article to "decimal point".
> 10. Clarify that the decimal point is optional.
>
>OK?
>  Ingo
>
>
>Index: sort.1
>===================================================================
>RCS file: /cvs/src/usr.bin/sort/sort.1,v
>diff -u -r1.65 sort.1
>--- sort.1     31 Mar 2022 17:27:27 -0000      1.65
>+++ sort.1     12 Mar 2025 19:26:15 -0000
>@@ -121,7 +121,8 @@
> is not defined.
> .It Fl u , Fl Fl unique
> Unique: suppress all but one in each set of lines having equal keys.
>-This option implies a stable sort (see below).
>+This option implies
>+.Fl s .
> If used with
> .Fl C
> or
>@@ -148,24 +149,25 @@
> Consider all lowercase characters that have uppercase
> equivalents to be the same for purposes of comparison.
> .It Fl g , Fl Fl general-numeric-sort , Fl Fl sort Ns = Ns Cm general-numeric
>-Sort by general numerical value.
>+Use an initial numeric string as the key and sort numerically.
> As opposed to
> .Fl n ,
>-this option handles general floating points.
>+this option handles general floating point numbers.
> It has a more
> permissive format than that allowed by
> .Fl n
> but it has a significant performance drawback.
> .It Fl h , Fl Fl human-numeric-sort , Fl Fl sort Ns = Ns Cm human-numeric
>-Sort by numerical value, but take into account the SI suffix,
>-if present.
>+Use an initial numeric string with an optional SI suffix as the key.
> Sorts first by numeric sign (negative, zero, or
> positive); then by SI suffix (either empty, or `k' or `K', or one
> of `MGTPEZY', in that order); and finally by numeric value.
> The SI suffix must immediately follow the number.
> For example, '12345K' sorts before '1M', because M is "larger" than K.
> This sort option is useful for sorting the output of a single invocation
>-of 'df' command with
>+of a
>+.Xr df 1
>+command with
> .Fl h
> or
> .Fl H
>@@ -176,9 +178,9 @@
> Sort by month abbreviations.
> Unknown strings are considered smaller than valid month names.
> .It Fl n , Fl Fl numeric-sort , Fl Fl sort Ns = Ns Cm numeric
>-An initial numeric string, consisting of optional blank space, optional
>-minus sign, and zero or more digits (including decimal point)
>-is sorted by arithmetic value.
>+Use an initial numeric string as the key, consisting of optional
>+blank space, an optional minus sign, and zero or more digits including
>+an optional decimal point, and sort numerically.
> Leading blank characters are ignored.
> .It Fl R , Fl Fl random-sort , Fl Fl sort Ns = Ns Cm random
> Sort lines in random order.
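
The -n key described in the patch above can be modelled in a few lines of
Python. This is only a sketch under stated assumptions: the C locale, no
thousands separators, and "first line of an equal run wins" for -u, as
GNU documents it. The names numeric_key and sort_nu are hypothetical and
not taken from any sort(1) implementation.

```python
import re

# Assumed model of the -n key: optional blanks, an optional minus sign,
# and zero or more digits including an optional decimal point.
# Everything after that prefix is ignored.
_NUMERIC_PREFIX = re.compile(r"[ \t]*(-?)(\d*)(?:\.(\d*))?")


def numeric_key(line):
    """Return the arithmetic value of the initial numeric string."""
    m = _NUMERIC_PREFIX.match(line)  # always matches; every part is optional
    sign, whole, frac = m.group(1), m.group(2), m.group(3) or ""
    value = float((whole or "0") + "." + (frac or "0"))
    if value == 0:
        return 0.0  # empty digit string is zero; sign on zero is ignored
    return -value if sign else value


def sort_nu(lines):
    """Model of 'sort -n -u': stable numeric sort, one line per equal key."""
    seen = set()
    out = []
    for line in sorted(lines, key=numeric_key):  # sorted() is stable
        key = numeric_key(line)
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out


if __name__ == "__main__":
    # Both dotted strings have the key 10.0, so only the first survives.
    for line in sort_nu(["10.0.0.2", "10.0.0.1", "9.1"]):
        print(line)
```

Under this model, dotted strings such as 10.0.0.1 and 10.0.0.2 share the
key 10.0, which is why -u keeps only one of them.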

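The ordering the manual gives for -h (sign first, then SI suffix, then
numeric value) can be sketched the same way. Again this is a rough model
and not the real implementation: human_key is a hypothetical name, the
suffix ranks are read off the manual text, and the treatment of negative
numbers (larger suffix sorts earlier) is an assumption.

```python
import re

# Suffix order as listed in the manual: empty, then k or K, then MGTPEZY.
_SUFFIX_RANK = {"": 0, "k": 1, "K": 1, "M": 2, "G": 3, "T": 4,
                "P": 5, "E": 6, "Z": 7, "Y": 8}

# Initial numeric string with an optional SI suffix directly after it.
_HUMAN_PREFIX = re.compile(r"[ \t]*(-?)(\d*)(?:\.(\d*))?([kKMGTPEZY]?)")


def human_key(line):
    """Return a (sign, suffix rank, value) tuple modelling the -h key."""
    m = _HUMAN_PREFIX.match(line)  # always matches; every part is optional
    sign, whole, frac = m.group(1), m.group(2), m.group(3) or ""
    suffix = m.group(4)
    value = float((whole or "0") + "." + (frac or "0"))
    if value == 0:
        return (0, 0, 0.0)
    if sign:
        # Assumption: for negative numbers the order is mirrored.
        return (-1, -_SUFFIX_RANK[suffix], -value)
    return (1, _SUFFIX_RANK[suffix], value)


if __name__ == "__main__":
    data = ["1M", "12345K", "2", "0", "-1K"]
    for line in sorted(data, key=human_key):
        print(line)
```

In this model '12345K' sorts before '1M', matching the example in the
manual, and two lines that differ only after the suffix get equal keys,
so -u would keep just one of them.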