Re: question about behavior of sort -n -t,

Eric Blake Wed, 09 Oct 2013 16:10:38 -0700

On 10/09/2013 03:06 PM, Gabriel Gaster wrote:
>> No, because POSIX requires that -n parse as many characters as
>> possible regardless of locale, unless you explicitly ask to limit
>> the sort to a specific key.
> 
> 
> That's interesting. Could you perhaps point me to that section (if you
> know it off the top of your head)? The POSIX requirement that -n parse
> as many characters regardless of locale seems to directly
> contradict the other requirement (that at least made sense to me)
> that you mentioned earlier that -n parse as many characters until
> it sees a non numeric (which is locale dependent).


http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

>> -n
>>     Restrict the sort key to an initial numeric string, consisting of 
>> optional <blank> characters, optional minus-sign, and zero or more digits 
>> with an optional radix character and thousands separators (as defined in the 
>> current locale), which shall be sorted by arithmetic value. An empty digit 
>> string shall be treated as zero. Leading zeros and signs on zeros shall not 
>> affect ordering.
>> 
>> -t  char
>>     Use char as the field separator character; char shall not be considered 
>> to be part of a field (although it can be included in a sort key).

As I read that, I see no limit on length; '-n' (and any other sort key)
is free to snarf up characters including the field boundary that
separates what is otherwise multiple fields, unless you use -k to state
otherwise.  In the absence of any other limit, how many characters get
snarfed depends on the locale definition of radix character, thousands
separator, and any other locale-specific numeric forms.
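
To make that concrete, here is a minimal sketch of my own (it is not from the POSIX text, and the sample data and variable names are made up for illustration). It pins the locale to C, where ',' is not a thousands separator, so the result should not depend on which locales are installed:

```shell
# Two records sharing the same first comma-separated field.
input='1,10
1,5'

# No -k: the key is the whole line.  In the C locale -n stops at the
# comma, so both keys are 1; -s keeps the input order for the tie.
no_key_c=$(printf '%s\n' "$input" | LC_ALL=C sort -s -n -t,)

# With -k2,2 the key is only the second field, so 5 sorts before 10.
field2_c=$(printf '%s\n' "$input" | LC_ALL=C sort -s -n -t, -k2,2)

printf 'no -k:\n%s\n-k2,2:\n%s\n' "$no_key_c" "$field2_c"
```

In a locale whose thousands separator is ',' (en_US.UTF-8 is a likely candidate, assuming it is installed), the first command may instead treat the comma as part of the number and order the lines differently.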

> 
>> Perhaps less likely to be used in real life, but still apropos to
>> the example:
>> $ printf '1202\n2011\n' | LC_ALL=C sort --debug -t0 -s -n -k1,1
>> sort: using simple byte comparison
>> 2011
>> _
>> 1202
>> __
>> $ printf '1202\n2011\n' | LC_ALL=C sort --debug -t0 -s -n
>> sort: using simple byte comparison
>> 1202
>> ____
>> 2011
>> ____
>> And you'll get the same behavior on Solaris or BSD sort (at least,
>> assuming they don't have blatant POSIX compliance bugs). Once you
>> understand WHY the above example has two different sorts, based on
>> whether -k is used, you'll understand why we can't stop parsing -n
>> at a comma even for -t, in a non-C locale.
>>
> 
> I understand why the above examples give two different sorts right
> now. I just think that, in your example, -t0 should mean that 0 is no longer
> a numeric character but a field-separator (regardless of locale) and 
> therefore that sort should stop on the first line at 2.

Admittedly, that might be a nice intuitive meaning; but it's not
historically accurate, so POSIX didn't specify it as such - and we can't
change it without risking breaking someone that depends on POSIX
semantics.  Without -k to stop things, the -t0 means that '0' serves as
BOTH a separator AND a numeric character - you are sorting on numbers
that span multiple fields.  The only way to make numeric parsing stop at
a field boundary is to use -k to tell sort to stop its key comparison at
that boundary (or to add a new option to request something different
than POSIX, but we're reluctant to add new options to sort that would be
very corner case in their usage).
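
As a rough sketch of the -k approach (again my own illustration with made-up data, not something taken from the standard), bounding each key at its field lets you sort numerically field by field:

```shell
# Hypothetical comma-separated records: sort by field 1, then field 2.
records='10,2
9,30
10,1'

# -k1,1n and -k2,2n each stop at the end of their field, so -n never
# reads past the separator.
sorted=$(printf '%s\n' "$records" | LC_ALL=C sort -t, -k1,1n -k2,2n)

printf '%s\n' "$sorted"
```

Because each key ends at its field boundary, the comma is never inside a key, so I would expect the same ordering in other locales as well.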

>> Rather, the lack of -k determines how far -n will parse, regardless
>> of locale; it's just that some locales let -n parse farther than
>> others.
> 
> 
> Don't you actually mean here that "the lack of -k determines how far -n will
> parse, depending on locale"?

Or even: "the lack of -k has a locale-independent effect of letting -n
parse as far as possible; then -n has a locale-dependent effect of how
far that actually is".

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
