bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 10/8/21 7:32 PM, Pádraig Brady wrote:
> it's not a thousands separator, rather a grouping character,
> and groups can be in 2, 3, 4, and even 5.

Sure, but 'sort' could determine the group sizes from the locale, and reject digit strings that are formatted improperly according to the group-size rules. (Not that I plan to write the code to do that.)
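As a rough illustration of the kind of check suggested here, the sketch below accepts a digit string only if it is ungrouped or grouped in threes with ','. The helper name `well_grouped`, the fixed group size of 3, and the ',' separator are all assumptions made for the example; a real implementation would presumably take the sizes from the locale (for instance from `locale grouping`), and this is not code from 'sort'.

```shell
# Hypothetical check, not part of sort: succeed if "$1" is either a plain
# digit string or digits grouped in threes with ','.
# Assumes group size 3 and ',' as the grouping character.
well_grouped() {
  printf '%s\n' "$1" | grep -Eqx '([0-9]+|[0-9]{1,3}(,[0-9]{3})+)'
}

for s in 1,234 1234 12,34 0,9; do
  if well_grouped "$s"; then
    echo "$s: accepted"
  else
    echo "$s: rejected"
  fi
done
```

Under these assumptions the loop accepts 1,234 and 1234 and rejects 12,34 and 0,9, so a key such as 0,9 would no longer be read as the number 9.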
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 08/10/2021 21:48, Paul Eggert wrote:
> On 10/8/21 6:37 AM, Pádraig Brady wrote:
>> The difference here is due to ',' being treated as a thousands sep,
>> not a decimal point.
>
> Oh, thanks. Of course! I should have figured that out myself.
>
> It is unfortunate that "," is treated as a thousands separator even
> though it's obviously not one (as it's not followed by 3 decimal
> digits). I don't think POSIX requires this behavior; it's not clear
> to me that POSIX even allows it.

Well, in general it's not a thousands separator, rather a grouping character, and groups can be in 2, 3, 4, and even 5. So I don't think we should change the logic here.

> This bug report suggests that we should alter the code so that
> 'sort -n' acts more like common practice, and requires thousands
> separators to be in the right places in order to treat nearby digits
> as part of the number. Alternatively, we could document the existing
> behavior (even if it's not clear that it conforms to POSIX).

What we can do is have --debug warn when there is an overlap between --field-separator and the grouping and decimal characters when using numeric keys. I'll have a look at that tomorrow.

cheers,
Pádraig
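The overlap check proposed for --debug can be approximated outside of 'sort' with the `locale` utility. The sketch below is only an illustration: `check_sep` is a hypothetical helper, the warning wording is made up, and it relies on `locale thousands_sep` and `locale decimal_point` printing the current locale's LC_NUMERIC values.

```shell
# Hypothetical helper, not sort's actual code: warn if the given field
# separator is also the locale's grouping or decimal character.
check_sep() {
  sep=$1
  grp=$(locale thousands_sep)   # empty in the C locale
  dec=$(locale decimal_point)
  if [ -n "$grp" ] && [ "$sep" = "$grp" ]; then
    echo "warning: field separator '$sep' is also the grouping character"
  elif [ "$sep" = "$dec" ]; then
    echo "warning: field separator '$sep' is also the decimal point"
  fi
}

# Run in a subshell so the locale override does not leak out.
(export LC_ALL=C; check_sep .)
(export LC_ALL=C; check_sep :)
```

In the C locale the first call prints the decimal-point warning and the second prints nothing; in a locale such as en_US.UTF-8, `check_sep ,` would be expected to report the grouping-character overlap seen in this report.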
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 10/8/21 6:37 AM, Pádraig Brady wrote:
> The difference here is due to ',' being treated as a thousands sep,
> not a decimal point.

Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even though it's obviously not one (as it's not followed by 3 decimal digits). I don't think POSIX requires this behavior; it's not clear to me that POSIX even allows it.

This bug report suggests that we should alter the code so that 'sort -n' acts more like common practice, and requires thousands separators to be in the right places in order to treat nearby digits as part of the number. Alternatively, we could document the existing behavior (even if it's not clear that it conforms to POSIX).
bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
On 04/10/2021 21:01, Paul Eggert wrote:
> On 10/4/21 08:58, Pádraig Brady wrote:
>> The --debug option points out the issue:
>>
>>   $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>>   sort: key 1 is numeric and spans multiple fields
>>   1,a
>>   _
>>   ___
>>   0,9
>>   ___
>>   ___
>
> As Juncheng points out, it is a bit odd that -n and -g disagree here,
> even in locales where ',' is not a decimal point. For example:
>
>   $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
>   sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
>   sort: key 1 is numeric and spans multiple fields
>   0,9
>   _
>   ___
>   1,a
>   _
>   ___
>   $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
>   sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
>   sort: key 1 is numeric and spans multiple fields
>   1,a
>   _
>   ___
>   0,9
>   ___
>   ___

The difference here is due to ',' being treated as a thousands sep, not a decimal point. So Juncheng, to specifically answer your question, 0,9 is being interpreted as 9, which sorts after 1,a.

For example, consider:

  $ printf '%s\n' 1,a 0,900 | sort -s -k1,1g --debug
  0,900
  _
  1,a
  _
  $ printf '%s\n' 1,a 0,900 | sort -s -k1,1n --debug
  1,a
  _
  0,900
  _

Given the various groupings possible (depending on locale one can group in 2, 3, 4, 5 digits) we effectively just ignore the grouping separator in numeric mode, hence the difference.

Note that in locales where ',' is a decimal point we do get consistent order between -g and -n, as expected:

  $ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
  sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
  0,9
  ___
  1,a
  __
  $ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
  sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
  0,9
  ___
  1,a
  __

For completeness, we do have another issue with grouping separators, where we don't support multi-byte separators appropriately. For example, fr_FR.utf8 uses "narrow non breaking space" as the separator, which we don't support:

  $ sep=$(LC_ALL=fr_FR.utf8 locale thousands_sep)
  $ printf '%s\n' 0800 "0${sep}900" | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
  sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
  0 900
  _
  0800

cheers,
Pádraig
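To make the "-n ignores the grouping separator" explanation concrete, the following sketch approximates what the numeric comparison effectively sees for a key when ',' is the locale's grouping character. The helper `n_view` is hypothetical and deliberately simplified (non-negative integers only, no decimal point, no leading blanks); it is not how 'sort' is implemented.

```shell
# Hypothetical helper: drop every ',' (assumed to be the grouping
# character), then keep only the leading run of digits.
n_view() {
  printf '%s\n' "$1" | sed 's/,//g; s/^\([0-9]*\).*/\1/'
}

n_view '0,9'   # prints 09, i.e. the value 9
n_view '1,a'   # prints 1
```

Seen this way, 0,9 compares as 9 and 1,a as 1, which matches the -n ordering shown above, whereas -g stops at the ',' and reads 0 and 1.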