Support in sort for human-readable numbers

Vitali Lovich Fri, 02 Jan 2009 15:33:20 -0800

I've read the proposed patches that have been batted around on the mailing
list (after coming up with my own implementation :D of course).  My proposed
solution is less generic, but I believe more robust, than the other
approaches.


I've laid out my reasoning below, and I've also posted it as a bug on
Launchpad to track this issue: 313152 <https://bugs.launchpad.net/bugs/313152>.  The
patch is against 6.10 instead of trunk mainly because I was too lazy to get
the build-system set up on Ubuntu.  That being said, I'm pretty sure the
patch should still work against the trunk.  In any case, if it's necessary,
I could also do the diff against the trunk.

Code review?
What would I need to do to get this mainlined (aside from adding the
documentation changes)?

REASONING:

One of my major assumptions is that all the numbers are well formatted.

In other words, there's an explicit demarcation in the number line (at least
internal to the input being sorted) after which the suffix increases and the
number starts again near 0.  For instance, if M represents 1050 Kilobytes,
then there's no 1051K - it's represented as 1.001M or something along those
lines.  Again, this would only rely on the input being internally consistent
- sort needs no knowledge or hints of what those suffixes represent.
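
To make the assumption concrete, here is a minimal sketch in Python (not the C
code from the patch).  The names SUFFIX_ORDER and human_key are hypothetical,
the suffix ordering is an assumption, and inputs are assumed to be
non-negative.  Decimal is used only to keep the sketch short; the point is
that the suffix rank is compared first and the number only breaks ties, so
nothing needs to know what a suffix is actually worth.

```python
from decimal import Decimal

# Assumed ordering of suffixes, smallest to largest (illustrative only).
SUFFIX_ORDER = "KMGTPEZY"


def human_key(field):
    """Return (suffix rank, number) for a field such as '1.5M' or '512'.

    Hypothetical helper: relies on the input being internally consistent,
    so a larger suffix always means a larger value.
    """
    field = field.strip()
    last = field[-1:].upper()  # suffix is treated as case insensitive
    if last and last in SUFFIX_ORDER:
        return (SUFFIX_ORDER.index(last) + 1, Decimal(field[:-1]))
    # No recognised suffix: rank 0 sorts before every suffixed value.
    return (0, Decimal(field))


sizes = ["1.5M", "900K", "2G", "512", "1.001M"]
print(sorted(sizes, key=human_key))
```

Note that 900K sorts before 1.001M purely because K ranks below M; no
multiplier (1000 or 1024) is ever applied.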

Also, there can be no exponential numbers in this mode, mainly because
it's unclear whether an `E' represents the beginning of an exponent or an
exabyte.  Both would be uncommon as use cases: exabytes are really, really
big right now, and exponents would be meaningless since they could only be
used for extremely small numbers or for numbers bigger than a Y suffix.
Of the two, exponents lose out on consistent behaviour and flexibility
(suffixes can be extended much more easily, in a consistent manner, without
worrying about precision).  Also, at the end of the day, the common use case
is the du & df utilities (at least, those are the only ones for which I
consistently see this come up as an issue, presumably because ls has its own
internal sort).
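
The ambiguity can be illustrated with a short Python sketch.  The helper
split_number_suffix is hypothetical and is not how the patch parses its
input; it just shows that the same text reads differently depending on
whether `E' is taken as an exponent marker or as a suffix.

```python
import re


def split_number_suffix(text):
    """Split text into (digits, single-letter suffix); suffix may be ''.

    Hypothetical helper: reads a plain decimal number, then at most one
    letter, and ignores anything after that.
    """
    match = re.match(r"\s*(\d+(?:\.\d*)?)([A-Za-z]?)", text)
    if match is None:
        raise ValueError("no leading number in %r" % text)
    return match.group(1), match.group(2)


# Read as scientific notation, "1E3" is one thousand.
print(float("1E3"))
# Read with a suffix, the same text is the number 1 with suffix E.
print(split_number_suffix("1E3"))
```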

The suffix is case insensitive - `k' is equivalent to `K'.  There are
arguments that can be made either way, and I could easily be persuaded on
this issue (maybe even add a flag to determine behaviour in this case).

The advantage this has is that the code is far simpler, faster, and more
accurate.
It's simpler because there's no need to worry about what the suffix actually
represents (power of 10, power of 2).
It's faster because there's no expensive conversion to a double as with the
other proposed solutions I've seen.
It's more accurate because it uses the numeric string comparison rather than
converting to a numerical form which could have precision & overflow issues.
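
The accuracy point can be sketched as follows, again in Python rather than
the C of the patch.  compare_decimal_strings is a hypothetical function
written for this illustration; it assumes non-negative, plain decimal
strings and compares them digit by digit, never building a floating-point
value.

```python
def compare_decimal_strings(a, b):
    """Compare two non-negative decimal strings; return -1, 0, or 1.

    Hypothetical sketch: works on the digits directly, so there is no
    rounding and no overflow however long the numbers are.
    """
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")
    # Leading zeros do not change the value.
    a_int = a_int.lstrip("0")
    b_int = b_int.lstrip("0")
    # More integer digits means a larger number.
    if len(a_int) != len(b_int):
        return -1 if len(a_int) < len(b_int) else 1
    # Pad the fractions to the same width so that plain string comparison
    # of the digits matches numeric order.
    width = max(len(a_frac), len(b_frac))
    a_digits = a_int + a_frac.ljust(width, "0")
    b_digits = b_int + b_frac.ljust(width, "0")
    return (a_digits > b_digits) - (a_digits < b_digits)


big, bigger = "9007199254740992", "9007199254740993"
print(float(big) == float(bigger))           # doubles cannot tell them apart
print(compare_decimal_strings(big, bigger))  # the digit comparison can
```

The two values differ by one, but both round to the same double, which is
the kind of precision issue a conversion-based approach can run into.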
_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
