On 09 Sep 2020 17:46:36 -0700, L A Walsh wrote:
>
> > On 9/8/2020 9:27 AM, Glenn Golden wrote:
> > The attached patchset addresses a minor issue with program behavior vs.
> > documentation of the df, du, and ls tools from coreutils-8.32, when using
> > the --si option.
> >
> > It resurrects an issue that was brought up in 2014 [3] and eventually
> > closed in 2018 [4] with a wontfix (after minimal discussion in the
> > intervening time).
> >
> >
> > Summary
> > -------
> >
> > Output from df, du, ls tools with the --si option display results using
> > single-letter units suffixes "k", "M", "G", etc., rather than "kB", "MB",
> > "GB".
> >
>
> With or without --si?
>

The unpatched code uses "k", "M", "G", etc. when the SI option is used.
The patched code uses "kB", "MB", "GB" when the SI option is used. This
accords with coreutils.info Section 2.3, and is also self-consistent with
all other usage of suffixes "kB", "MB", etc., to imply decimal base.

>
> If you want to change 'si' output, I'm not against that, however I am very
> much against changing default or -h format.
>

The proposed patch has no (intended) effect on any behavior other than the
suffixes emitted when the SI option (in any of its various forms) is used.

>
> I don't need to know 'B' as a suffix when talking about disks.
> Disks under an OS report space in base-2 multiples of bits (2**3 bits=1B).
>

Fine, but within the context of coreutils, the "B" appended to a
single-letter suffix doesn't really mean "bytes". It's used in an ersatz
way to indicate that the computation base for the associated numerical
value is decimal. According to coreutils.info Section 2.3, a bare "M"
implies binary base, and "MB" implies decimal base.

So that entire sub-issue -- whether the B "ought" or "ought not" be present
in the blocksize indicator suffix -- is not limited only to the SI option;
there is an entire family of blocksize options, selectable using
--block-size=XXX, that indicate which base (decimal or binary) is to be used
to compute the numerical value, and which indicator suffixes (M, MB, MiB)
are to be appended for each case.

With the patched code, the SI option would simply be one among several
options that append "B" to a single-letter suffix. Other blocksize options,
besides SI, already append B (e.g. --block-size=MB) with the semantic that
"MB" means 1000^2. So the patch is not introducing the use of "B" within
coreutils. It simply makes the semantic of the appended "B" globally
self-consistent among those tools in all cases when it _is_ used.
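
To make the suffix rule concrete, here is a minimal Python sketch of the
mapping as I read it from Section 2.3. It is an illustration only, not the
coreutils implementation; the function name suffix_multiplier and the
PREFIX_POWER table are hypothetical, and only the k/K, M, G, T prefixes are
covered.

```python
import re

# Illustrative only: NOT the coreutils code. Names here are hypothetical.
# Rule sketched: bare prefix or "iB" tail -> binary base (1024),
#                "B" tail                 -> decimal base (1000).
PREFIX_POWER = {"k": 1, "M": 2, "G": 3, "T": 4}


def suffix_multiplier(suffix: str) -> int:
    """Return the byte multiplier implied by a suffix such as 'M', 'MiB', 'MB'."""
    match = re.fullmatch(r"([kKMGT])(iB|B)?", suffix)
    if match is None:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    prefix, tail = match.groups()
    # Upper- and lower-case k are treated alike, per the historical behavior.
    power = PREFIX_POWER["k" if prefix == "K" else prefix]
    base = 1000 if tail == "B" else 1024
    return base ** power


if __name__ == "__main__":
    for s in ("M", "MiB", "MB"):
        print(s, suffix_multiplier(s))
```

Under this rule a bare "M" can only ever mean 1024^2, which is the invariant
the unpatched SI option breaks.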

To reiterate even more directly: What the proposed patch does is make the
behavior of {df, du, ls} fully consistent with respect to the relationship
between computation base and the associated indicator suffixes. With the
patched code, the following are always true, with no exceptions whatsoever:

    bare "M" always means 1024^2, no exceptions
    "MiB"    always means 1024^2, no exceptions
    "MB"     always means 1000^2, no exceptions

The above behavior is also what coreutils.info Section 2.3 specifies, and is
also consistent with the "coreutils gotchas" (ref [2] from original post).

    "In general the units representations in coreutils are unfortunate,
     but an accident of history. POSIX specifies 'k' and 'b' to mean 1024
     and 512 respectively. Standards wise 'k' should really mean 1000 and
     'K' 1024. Then extending from that we now have (which we can't change
     for compatibility reasons):

         k=K=kiB=KiB=1024
         kb=KB=1000
         M=MiB=1024^2
         MB=1000^2"

With the unpatched 8.32 code, the above nice consistency is voided by the
use of the SI option:

    bare "M" _sometimes_ means 1024^2 and _sometimes_ means 1000^2:
        * When the SI option is used, M means 1000^2
        * When the SI option is not used, M means 1024^2
    "MiB"    always means 1024^2, no exceptions
    "MB"     always means 1000^2, no exceptions

Furthermore, the "bare M" semantic inconsistency can be subtle, not always
discernible simply by inspecting the commandline, because the SI option can
be invoked via environment variables. (As an aside, this is how I got burned
by it, motivating the patch. Imo, this is a nasty inconsistency.)

>
> In memory and disk utils used in operating systems, the assumed unit is
> the base-2 unit, the Byte and base-2 multiples thereof. You cannot and
> should not try to mix bases when reporting sizes -- if you use Bytes (as
> on a computer), then K,M,G,... are base-2 multiples of a base-2 unit.
>

Yes, and that's how Section 2.3 describes the semantics of those bare suffixes.
But coreutils also supports, for historical reasons unrelated to the
proposed patch, the use of kB, MB, GB, etc. to indicate decimal base.

>
> If you are talking 'b'its, it seems that is the closest practical unit
> for measurement of information. With a single unit, base10 kbit, gbit, mbit
> or kb,gb,mb seem fine. There is no possibility of confusion with prefixes
> for fractional values as you can't have a milli-bit or such.
>
> I find kB confusing, since it is using the lower case 'k' as used for
> km (kilometer) and shouldn't be used where 1024 is meant.
>

The upper- vs. lower-case k issue is historical and outside the scope of the
proposed patch. The patch leaves existing {k,K}-case behavior exactly as-is.
See comments from [2] (above) regarding the history of this wart.

>
> I.e. when measuring values of space -- standard hard disks require 512B.
> Various utils also use 1K as a disk space size and recent hard disks
> have a 4K sector size.
>

Fine, but coreutils.info Section 2.3 makes it explicit that the blocksizes
reported by {df, du, ls} are not to be interpreted as having any particular
relationship to filesystem (or implicitly, on-disk) block sizes:

    "The block size used for display is independent of any file system
     block size. Fractional block counts are rounded up to the nearest
     integer."

>
> When reporting space, you can't allocate fractional sectors so they have to
> be a multiple of 512, 1K or 4K and --si should have no place among base-2
> machines regarding disk space.
>

OK, but again, the entire issue of whether coreutils "ought" or "ought not"
support options allowing the display of disk space in decimal base is
outside the scope of the proposed patch. Whether one agrees with it or not,
coreutils does presently support decimal base, and that support is not
limited only to the SI option. The patch has no effect on that support; its
only effect is to fascistly force consistent use of the suffixes implying
binary vs.
decimal base, when decimal base output format is requested by the user,
e.g. via --block-size=MB or --block-size=si.

>
> Memory is the same. It isn't allocated in numbers that are multiples of 10.
> Anyone using nomenclature suggesting such, is demonstrating how little they
> know about computers -- and those who use computers should listen to them
> about how to describe space?
>
> That said, since metric more famous usage for prefixes >1 has been
> 'k', I prefer lower case for metric to be consistent with long standing
> usage of 'km' = kilometer. Larger values, _I_ feel should be consistent
> with metric's largest usage: How often do you see a sign showing anything
> in mega-meters or giga-meters. You ever see anything measured in
> mega-liters (let alone giga or tera liters).
>
> Metric has standard units where the prefixes apply to the singular unit.
> A Byte isn't a singular unit of information, a 'bit' is. Therefore standard
> 's.i.' units shouldn't really be used with non-singular units (is there
> a counter example, like where one talks about mega-[some non unary unit],
> like a gross of eggs being 1.2 deka-dozen eggs (I think)?
>
> The main problem is that base-10 isn't a good fit for a base-2 environment,
> though I would regularly accept base-10 prefixes with bits.
>
> So can you reserve lower case for SI, since they use lowercase 'k' for
> 1000-m and leave Uppercase (with or without B) for base-2?
>

I would resist doing that, because the effect in the wild would be even more
extensive than the proposed patch. The proposed patch is very simple: It has
no effect on whether "k" is displayed in upper- or lower-case, and
intentionally so: Because changing the case as you propose above ("reserve
lower case for SI") would affect more than just scripts that use the SI
option.
For example:

    $ df --block-size=KB /mnt/test      # Unpatched or patched code
    Filesystem     1kB-blocks    Used  Available Use% Mounted on
    /dev/sda7       1879782kB  2888kB  1857459kB   1% /mnt/test

If the behavior were changed as you propose, then "kB" would change to "KB",
even though the SI option was not used. It was an explicit goal of the
proposed patch to have zero effect on existing behavior _except_ when the SI
option is used, in which case its only effect is to substitute "MB" for bare
"M", so as to enforce global consistency in the semantics of those suffixes,
among all the tools.

>
> It may not be what's authoritative, but it is what makes sense.
>

I don't disagree with that, but given the extreme (and understandable)
sensitivity to the output-scraping issue, the patch as-is seems to be about
the best one can do without making that issue worse: It affects only the
output produced when the SI option is explicitly specified by the caller,
and leaves everything else exactly as-is, warts [2] and all.

>
> It would also allow scripts that use existing behavior in non-base10 to
> continue working. Though might break scripts using --si. But how many
> scripts would use that?
>

Imo, very few, which is why I proposed the patch as it is: The patch brings
the suffix indicators into 100% self-consistency within coreutils, and at
the only cost of possibly breaking scripts which actually use the SI option.
And my sense -- similar to yours (I'm assuming, from your phrasing above) --
is that there are probably not many of those.

If that's true, then it seems like it may be a worthwhile tradeoff:
Imposition of self-consistent suffix semantics once and for all, vs.
breaking a (probably) small number of scripts in the wild that explicitly
use the SI option.

Glenn Golden