Re: numfmt (=print 'human' sizes) updates

2013-02-04 Thread Pádraig Brady

I've squashed what we have now to:
http://www.pixelbeat.org/patches/coreutils/numfmt-squashed.patch.xz
which is unchanged from the previous except for a bug fix
in my negative handling, and further tweaks to TODO items.

tests, clang-analyzer, valgrind are Ok with it at least.

I'll push the patch in the morning.

thanks,
Pádraig.



Re: numfmt (=print 'human' sizes) updates

2013-01-08 Thread Assaf Gordon
Hello,

The attached patch adds 'numfmt' to the coreutils documentation.

Regards, 
 -gordon


numfmt.12.patch.xz
Description: application/xz


Re: numfmt (=print 'human' sizes) updates

2012-12-27 Thread Assaf Gordon
Hello,

Assaf Gordon wrote, On 12/26/2012 05:40 PM:
 
 Attached is an updated numfmt, with the following two changes:
 1. --format support
 2. optionally ignoring input errors.

The attached patch (incremental to the above full patch) adds few more tests, 
and fixes 4 issues found with the clang static analyzer.

There's no change in functionality.

-gordon



numfmt.11.patch.xz
Description: application/xz


Re: numfmt (=print 'human' sizes) updates

2012-12-26 Thread Assaf Gordon
Hello, 

Pádraig Brady wrote, On 12/21/2012 12:42 PM:
 
 I'm starting to think the original idea of having a --format option
 would be more a more concise, well known and extendible interface
 rather than having --padding, --grouping, --base, ...
 It wouldn't have to be passed directly to printf, and could
 be parsed and preprocessed something like we do in seq(1).
 

Regarding 'format' option, there are some intricacies that are worth discussing:

1. Depending on the requested conversion, the output can be a string (e.g. 
1.4Ki) or a long double (e.g. 140).

2. Internally, the program uses long doubles - so the real format is %Lf - 
regardless of what the user will give (e.g. %f).

3. printf accepts all sorts of options, some of which aren't relevant to 
numfmt, or only relevant when printing non-humanized values.
e.g.:
$ LC_ALL=en_US.utf8 seq -f %0'14.5f 1000 1001
0001,000.0
0001,001.0

4. The assumption was that humanized numbers are always maximum 4 characters in 
SI/IEC (e.g. 1024 or 4.5M) or 5 characters with iec-i (e.g. 999Ti).
With the new 'format', if given %'2.9f - should the output be still 4 
characters (e.g. 4.5T), or respect the .9 format (e.g. 4.5T) ? 
and does the suffix character counts in the 2.9 format ?


My preference is to keep things simple, and accept just a limited subset of the 
format syntax:
1. grouping (the ' character)
2. padding (the number after '%' and before the 'f'
3. alignment (optional '-' after '%')
4. Any prefix/suffix before/after the '%' option.
5. Accept just %f, but internally treat it as '%s' or '%Lf', depending on the 
output.

All other options will be silently ignored, or trigger errors.

Example:
$ numfmt --format xx%20fxx --to=si 5000
[[ internally, treats as --padding 20 ]]
xx5.0Kxx

$ numfmt --format xx%'-10fxx 5000
[[ internally, treats as --padding -10 --grouping ]]
xx5,000 xx

$ numfmt --format xx%0#'+010llfxx 5000
[[ reject as 'too complicated' / unsupported printf options ]]


WDYT?

-gordon





Re: numfmt (=print 'human' sizes) updates

2012-12-26 Thread Pádraig Brady

On 12/26/2012 03:28 PM, Assaf Gordon wrote:

Hello,

Pádraig Brady wrote, On 12/21/2012 12:42 PM:


I'm starting to think the original idea of having a --format option
would be more a more concise, well known and extendible interface
rather than having --padding, --grouping, --base, ...
It wouldn't have to be passed directly to printf, and could
be parsed and preprocessed something like we do in seq(1).



Regarding 'format' option, there are some intricacies that are worth discussing:

1. Depending on the requested conversion, the output can be a string (e.g. 
1.4Ki) or a long double (e.g. 140).


I would expect the numeric format applies to the number portion only,
so there should be little confusion there.


2. Internally, the program uses long doubles - so the real format is %Lf - regardless 
of what the user will give (e.g. %f).


Yes. There is a long_double_format() function in seq.c that converts from
specified format to L.. internally.


3. printf accepts all sorts of options, some of which aren't relevant to 
numfmt, or only relevant when printing non-humanized values.
e.g.:
 $ LC_ALL=en_US.utf8 seq -f %0'14.5f 1000 1001
 0001,000.0
 0001,001.0

4. The assumption was that humanized numbers are always maximum 4 characters in SI/IEC (e.g. 1024 or 
4.5M) or 5 characters with iec-i (e.g. 999Ti).
With the new 'format', if given %'2.9f - should the output be still 4 characters (e.g. 4.5T), or respect 
the .9 format (e.g. 4.5T) ? and does the suffix character counts in the 2.9 format ?


I would expect that specifying a precision after the '.' with 
--to={si,iec,iec-i}
would override any default precision, and that the width for the field would 
adjusted accordingly.
The suffix character would be significant to the field/padding width rather 
than the precision
(similarly to the first point that format only directly applies to the numeric 
portion).
So:

$ printf %s\n 94500 95000 | numfmt --to=iec
 94.5K
 95.0K
$ printf %s\n 94500 95000 | numfmt --to=iec-i
 94.5Ki
 95.0Ki
$ printf %s\n 94500 95000 | numfmt --to=iec-i --format='[%f]' # Keep default 
of %7.1Lf for --to=iec-i
[ 94.5Ki]
[ 95.0Ki]
$ printf %s\n 94500 95000 | numfmt --to=iec-i --format='[%g]' # %g - %7Lg 
for --to=iec-i (adjusted for padding)
[ 94.5Ki]
[ 95  Ki]
$ printf %s\n 94500 95000 | numfmt --to=iec-i --format='[%10.f]' # Any 
padding or precision overrides defaults
[  94Ki]
[  95Ki]
$ printf %s\n 94500 95000 | numfmt --to=iec-i --format='%.f ' # Common would 
be to strip decimals and add a space
 94 Ki
 95 Ki


My preference is to keep things simple, and accept just a limited subset of the 
format syntax:
1. grouping (the ' character)
2. padding (the number after '%' and before the 'f'
3. alignment (optional '-' after '%')
4. Any prefix/suffix before/after the '%' option.
5. Accept just %f, but internally treat it as '%s' or '%Lf', depending on the 
output.


The above are consistent with my examples I think.


All other options will be silently ignored, or trigger errors.

Example:
 $ numfmt --format xx%20fxx --to=si 5000
 [[ internally, treats as --padding 20 ]]
 xx5.0Kxx

 $ numfmt --format xx%'-10fxx 5000
 [[ internally, treats as --padding -10 --grouping ]]
 xx5,000 xx

 $ numfmt --format xx%0#'+010llfxx 5000
 [[ reject as 'too complicated' / unsupported printf options ]]


Too complicated/unsupported might just completely override our
defaults and padding etc. and be passed to printf?
I.E. follow the it's better to ask forgiveness than permission idiom,
i.e. use as much logic below you as possible.

$ echo  | numfmt --format=xx%0#'+010llfxx\n
xx+1,111.00xx

cheers,
Pádraig.



Re: numfmt (=print 'human' sizes) updates

2012-12-21 Thread Pádraig Brady

On 12/14/2012 08:53 AM, Pádraig Brady wrote:

On 12/14/2012 05:38 AM, Assaf Gordon wrote:

Hello,

Attached is a slightly improved patch - minor code changes, and many more tests.
Line coverage is 98%, and branch coverage is now 93% , and most of the 
non-covered branches are simply unreachable (I'm checking the reachable ones).

The comments below still apply.


It's looking like a really useful cohesive command.

I verified all the new tests pass and
syntax check is fine too.
Your attention to detail is _much_ appreciated.

I'll note here some _future_ TODO items, which
don't need to be added for the initial version:
  1. Parse numbers containing grouping separators.
  2. Have --fields support multiple values.

I did some basic perf testing and verified no leaks with:
$ yes 1 2 | pv  /dev/null
^C09MB 0:00:03 [ 102MB/s] [   =   
  ]
$ yes 1 2 | (ulimit -v 2; src/numfmt --field 2 --to=iec --from=auto) | pv 
 /dev/null
^C.9MB 0:00:11 [6.32MB/s] [   =   
  ]
$ yes 1 2 | head -n10 | valgrind src/numfmt --field 2 --to=iec --from=auto 
21 | grep leaks
==23446== All heap blocks were freed -- no leaks are possible

I notice the UNIT options: seciont is mis-formatted by help2man.

I notice use of fprintf (stderr, ...) where error (0, 0, ...) should be used.

indent -nut src/numfmt.c makes lots of changes, most valid.


So on to some usage...

This shows auto padding works, with before and after displayed through tee.

$ df | head -n3 | tee (src/numfmt --header --field=2 --to=iec)
Filesystem   1K-blocks  Used Available Use% Mounted on
rootfs13102208  12544324424816  97% /
udev   1454732 0   1454732   0% /dev
Filesystem   1K-blocks  Used Available Use% Mounted on
rootfs 13M  12544324424816  97% /
udev  1.4M 0   1454732   0% /dev

This shows padding increases the field width as required in edge cases:

$ src/df /home | tee (src/numfmt --header --field=2 --from-unit=1024 
--to-unit=1 --group)
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1  100791728 71565732  24105996  75% /home
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 103,210,729,472 71565732  24105996  75% /home

I notice the unit on both --to=iec and --to=si is the same.
I was thinking that --to=iec would add a 'i' to distinguish?

$ /bin/ls -l | head -n4 | src/numfmt --to=iec --field=5
total 7252
-rw-rw-r--.  1 padraig padraig 92K Aug 23  2011 ABOUT-NLS
-rw-rw-r--.  1 padraig padraig 49K Dec 21 16:22 aclocal.m4
-rw-rw-r--.  1 padraig padraig3.6K Dec 21 16:15 AUTHORS

I notice we exit immediately with invalid numbers.
How about printing the warning, outputting the unparsable number as is
and continuing?

$ printf %s\n 1 one 2 | src/numfmt
1
src/numfmt: invalid number: 'one'


As suggested missing fields are ignored now...

$ echo -ne 1\n2 2\n | src/numfmt --field=2
1
2 2

...and we get warnings with --debug. However they mix awkwardly with
the output. Perhaps delay the warning until line causing the warning is 
displayed.

$ echo -ne 1\n2 2\n | src/numfmt --field=2 --debug
src/numfmt: no conversion option specified
1src/numfmt: input line is too short, no numbers found to convert in field 2

2 2

--devdebug should probably renamed to ---devdebug

I'm starting to think the original idea of having a --format option
would be more a more concise, well known and extendible interface
rather than having --padding, --grouping, --base, ...
It wouldn't have to be passed directly to printf, and could
be parsed and preprocessed something like we do in seq(1).

thanks!
Pádraig.



Re: numfmt (=print 'human' sizes) updates

2012-12-21 Thread Assaf Gordon
Hello Pádraig,

Thanks for the review and the feedback.

Pádraig Brady wrote, On 12/21/2012 12:42 PM:
 
 It's looking like a really useful cohesive command.
 

The attached patch addresses the following issues:

1. changed the ---devdebug option
2. incorporated most of 'indent -nut' recommendations. 
3. improved the usage string, to generate somewhat better man page.
  (help2man is a bit finicky about formatting, so it's not optimal). 

4. i suffix with iec input/output:

I understand the reason for always adding i, as it is The Right Thing to do.
But I think some (most?) people still treat single-letter suffix (K/M/G/T/etc.) 
as a valid suffix for both SI and IEC, and deduce the scale from the context of 
whatever they're working on. Forcing them to use Ki might be too obtrusive.
It could also mess-up automated scripted, when IEC scale is needed, but only 
with single-letter suffix (and there are many programs like that).

As a compromise, I've added yet another scale option: ieci .
When used with --from=ieci, a two-letter suffix is required.
When used with --to=ieci, i will always be appended.

Examples:

$ ./src/numfmt --to=iec 4096
4.0K
$ ./src/numfmt --to=ieci 4096
4.0Ki
$ ./src/numfmt --from=iec 4K
4096
$ ./src/numfmt --from=ieci 4Ki
4096
$ ./src/numfmt --from=auto 4Ki
4096
$ ./src/numfmt --from=auto 4K
4000

# 'ieci' requires the 'i':
$ ./src/numfmt --from=ieci 4K 
./src/numfmt: missing 'i' suffix in input: '4K' (e.g Ki/Mi/Gi)

# 'iec' does not accept 'i':
$ ./src/numfmt --from=iec 4Ki
./src/numfmt: invalid suffix in input '4Ki': 'i'


I hope this cover all the options, while maintaining consistent and expected 
behavior.
(Optionally, we can change iec to behave like ieci, and rename ieci to 
something else).


I will send --format and error message updates soon.

Regards,
 -gordon


numfmt.9.patch.xz
Description: application/xz


Re: numfmt (=print 'human' sizes) updates

2012-12-21 Thread Pádraig Brady

On 12/22/2012 12:51 AM, Assaf Gordon wrote:

Hello Pádraig,

Thanks for the review and the feedback.

Pádraig Brady wrote, On 12/21/2012 12:42 PM:


It's looking like a really useful cohesive command.



The attached patch addresses the following issues:

1. changed the ---devdebug option
2. incorporated most of 'indent -nut' recommendations.
3. improved the usage string, to generate somewhat better man page.
   (help2man is a bit finicky about formatting, so it's not optimal).


Yes it's awkward to work around the help2man constraints.


4. i suffix with iec input/output:

I understand the reason for always adding i, as it is The Right Thing to do.
But I think some (most?) people still treat single-letter suffix (K/M/G/T/etc.) as a 
valid suffix for both SI and IEC, and deduce the scale from the context of whatever 
they're working on. Forcing them to use Ki might be too obtrusive.
It could also mess-up automated scripted, when IEC scale is needed, but only 
with single-letter suffix (and there are many programs like that).

As a compromise, I've added yet another scale option: ieci .
When used with --from=ieci, a two-letter suffix is required.
When used with --to=ieci, i will always be appended.

Examples:

$ ./src/numfmt --to=iec 4096
4.0K
$ ./src/numfmt --to=ieci 4096
4.0Ki
$ ./src/numfmt --from=iec 4K
4096
$ ./src/numfmt --from=ieci 4Ki
4096
$ ./src/numfmt --from=auto 4Ki
4096
$ ./src/numfmt --from=auto 4K
4000

# 'ieci' requires the 'i':
$ ./src/numfmt --from=ieci 4K
./src/numfmt: missing 'i' suffix in input: '4K' (e.g Ki/Mi/Gi)

# 'iec' does not accept 'i':
$ ./src/numfmt --from=iec 4Ki
./src/numfmt: invalid suffix in input '4Ki': 'i'


I hope this cover all the options, while maintaining consistent and expected 
behavior.
(Optionally, we can change iec to behave like ieci, and rename ieci to 
something else).


I like the 'ieci' compromise.
Also because it has two 'i's in it, there will be
less confusion as to which is which between 'si' and 'ieci'.
I suppose an alternative to 'ieci' is 'ibi', but I slightly
prefer your 'ieci'.


I will send --format and error message updates soon.


thanks!
Pádraig.



Re: numfmt (=print 'human' sizes) updates

2012-12-21 Thread Eric Blake
On 12/21/2012 07:07 PM, Pádraig Brady wrote:

 As a compromise, I've added yet another scale option: ieci .
 When used with --from=ieci, a two-letter suffix is required.
 When used with --to=ieci, i will always be appended.

That's a little hard to read.  Would --from=iec-i be any easier, where
we are separating the IEC acronym from the output of 'i'?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: numfmt (=print 'human' sizes) updates

2012-12-18 Thread Assaf Gordon
Hello,

Attached is a first shot at documenting 'numfmt' .

Comments are welcomed,
-gordon


numfmt_doc.patch.gz
Description: GNU Zip compressed data


Re: numfmt (=print 'human' sizes) updates

2012-12-14 Thread Pádraig Brady

On 12/14/2012 05:38 AM, Assaf Gordon wrote:

Hello,

Attached is a slightly improved patch - minor code changes, and many more tests.
Line coverage is 98%, and branch coverage is now 93% , and most of the 
non-covered branches are simply unreachable (I'm checking the reachable ones).

The comments below still apply.


wow great stuff.


Assaf Gordon wrote, On 12/13/2012 01:02 AM:


Most of the previously raised issues have been addressed, except handling 
locale'd grouping in the input numbers (locale'd decimal-point is handled 
correctly).


Cool, we can address that if needed later.



Added support for header, auto-whitespace-padding, floating-point input .
Internally, all values are now stored as long double (instead of previously 
uintmax_t) - enables working with Yotta-scale values.

The following should now 'just work' :
 df | ./src/numfmt --header --field 2 --to=si
 ls -l | ./src/numfmt --header --field 5 --to=iec
 ls -lh | ./src/numfmt --header --field 5 --from=iec --padding=10

The --debug option now behaves more like sort's --debug: prints messages to 
STDERR about possible bad combinations and inputs (which are not fatal errors):

 $./src/numfmt --debug 6
 ./src/numfmt: no conversion option specified
 6


Maybe --verbose would be more appropriate, if this option is needed at all.
sort is exceedingly complex whereas hopefully numfmt will not.


The --devdebug option can be used to show internal states (perhaps will be 
removed once the program is finalized?).


Yes we may remove, or at least not document.



The test file 'tests/misc/numfmt.pl' contains many more tests and details about 
possible inputs/outputs.

If the functionality is acceptable, the next steps are cleaner code and better 
documentations.


I hope to review this, this weekend.

thanks!
Pádraig.



Re: numfmt (=print 'human' sizes) updates

2012-12-14 Thread Eric Blake
On 12/14/2012 01:53 AM, Pádraig Brady wrote:
 The --devdebug option can be used to show internal states (perhaps
 will be removed once the program is finalized?).
 
 Yes we may remove, or at least not document.

If you choose to not document it, then name it '---devdebug' (yes, with
three dashes), so that it doesn't interfere with any use of '--d' as an
unambiguous prefix of a documented long option starting with d.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: numfmt (=print 'human' sizes) updates

2012-12-13 Thread Assaf Gordon
Hello,

Attached is a slightly improved patch - minor code changes, and many more tests.
Line coverage is 98%, and branch coverage is now 93% , and most of the 
non-covered branches are simply unreachable (I'm checking the reachable ones).

The comments below still apply.

- gordon


Assaf Gordon wrote, On 12/13/2012 01:02 AM:
 
 Most of the previously raised issues have been addressed, except handling 
 locale'd grouping in the input numbers (locale'd decimal-point is handled 
 correctly).
 
 Added support for header, auto-whitespace-padding, floating-point input .
 Internally, all values are now stored as long double (instead of previously 
 uintmax_t) - enables working with Yotta-scale values.
 
 The following should now 'just work' :
 df | ./src/numfmt --header --field 2 --to=si
 ls -l | ./src/numfmt --header --field 5 --to=iec
 ls -lh | ./src/numfmt --header --field 5 --from=iec --padding=10
 
 The --debug option now behaves more like sort's --debug: prints messages 
 to STDERR about possible bad combinations and inputs (which are not fatal 
 errors):
 
 $./src/numfmt --debug 6
 ./src/numfmt: no conversion option specified
 6
 
 The --devdebug option can be used to show internal states (perhaps will be 
 removed once the program is finalized?).
 
 The test file 'tests/misc/numfmt.pl' contains many more tests and details 
 about possible inputs/outputs.
 
 If the functionality is acceptable, the next steps are cleaner code and 
 better documentations.
 



numfmt.8.patch.xz
Description: application/xz