Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-20 Thread Junio C Hamano
Michael Haggerty  writes:

> +static int parse_precheck(const char *s, unsigned int *flags)
> +{
> + const char *number;
> +
> + if (isspace(*s)) {
> + if (!(*flags & NUM_LEADING_WHITESPACE))
> + return -NUM_LEADING_WHITESPACE;
> + do {
> + s++;
> + } while (isspace(*s));
> + }
> +
> + if (*s == '+') {
> + if (!(*flags & NUM_PLUS))
> + return -NUM_PLUS;
> + number = s + 1;
> + *flags &= ~NUM_NEGATIVE;
> + } else if (*s == '-') {
> + if (!(*flags & NUM_MINUS))
> + return -NUM_MINUS;
> + number = s + 1;
> + *flags |= NUM_NEGATIVE;
> + } else {
> + number = s;
> + *flags &= ~NUM_NEGATIVE;
> + }
> +
> + if (!(*flags & NUM_BASE_SPECIFIER)) {
> + int base = *flags & NUM_BASE_MASK;
> + if (base == 0) {
> + /* This is a pointless combination of options. */
> + die("BUG: base=0 specified without NUM_BASE_SPECIFIER");
> + } else if (base == 16 && starts_with(number, "0x")) {
> + /*
> +  * We want to treat this as zero terminated by
> +  * an 'x', whereas strtol()/strtoul() would
> +  * silently eat the "0x". We accomplish this
> +  * by treating it as a base 10 number:
> +  */
> + *flags = (*flags & ~NUM_BASE_MASK) | 10;
> + }
> + }
> + return 0;
> +}

I somehow feel that a pre-processing that only _inspects_ part of
the string, without munging that string (e.g. notice '-' but feed
that to underlying strtol(3)) somewhat a brittle approach.  When I
read the above for the first time, I somehow expected that the code
would notice leading '-', strip that leading '-' and remember the
fact that it did so in the *flags, let the strtol(3) to parse the
remainder _and_ always make sure the returned result is not negative
(because that would imply that the original input had two leading
minuses and digits), and give the sign based on what this preprocess
found out in *flags, and then seeing that there is no sign of such
processing in the caller I scratched my head.

I still have not convinced myself that what I am seeing in the
base==16 part in the above is correct.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-20 Thread Eric Sunshine
On Wed, Mar 18, 2015 at 6:47 PM, Michael Haggerty  wrote:
> On 03/18/2015 07:27 PM, Eric Sunshine wrote:
>> On Tuesday, March 17, 2015, Michael Haggerty  wrote:
>>> Implement wrappers for strtol() and strtoul() that are safer and more
>>> convenient to use.
>>> + * The lowest 6 bits of flags hold the numerical base that should be
>>> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
>>> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
>>> + * detected automatically from the string's prefix.
>>
>> Does this restriction go against the goal of making these functions
>> convenient, even while remaining strict? Is there a strong reason for
>> not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
>> make it consistent with strto*l() without (I think) introducing any
>> ambiguity.
>
> I thought about doing this. If it were possible to eliminate
> NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
> good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
> namely, that an "0x" prefix, if present, is consumed. So
>
> parse_i("0xff", 16 | NUM_BASE_SPECIFIER, &result, &endptr)
>
> gives result==255 and endptr==s+4, whereas
>
> parse_i("0xff", 16, &result, &endptr)
>
> gives result==0 and entptr==s+1 (it treats the "x" as the end of the
> string).
>
> We could forgo that feature, effectively allowing a base specifier if
> and only if base==0. But I didn't want to rule out allowing an optional
> base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
> dispensed with entirely.

Making base==0 a special case doesn't have to mean ruling out or
eliminating NUM_BASE_SPECIFIER for the base==16 case. However, a
special case base==0 would make the API a bit non-orthogonal and
require extra documentation. But for those familiar with how strto*l()
treats base==0 specially, the rule of least surprise may apply.

> If you agree with that, then the remaining question is: which policy is
> less error-prone? My thinking was that forcing the caller to specify
> NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
> reminder that the two features are intertwined.

Since base==0 would unambiguously imply NUM_BASE_SPECIFIER, being able
to tersely say convert_i(s, 0, &result) would be a win from a
convenience perspective. However...

> Because another
> imaginable policy (arguably more consistent with the policy for base!=0)
> would be that
>
> convert_i(s, 0, &result)
>
> , because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
> prefix, and therefore indirectly only allows base-10 numbers.
>
> But I don't feel strongly about this.

I don't feel strongly about it either, and can formulate arguments
either way. Assuming you stick with your current design, if the
strictness of having to specify NUM_BASE_SPECIFIER for base==0 proves
too burdensome, it can be loosened later by making base==0 a special
case.

On the other hand, if you go with Junio's suggestion of choosing names
for these functions which more closely mirror strto*l() names, then
the rule of least surprise might suggest that a special case for
base==0 has merit.

>>> + * Return 0 on success or a negative value if there was an error. On
>>> + * failure, *result and *entptr are left unchanged.
>>> + *
>>> + * Please note that if NUM_TRAILING is not set, then it is
>>> + * nevertheless an error if there are any characters between the end
>>> + * of the number and the end of the string.
>>
>> Again, on the subject of convenience, why this restriction? The stated
>> purpose of the parse_*() functions is to parse the number from the
>> front of the string and return a pointer to the first non-numeric
>> character following. As  a reader of this API, I interpret that as
>> meaning that NUM_TRAILING is implied. Is there a strong reason for not
>> inferring NUM_TRAILING for the parse_*() functions at the API level?
>> (I realize that the convert_*() functions are built atop parse_*(),
>> but that's an implementation detail.)
>
> Yes, I'd also thought about that change:
>
> * Make NUM_TRAILING private.
> * Make the current parse_*() functions private.
> * Add new parse_*(s, flags, result, endptr) functions that imply
> NUM_TRAILING (and maybe they should *require* a non-null endptr argument).
> * Change the convert_*() to not allow the NUM_TRAILING flag.
>
> This would add a little bit of code, so I didn't do it originally. But
> since you seem to like the idea too, I guess I will make the change.

The other option (as Junio also suggested) is to collapse this API
into just the four parse_*() functions. All of the call-sites you
converted in this series invoked convert_*(), but it's not clear that
having two sets of functions which do almost the same thing is much of
a win. If the default is for NUM_TRAILING to be off, and if NULL
'endptr' is allowed, then:

parse_ul(s, 10, &result, NULL);

doesn't seem particularly burdensome comp

Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-18 Thread Michael Haggerty
On 03/18/2015 07:27 PM, Eric Sunshine wrote:
> On Tuesday, March 17, 2015, Michael Haggerty  wrote:
>> Implement wrappers for strtol() and strtoul() that are safer and more
>> convenient to use.
>>
>> Signed-off-by: Michael Haggerty 
>> ---
>> diff --git a/numparse.c b/numparse.c
>> new file mode 100644
>> index 000..90b44ce
>> --- /dev/null
>> +++ b/numparse.c
>> @@ -0,0 +1,180 @@
>> +int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
>> +{
>> +   long l;
>> +   const char *end;
>> +   int err = 0;
>> +
>> +   err = parse_precheck(s, &flags);
>> +   if (err)
>> +   return err;
>> +   /*
>> +* Now let strtol() do the heavy lifting:
>> +*/
> 
> Perhaps use a /* one-line style comment */ to reduce vertical space
> consumption a bit, thus make it (very slightly) easier to run the eye
> over the code.

Yes, will change.

>> +   errno = 0;
>> +   l = strtol(s, (char **)&end, flags & NUM_BASE_MASK);
>> +   if (errno) {
>> +   if (errno == ERANGE) {
>> +   if (!(flags & NUM_SATURATE))
>> +   return -NUM_SATURATE;
>> +   } else {
>> +   return -NUM_OTHER_ERROR;
>> +   }
>> +   }
> 
> Would it reduce cognitive load slightly (and reduce vertical space
> consumption) to rephrase the conditionals as:
> 
> if (errno == ERANGE && !(flags & NUM_SATURATE))
> return -NUM_SATURATE;
> 
> if (errno && errno != ERANGE)
> return -NUM_OTHER_ERROR;
> 
> or something similar?

The most common case by far should be that errno is zero. The code as
written only needs one test for that case, whereas your code needs two
tests. I think it is worth compromising on code clarity a tiny bit to
avoid the extra test.

> More below.
> 
>> +   if (end == s)
>> +   return -NUM_NO_DIGITS;
>> +
>> +   if (*end && !(flags & NUM_TRAILING))
>> +   return -NUM_TRAILING;
>> +
>> +   /* Everything was OK */
>> +   *result = l;
>> +   if (endptr)
>> +   *endptr = (char *)end;
>> +   return 0;
>> +}
>> diff --git a/numparse.h b/numparse.h
>> new file mode 100644
>> index 000..4de5e10
>> --- /dev/null
>> +++ b/numparse.h
>> @@ -0,0 +1,207 @@
>> +#ifndef NUMPARSE_H
>> +#define NUMPARSE_H
>> +
>> +/*
>> + * Functions for parsing integral numbers.
>> + *
>> + * Examples:
>> + *
>> + * - Convert s into a signed int, interpreting prefix "0x" to mean
>> + *   hexadecimal and "0" to mean octal. If the value doesn't fit in an
>> + *   unsigned int, set result to INT_MIN or INT_MAX.
> 
> Did you mean s/unsigned int/signed int/ ?

Thanks for catching this. I will fix it.

>> + *
>> + * if (convert_i(s, NUM_SLOPPY, &result))
>> + * die("...");
>> + */
>> +
>> +/*
>> + * The lowest 6 bits of flags hold the numerical base that should be
>> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
>> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
>> + * detected automatically from the string's prefix.
> 
> Does this restriction go against the goal of making these functions
> convenient, even while remaining strict? Is there a strong reason for
> not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
> make it consistent with strto*l() without (I think) introducing any
> ambiguity.

I thought about doing this. If it were possible to eliminate
NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
namely, that an "0x" prefix, if present, is consumed. So

parse_i("0xff", 16 | NUM_BASE_SPECIFIER, &result, &endptr)

gives result==255 and endptr==s+4, whereas

parse_i("0xff", 16, &result, &endptr)

gives result==0 and entptr==s+1 (it treats the "x" as the end of the
string).

We could forgo that feature, effectively allowing a base specifier if
and only if base==0. But I didn't want to rule out allowing an optional
base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
dispensed with entirely.

If you agree with that, then the remaining question is: which policy is
less error-prone? My thinking was that forcing the caller to specify
NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
reminder that the two features are intertwined. Because another
imaginable policy (arguably more consistent with the policy for base!=0)
would be that

convert_i(s, 0, &result)

, because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
prefix, and therefore indirectly only allows base-10 numbers.

But I don't feel strongly about this.

>> + */
>> +/*
>> + * Number parsing functions:
>> + *
>> + * The following functions parse a number (long, unsigned long, int,
>> + * or unsigned int respectively) from the front of s, storing the
>> + * value to *result and storing a pointer to the first character afte

Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-18 Thread Eric Sunshine
On Tuesday, March 17, 2015, Michael Haggerty  wrote:
> Implement wrappers for strtol() and strtoul() that are safer and more
> convenient to use.
>
> Signed-off-by: Michael Haggerty 
> ---
> diff --git a/numparse.c b/numparse.c
> new file mode 100644
> index 000..90b44ce
> --- /dev/null
> +++ b/numparse.c
> @@ -0,0 +1,180 @@
> +int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
> +{
> +   long l;
> +   const char *end;
> +   int err = 0;
> +
> +   err = parse_precheck(s, &flags);
> +   if (err)
> +   return err;
> +   /*
> +* Now let strtol() do the heavy lifting:
> +*/

Perhaps use a /* one-line style comment */ to reduce vertical space
consumption a bit, thus make it (very slightly) easier to run the eye
over the code.

> +   errno = 0;
> +   l = strtol(s, (char **)&end, flags & NUM_BASE_MASK);
> +   if (errno) {
> +   if (errno == ERANGE) {
> +   if (!(flags & NUM_SATURATE))
> +   return -NUM_SATURATE;
> +   } else {
> +   return -NUM_OTHER_ERROR;
> +   }
> +   }

Would it reduce cognitive load slightly (and reduce vertical space
consumption) to rephrase the conditionals as:

if (errno == ERANGE && !(flags & NUM_SATURATE))
return -NUM_SATURATE;

if (errno && errno != ERANGE)
return -NUM_OTHER_ERROR;

or something similar?

More below.

> +   if (end == s)
> +   return -NUM_NO_DIGITS;
> +
> +   if (*end && !(flags & NUM_TRAILING))
> +   return -NUM_TRAILING;
> +
> +   /* Everything was OK */
> +   *result = l;
> +   if (endptr)
> +   *endptr = (char *)end;
> +   return 0;
> +}
> diff --git a/numparse.h b/numparse.h
> new file mode 100644
> index 000..4de5e10
> --- /dev/null
> +++ b/numparse.h
> @@ -0,0 +1,207 @@
> +#ifndef NUMPARSE_H
> +#define NUMPARSE_H
> +
> +/*
> + * Functions for parsing integral numbers.
> + *
> + * Examples:
> + *
> + * - Convert s into a signed int, interpreting prefix "0x" to mean
> + *   hexadecimal and "0" to mean octal. If the value doesn't fit in an
> + *   unsigned int, set result to INT_MIN or INT_MAX.

Did you mean s/unsigned int/signed int/ ?

> + *
> + * if (convert_i(s, NUM_SLOPPY, &result))
> + * die("...");
> + */
> +
> +/*
> + * The lowest 6 bits of flags hold the numerical base that should be
> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
> + * detected automatically from the string's prefix.

Does this restriction go against the goal of making these functions
convenient, even while remaining strict? Is there a strong reason for
not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
make it consistent with strto*l() without (I think) introducing any
ambiguity.

> + */
> +/*
> + * Number parsing functions:
> + *
> + * The following functions parse a number (long, unsigned long, int,
> + * or unsigned int respectively) from the front of s, storing the
> + * value to *result and storing a pointer to the first character after
> + * the number to *endptr. flags specifies how the number should be
> + * parsed, including which base should be used. flags is a combination
> + * of the numerical base (2-36) and the NUM_* constants above (see).

"(see)" what?

> + * Return 0 on success or a negative value if there was an error. On
> + * failure, *result and *entptr are left unchanged.
> + *
> + * Please note that if NUM_TRAILING is not set, then it is
> + * nevertheless an error if there are any characters between the end
> + * of the number and the end of the string.

Again, on the subject of convenience, why this restriction? The stated
purpose of the parse_*() functions is to parse the number from the
front of the string and return a pointer to the first non-numeric
character following. As  a reader of this API, I interpret that as
meaning that NUM_TRAILING is implied. Is there a strong reason for not
inferring NUM_TRAILING for the parse_*() functions at the API level?
(I realize that the convert_*() functions are built atop parse_*(),
but that's an implementation detail.)

> + */
> +
> +int parse_l(const char *s, unsigned int flags,
> +   long *result, char **endptr);

Do we want to perpetuate the ugly (char **) convention for 'endptr'
from strto*l()? Considering that the incoming string is const, it
seems undesirable to return a non-const pointer to some place inside
that string.

> +/*
> + * Number conversion functions:
> + *
> + * The following functions parse a string into a number. They are
> + * identical to the parse_*() functions above, except that the endptr
> + * is not returned. These are most useful when parsing a whole string
> + * into a number; i.e., when (flags & NUM_TRAILING) is unset.

I can formulate arguments for allowing or disallowin

[PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-17 Thread Michael Haggerty
Implement wrappers for strtol() and strtoul() that are safer and more
convenient to use.

Signed-off-by: Michael Haggerty 
---
 Makefile   |   1 +
 numparse.c | 180 +
 numparse.h | 207 +
 3 files changed, 388 insertions(+)
 create mode 100644 numparse.c
 create mode 100644 numparse.h

diff --git a/Makefile b/Makefile
index 44f1dd1..6c0cfcc 100644
--- a/Makefile
+++ b/Makefile
@@ -732,6 +732,7 @@ LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
+LIB_OBJS += numparse.o
 LIB_OBJS += object.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-bitmap-write.o
diff --git a/numparse.c b/numparse.c
new file mode 100644
index 000..90b44ce
--- /dev/null
+++ b/numparse.c
@@ -0,0 +1,180 @@
+#include "git-compat-util.h"
+#include "numparse.h"
+
+#define NUM_NEGATIVE (1 << 16)
+
+
+static int parse_precheck(const char *s, unsigned int *flags)
+{
+   const char *number;
+
+   if (isspace(*s)) {
+   if (!(*flags & NUM_LEADING_WHITESPACE))
+   return -NUM_LEADING_WHITESPACE;
+   do {
+   s++;
+   } while (isspace(*s));
+   }
+
+   if (*s == '+') {
+   if (!(*flags & NUM_PLUS))
+   return -NUM_PLUS;
+   number = s + 1;
+   *flags &= ~NUM_NEGATIVE;
+   } else if (*s == '-') {
+   if (!(*flags & NUM_MINUS))
+   return -NUM_MINUS;
+   number = s + 1;
+   *flags |= NUM_NEGATIVE;
+   } else {
+   number = s;
+   *flags &= ~NUM_NEGATIVE;
+   }
+
+   if (!(*flags & NUM_BASE_SPECIFIER)) {
+   int base = *flags & NUM_BASE_MASK;
+   if (base == 0) {
+   /* This is a pointless combination of options. */
+   die("BUG: base=0 specified without NUM_BASE_SPECIFIER");
+   } else if (base == 16 && starts_with(number, "0x")) {
+   /*
+* We want to treat this as zero terminated by
+* an 'x', whereas strtol()/strtoul() would
+* silently eat the "0x". We accomplish this
+* by treating it as a base 10 number:
+*/
+   *flags = (*flags & ~NUM_BASE_MASK) | 10;
+   }
+   }
+   return 0;
+}
+
+int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
+{
+   long l;
+   const char *end;
+   int err = 0;
+
+   err = parse_precheck(s, &flags);
+   if (err)
+   return err;
+
+   /*
+* Now let strtol() do the heavy lifting:
+*/
+   errno = 0;
+   l = strtol(s, (char **)&end, flags & NUM_BASE_MASK);
+   if (errno) {
+   if (errno == ERANGE) {
+   if (!(flags & NUM_SATURATE))
+   return -NUM_SATURATE;
+   } else {
+   return -NUM_OTHER_ERROR;
+   }
+   }
+   if (end == s)
+   return -NUM_NO_DIGITS;
+
+   if (*end && !(flags & NUM_TRAILING))
+   return -NUM_TRAILING;
+
+   /* Everything was OK */
+   *result = l;
+   if (endptr)
+   *endptr = (char *)end;
+   return 0;
+}
+
+int parse_ul(const char *s, unsigned int flags,
+unsigned long *result, char **endptr)
+{
+   unsigned long ul;
+   const char *end;
+   int err = 0;
+
+   err = parse_precheck(s, &flags);
+   if (err)
+   return err;
+
+   /*
+* Now let strtoul() do the heavy lifting:
+*/
+   errno = 0;
+   ul = strtoul(s, (char **)&end, flags & NUM_BASE_MASK);
+   if (errno) {
+   if (errno == ERANGE) {
+   if (!(flags & NUM_SATURATE))
+   return -NUM_SATURATE;
+   } else {
+   return -NUM_OTHER_ERROR;
+   }
+   }
+   if (end == s)
+   return -NUM_NO_DIGITS;
+
+   /*
+* strtoul(), perversely, accepts negative numbers, converting
+* them to the positive number with the same bit pattern. We
+* don't ever want that.
+*/
+   if ((flags & NUM_NEGATIVE) && ul) {
+   if (!(flags & NUM_SATURATE))
+   return -NUM_SATURATE;
+   ul = 0;
+   }
+
+   if (*end && !(flags & NUM_TRAILING))
+   return -NUM_TRAILING;
+
+   /* Everything was OK */
+   *result = ul;
+   if (endptr)
+   *endptr = (char *)end;
+   return 0;
+}
+
+int parse_i(const char *s, unsigned int flags, int *result, char **endptr)
+{
+   long l;
+   int err;
+   char *end;
+
+   err = parse_l(s, flags, &l, &end);
+