Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-20 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 +static int parse_precheck(const char *s, unsigned int *flags)
 +{
 + const char *number;
 +
 + if (isspace(*s)) {
 + if (!(*flags  NUM_LEADING_WHITESPACE))
 + return -NUM_LEADING_WHITESPACE;
 + do {
 + s++;
 + } while (isspace(*s));
 + }
 +
 + if (*s == '+') {
 + if (!(*flags  NUM_PLUS))
 + return -NUM_PLUS;
 + number = s + 1;
 + *flags = ~NUM_NEGATIVE;
 + } else if (*s == '-') {
 + if (!(*flags  NUM_MINUS))
 + return -NUM_MINUS;
 + number = s + 1;
 + *flags |= NUM_NEGATIVE;
 + } else {
 + number = s;
 + *flags = ~NUM_NEGATIVE;
 + }
 +
 + if (!(*flags  NUM_BASE_SPECIFIER)) {
 + int base = *flags  NUM_BASE_MASK;
 + if (base == 0) {
 + /* This is a pointless combination of options. */
 + die(BUG: base=0 specified without NUM_BASE_SPECIFIER);
 + } else if (base == 16  starts_with(number, 0x)) {
 + /*
 +  * We want to treat this as zero terminated by
 +  * an 'x', whereas strtol()/strtoul() would
 +  * silently eat the 0x. We accomplish this
 +  * by treating it as a base 10 number:
 +  */
 + *flags = (*flags  ~NUM_BASE_MASK) | 10;
 + }
 + }
 + return 0;
 +}

I somehow feel that a pre-processing that only _inspects_ part of
the string, without munging that string (e.g. notice '-' but feed
that to underlying strtol(3)) somewhat a brittle approach.  When I
read the above for the first time, I somehow expected that the code
would notice leading '-', strip that leading '-' and remember the
fact that it did so in the *flags, let the strtol(3) to parse the
remainder _and_ always make sure the returned result is not negative
(because that would imply that the original input had two leading
minuses and digits), and give the sign based on what this preprocess
found out in *flags, and then seeing that there is no sign of such
processing in the caller I scratched my head.

I still have not convinced myself that what I am seeing in the
base==16 part in the above is correct.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-20 Thread Eric Sunshine
On Wed, Mar 18, 2015 at 6:47 PM, Michael Haggerty mhag...@alum.mit.edu wrote:
 On 03/18/2015 07:27 PM, Eric Sunshine wrote:
 On Tuesday, March 17, 2015, Michael Haggerty mhag...@alum.mit.edu wrote:
 Implement wrappers for strtol() and strtoul() that are safer and more
 convenient to use.
 + * The lowest 6 bits of flags hold the numerical base that should be
 + * used to parse the number, 2 = base = 36. If base is set to 0,
 + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
 + * detected automatically from the string's prefix.

 Does this restriction go against the goal of making these functions
 convenient, even while remaining strict? Is there a strong reason for
 not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
 make it consistent with strto*l() without (I think) introducing any
 ambiguity.

 I thought about doing this. If it were possible to eliminate
 NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
 good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
 namely, that an 0x prefix, if present, is consumed. So

 parse_i(0xff, 16 | NUM_BASE_SPECIFIER, result, endptr)

 gives result==255 and endptr==s+4, whereas

 parse_i(0xff, 16, result, endptr)

 gives result==0 and entptr==s+1 (it treats the x as the end of the
 string).

 We could forgo that feature, effectively allowing a base specifier if
 and only if base==0. But I didn't want to rule out allowing an optional
 base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
 dispensed with entirely.

Making base==0 a special case doesn't have to mean ruling out or
eliminating NUM_BASE_SPECIFIER for the base==16 case. However, a
special case base==0 would make the API a bit non-orthogonal and
require extra documentation. But for those familiar with how strto*l()
treats base==0 specially, the rule of least surprise may apply.

 If you agree with that, then the remaining question is: which policy is
 less error-prone? My thinking was that forcing the caller to specify
 NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
 reminder that the two features are intertwined.

Since base==0 would unambiguously imply NUM_BASE_SPECIFIER, being able
to tersely say convert_i(s, 0, result) would be a win from a
convenience perspective. However...

 Because another
 imaginable policy (arguably more consistent with the policy for base!=0)
 would be that

 convert_i(s, 0, result)

 , because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
 prefix, and therefore indirectly only allows base-10 numbers.

 But I don't feel strongly about this.

I don't feel strongly about it either, and can formulate arguments
either way. Assuming you stick with your current design, if the
strictness of having to specify NUM_BASE_SPECIFIER for base==0 proves
too burdensome, it can be loosened later by making base==0 a special
case.

On the other hand, if you go with Junio's suggestion of choosing names
for these functions which more closely mirror strto*l() names, then
the rule of least surprise might suggest that a special case for
base==0 has merit.

 + * Return 0 on success or a negative value if there was an error. On
 + * failure, *result and *entptr are left unchanged.
 + *
 + * Please note that if NUM_TRAILING is not set, then it is
 + * nevertheless an error if there are any characters between the end
 + * of the number and the end of the string.

 Again, on the subject of convenience, why this restriction? The stated
 purpose of the parse_*() functions is to parse the number from the
 front of the string and return a pointer to the first non-numeric
 character following. As  a reader of this API, I interpret that as
 meaning that NUM_TRAILING is implied. Is there a strong reason for not
 inferring NUM_TRAILING for the parse_*() functions at the API level?
 (I realize that the convert_*() functions are built atop parse_*(),
 but that's an implementation detail.)

 Yes, I'd also thought about that change:

 * Make NUM_TRAILING private.
 * Make the current parse_*() functions private.
 * Add new parse_*(s, flags, result, endptr) functions that imply
 NUM_TRAILING (and maybe they should *require* a non-null endptr argument).
 * Change the convert_*() to not allow the NUM_TRAILING flag.

 This would add a little bit of code, so I didn't do it originally. But
 since you seem to like the idea too, I guess I will make the change.

The other option (as Junio also suggested) is to collapse this API
into just the four parse_*() functions. All of the call-sites you
converted in this series invoked convert_*(), but it's not clear that
having two sets of functions which do almost the same thing is much of
a win. If the default is for NUM_TRAILING to be off, and if NULL
'endptr' is allowed, then:

parse_ul(s, 10, result, NULL);

doesn't seem particularly burdensome compared with:

convert_ul(s, 10, result);

A more compact API, with only the parse_*() 

Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-18 Thread Michael Haggerty
On 03/18/2015 07:27 PM, Eric Sunshine wrote:
 On Tuesday, March 17, 2015, Michael Haggerty mhag...@alum.mit.edu wrote:
 Implement wrappers for strtol() and strtoul() that are safer and more
 convenient to use.

 Signed-off-by: Michael Haggerty mhag...@alum.mit.edu
 ---
 diff --git a/numparse.c b/numparse.c
 new file mode 100644
 index 000..90b44ce
 --- /dev/null
 +++ b/numparse.c
 @@ -0,0 +1,180 @@
 +int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
 +{
 +   long l;
 +   const char *end;
 +   int err = 0;
 +
 +   err = parse_precheck(s, flags);
 +   if (err)
 +   return err;
 +   /*
 +* Now let strtol() do the heavy lifting:
 +*/
 
 Perhaps use a /* one-line style comment */ to reduce vertical space
 consumption a bit, thus make it (very slightly) easier to run the eye
 over the code.

Yes, will change.

 +   errno = 0;
 +   l = strtol(s, (char **)end, flags  NUM_BASE_MASK);
 +   if (errno) {
 +   if (errno == ERANGE) {
 +   if (!(flags  NUM_SATURATE))
 +   return -NUM_SATURATE;
 +   } else {
 +   return -NUM_OTHER_ERROR;
 +   }
 +   }
 
 Would it reduce cognitive load slightly (and reduce vertical space
 consumption) to rephrase the conditionals as:
 
 if (errno == ERANGE  !(flags  NUM_SATURATE))
 return -NUM_SATURATE;
 
 if (errno  errno != ERANGE)
 return -NUM_OTHER_ERROR;
 
 or something similar?

The most common case by far should be that errno is zero. The code as
written only needs one test for that case, whereas your code needs two
tests. I think it is worth compromising on code clarity a tiny bit to
avoid the extra test.

 More below.
 
 +   if (end == s)
 +   return -NUM_NO_DIGITS;
 +
 +   if (*end  !(flags  NUM_TRAILING))
 +   return -NUM_TRAILING;
 +
 +   /* Everything was OK */
 +   *result = l;
 +   if (endptr)
 +   *endptr = (char *)end;
 +   return 0;
 +}
 diff --git a/numparse.h b/numparse.h
 new file mode 100644
 index 000..4de5e10
 --- /dev/null
 +++ b/numparse.h
 @@ -0,0 +1,207 @@
 +#ifndef NUMPARSE_H
 +#define NUMPARSE_H
 +
 +/*
 + * Functions for parsing integral numbers.
 + *
 + * Examples:
 + *
 + * - Convert s into a signed int, interpreting prefix 0x to mean
 + *   hexadecimal and 0 to mean octal. If the value doesn't fit in an
 + *   unsigned int, set result to INT_MIN or INT_MAX.
 
 Did you mean s/unsigned int/signed int/ ?

Thanks for catching this. I will fix it.

 + *
 + * if (convert_i(s, NUM_SLOPPY, result))
 + * die(...);
 + */
 +
 +/*
 + * The lowest 6 bits of flags hold the numerical base that should be
 + * used to parse the number, 2 = base = 36. If base is set to 0,
 + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
 + * detected automatically from the string's prefix.
 
 Does this restriction go against the goal of making these functions
 convenient, even while remaining strict? Is there a strong reason for
 not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
 make it consistent with strto*l() without (I think) introducing any
 ambiguity.

I thought about doing this. If it were possible to eliminate
NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
namely, that an 0x prefix, if present, is consumed. So

parse_i(0xff, 16 | NUM_BASE_SPECIFIER, result, endptr)

gives result==255 and endptr==s+4, whereas

parse_i(0xff, 16, result, endptr)

gives result==0 and entptr==s+1 (it treats the x as the end of the
string).

We could forgo that feature, effectively allowing a base specifier if
and only if base==0. But I didn't want to rule out allowing an optional
base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
dispensed with entirely.

If you agree with that, then the remaining question is: which policy is
less error-prone? My thinking was that forcing the caller to specify
NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
reminder that the two features are intertwined. Because another
imaginable policy (arguably more consistent with the policy for base!=0)
would be that

convert_i(s, 0, result)

, because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
prefix, and therefore indirectly only allows base-10 numbers.

But I don't feel strongly about this.

 + */
 +/*
 + * Number parsing functions:
 + *
 + * The following functions parse a number (long, unsigned long, int,
 + * or unsigned int respectively) from the front of s, storing the
 + * value to *result and storing a pointer to the first character after
 + * the number to *endptr. flags specifies how the number should be
 + * parsed, including which base should be used. flags is a combination
 + * of the numerical base (2-36) 

Re: [PATCH 01/14] numparse: new module for parsing integral numbers

2015-03-18 Thread Eric Sunshine
On Tuesday, March 17, 2015, Michael Haggerty mhag...@alum.mit.edu wrote:
 Implement wrappers for strtol() and strtoul() that are safer and more
 convenient to use.

 Signed-off-by: Michael Haggerty mhag...@alum.mit.edu
 ---
 diff --git a/numparse.c b/numparse.c
 new file mode 100644
 index 000..90b44ce
 --- /dev/null
 +++ b/numparse.c
 @@ -0,0 +1,180 @@
 +int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
 +{
 +   long l;
 +   const char *end;
 +   int err = 0;
 +
 +   err = parse_precheck(s, flags);
 +   if (err)
 +   return err;
 +   /*
 +* Now let strtol() do the heavy lifting:
 +*/

Perhaps use a /* one-line style comment */ to reduce vertical space
consumption a bit, thus make it (very slightly) easier to run the eye
over the code.

 +   errno = 0;
 +   l = strtol(s, (char **)end, flags  NUM_BASE_MASK);
 +   if (errno) {
 +   if (errno == ERANGE) {
 +   if (!(flags  NUM_SATURATE))
 +   return -NUM_SATURATE;
 +   } else {
 +   return -NUM_OTHER_ERROR;
 +   }
 +   }

Would it reduce cognitive load slightly (and reduce vertical space
consumption) to rephrase the conditionals as:

if (errno == ERANGE  !(flags  NUM_SATURATE))
return -NUM_SATURATE;

if (errno  errno != ERANGE)
return -NUM_OTHER_ERROR;

or something similar?

More below.

 +   if (end == s)
 +   return -NUM_NO_DIGITS;
 +
 +   if (*end  !(flags  NUM_TRAILING))
 +   return -NUM_TRAILING;
 +
 +   /* Everything was OK */
 +   *result = l;
 +   if (endptr)
 +   *endptr = (char *)end;
 +   return 0;
 +}
 diff --git a/numparse.h b/numparse.h
 new file mode 100644
 index 000..4de5e10
 --- /dev/null
 +++ b/numparse.h
 @@ -0,0 +1,207 @@
 +#ifndef NUMPARSE_H
 +#define NUMPARSE_H
 +
 +/*
 + * Functions for parsing integral numbers.
 + *
 + * Examples:
 + *
 + * - Convert s into a signed int, interpreting prefix 0x to mean
 + *   hexadecimal and 0 to mean octal. If the value doesn't fit in an
 + *   unsigned int, set result to INT_MIN or INT_MAX.

Did you mean s/unsigned int/signed int/ ?

 + *
 + * if (convert_i(s, NUM_SLOPPY, result))
 + * die(...);
 + */
 +
 +/*
 + * The lowest 6 bits of flags hold the numerical base that should be
 + * used to parse the number, 2 = base = 36. If base is set to 0,
 + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
 + * detected automatically from the string's prefix.

Does this restriction go against the goal of making these functions
convenient, even while remaining strict? Is there a strong reason for
not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
make it consistent with strto*l() without (I think) introducing any
ambiguity.

 + */
 +/*
 + * Number parsing functions:
 + *
 + * The following functions parse a number (long, unsigned long, int,
 + * or unsigned int respectively) from the front of s, storing the
 + * value to *result and storing a pointer to the first character after
 + * the number to *endptr. flags specifies how the number should be
 + * parsed, including which base should be used. flags is a combination
 + * of the numerical base (2-36) and the NUM_* constants above (see).

(see) what?

 + * Return 0 on success or a negative value if there was an error. On
 + * failure, *result and *entptr are left unchanged.
 + *
 + * Please note that if NUM_TRAILING is not set, then it is
 + * nevertheless an error if there are any characters between the end
 + * of the number and the end of the string.

Again, on the subject of convenience, why this restriction? The stated
purpose of the parse_*() functions is to parse the number from the
front of the string and return a pointer to the first non-numeric
character following. As  a reader of this API, I interpret that as
meaning that NUM_TRAILING is implied. Is there a strong reason for not
inferring NUM_TRAILING for the parse_*() functions at the API level?
(I realize that the convert_*() functions are built atop parse_*(),
but that's an implementation detail.)

 + */
 +
 +int parse_l(const char *s, unsigned int flags,
 +   long *result, char **endptr);

Do we want to perpetuate the ugly (char **) convention for 'endptr'
from strto*l()? Considering that the incoming string is const, it
seems undesirable to return a non-const pointer to some place inside
that string.

 +/*
 + * Number conversion functions:
 + *
 + * The following functions parse a string into a number. They are
 + * identical to the parse_*() functions above, except that the endptr
 + * is not returned. These are most useful when parsing a whole string
 + * into a number; i.e., when (flags  NUM_TRAILING) is unset.

I can formulate arguments for allowing or disallowing NUM_TRAILING
with convert_*(), however, given their purpose of parsing the