Re: [PATCH 01/14] numparse: new module for parsing integral numbers
Michael Haggerty mhag...@alum.mit.edu writes: +static int parse_precheck(const char *s, unsigned int *flags) +{ + const char *number; + + if (isspace(*s)) { + if (!(*flags NUM_LEADING_WHITESPACE)) + return -NUM_LEADING_WHITESPACE; + do { + s++; + } while (isspace(*s)); + } + + if (*s == '+') { + if (!(*flags NUM_PLUS)) + return -NUM_PLUS; + number = s + 1; + *flags = ~NUM_NEGATIVE; + } else if (*s == '-') { + if (!(*flags NUM_MINUS)) + return -NUM_MINUS; + number = s + 1; + *flags |= NUM_NEGATIVE; + } else { + number = s; + *flags = ~NUM_NEGATIVE; + } + + if (!(*flags NUM_BASE_SPECIFIER)) { + int base = *flags NUM_BASE_MASK; + if (base == 0) { + /* This is a pointless combination of options. */ + die(BUG: base=0 specified without NUM_BASE_SPECIFIER); + } else if (base == 16 starts_with(number, 0x)) { + /* + * We want to treat this as zero terminated by + * an 'x', whereas strtol()/strtoul() would + * silently eat the 0x. We accomplish this + * by treating it as a base 10 number: + */ + *flags = (*flags ~NUM_BASE_MASK) | 10; + } + } + return 0; +} I somehow feel that a pre-processing that only _inspects_ part of the string, without munging that string (e.g. notice '-' but feed that to underlying strtol(3)) somewhat a brittle approach. When I read the above for the first time, I somehow expected that the code would notice leading '-', strip that leading '-' and remember the fact that it did so in the *flags, let the strtol(3) to parse the remainder _and_ always make sure the returned result is not negative (because that would imply that the original input had two leading minuses and digits), and give the sign based on what this preprocess found out in *flags, and then seeing that there is no sign of such processing in the caller I scratched my head. I still have not convinced myself that what I am seeing in the base==16 part in the above is correct. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/14] numparse: new module for parsing integral numbers
On Wed, Mar 18, 2015 at 6:47 PM, Michael Haggerty mhag...@alum.mit.edu wrote: On 03/18/2015 07:27 PM, Eric Sunshine wrote: On Tuesday, March 17, 2015, Michael Haggerty mhag...@alum.mit.edu wrote: Implement wrappers for strtol() and strtoul() that are safer and more convenient to use. + * The lowest 6 bits of flags hold the numerical base that should be + * used to parse the number, 2 = base = 36. If base is set to 0, + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is + * detected automatically from the string's prefix. Does this restriction go against the goal of making these functions convenient, even while remaining strict? Is there a strong reason for not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would make it consistent with strto*l() without (I think) introducing any ambiguity. I thought about doing this. If it were possible to eliminate NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a good change. But NUM_BASE_SPECIFIER also has an effect when base==16; namely, that an 0x prefix, if present, is consumed. So parse_i(0xff, 16 | NUM_BASE_SPECIFIER, result, endptr) gives result==255 and endptr==s+4, whereas parse_i(0xff, 16, result, endptr) gives result==0 and entptr==s+1 (it treats the x as the end of the string). We could forgo that feature, effectively allowing a base specifier if and only if base==0. But I didn't want to rule out allowing an optional base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be dispensed with entirely. Making base==0 a special case doesn't have to mean ruling out or eliminating NUM_BASE_SPECIFIER for the base==16 case. However, a special case base==0 would make the API a bit non-orthogonal and require extra documentation. But for those familiar with how strto*l() treats base==0 specially, the rule of least surprise may apply. If you agree with that, then the remaining question is: which policy is less error-prone? My thinking was that forcing the caller to specify NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a reminder that the two features are intertwined. Since base==0 would unambiguously imply NUM_BASE_SPECIFIER, being able to tersely say convert_i(s, 0, result) would be a win from a convenience perspective. However... Because another imaginable policy (arguably more consistent with the policy for base!=0) would be that convert_i(s, 0, result) , because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base prefix, and therefore indirectly only allows base-10 numbers. But I don't feel strongly about this. I don't feel strongly about it either, and can formulate arguments either way. Assuming you stick with your current design, if the strictness of having to specify NUM_BASE_SPECIFIER for base==0 proves too burdensome, it can be loosened later by making base==0 a special case. On the other hand, if you go with Junio's suggestion of choosing names for these functions which more closely mirror strto*l() names, then the rule of least surprise might suggest that a special case for base==0 has merit. + * Return 0 on success or a negative value if there was an error. On + * failure, *result and *entptr are left unchanged. + * + * Please note that if NUM_TRAILING is not set, then it is + * nevertheless an error if there are any characters between the end + * of the number and the end of the string. Again, on the subject of convenience, why this restriction? The stated purpose of the parse_*() functions is to parse the number from the front of the string and return a pointer to the first non-numeric character following. As a reader of this API, I interpret that as meaning that NUM_TRAILING is implied. Is there a strong reason for not inferring NUM_TRAILING for the parse_*() functions at the API level? (I realize that the convert_*() functions are built atop parse_*(), but that's an implementation detail.) Yes, I'd also thought about that change: * Make NUM_TRAILING private. * Make the current parse_*() functions private. * Add new parse_*(s, flags, result, endptr) functions that imply NUM_TRAILING (and maybe they should *require* a non-null endptr argument). * Change the convert_*() to not allow the NUM_TRAILING flag. This would add a little bit of code, so I didn't do it originally. But since you seem to like the idea too, I guess I will make the change. The other option (as Junio also suggested) is to collapse this API into just the four parse_*() functions. All of the call-sites you converted in this series invoked convert_*(), but it's not clear that having two sets of functions which do almost the same thing is much of a win. If the default is for NUM_TRAILING to be off, and if NULL 'endptr' is allowed, then: parse_ul(s, 10, result, NULL); doesn't seem particularly burdensome compared with: convert_ul(s, 10, result); A more compact API, with only the parse_*()
Re: [PATCH 01/14] numparse: new module for parsing integral numbers
On 03/18/2015 07:27 PM, Eric Sunshine wrote: On Tuesday, March 17, 2015, Michael Haggerty mhag...@alum.mit.edu wrote: Implement wrappers for strtol() and strtoul() that are safer and more convenient to use. Signed-off-by: Michael Haggerty mhag...@alum.mit.edu --- diff --git a/numparse.c b/numparse.c new file mode 100644 index 000..90b44ce --- /dev/null +++ b/numparse.c @@ -0,0 +1,180 @@ +int parse_l(const char *s, unsigned int flags, long *result, char **endptr) +{ + long l; + const char *end; + int err = 0; + + err = parse_precheck(s, flags); + if (err) + return err; + /* +* Now let strtol() do the heavy lifting: +*/ Perhaps use a /* one-line style comment */ to reduce vertical space consumption a bit, thus make it (very slightly) easier to run the eye over the code. Yes, will change. + errno = 0; + l = strtol(s, (char **)end, flags NUM_BASE_MASK); + if (errno) { + if (errno == ERANGE) { + if (!(flags NUM_SATURATE)) + return -NUM_SATURATE; + } else { + return -NUM_OTHER_ERROR; + } + } Would it reduce cognitive load slightly (and reduce vertical space consumption) to rephrase the conditionals as: if (errno == ERANGE !(flags NUM_SATURATE)) return -NUM_SATURATE; if (errno errno != ERANGE) return -NUM_OTHER_ERROR; or something similar? The most common case by far should be that errno is zero. The code as written only needs one test for that case, whereas your code needs two tests. I think it is worth compromising on code clarity a tiny bit to avoid the extra test. More below. + if (end == s) + return -NUM_NO_DIGITS; + + if (*end !(flags NUM_TRAILING)) + return -NUM_TRAILING; + + /* Everything was OK */ + *result = l; + if (endptr) + *endptr = (char *)end; + return 0; +} diff --git a/numparse.h b/numparse.h new file mode 100644 index 000..4de5e10 --- /dev/null +++ b/numparse.h @@ -0,0 +1,207 @@ +#ifndef NUMPARSE_H +#define NUMPARSE_H + +/* + * Functions for parsing integral numbers. + * + * Examples: + * + * - Convert s into a signed int, interpreting prefix 0x to mean + * hexadecimal and 0 to mean octal. If the value doesn't fit in an + * unsigned int, set result to INT_MIN or INT_MAX. Did you mean s/unsigned int/signed int/ ? Thanks for catching this. I will fix it. + * + * if (convert_i(s, NUM_SLOPPY, result)) + * die(...); + */ + +/* + * The lowest 6 bits of flags hold the numerical base that should be + * used to parse the number, 2 = base = 36. If base is set to 0, + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is + * detected automatically from the string's prefix. Does this restriction go against the goal of making these functions convenient, even while remaining strict? Is there a strong reason for not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would make it consistent with strto*l() without (I think) introducing any ambiguity. I thought about doing this. If it were possible to eliminate NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a good change. But NUM_BASE_SPECIFIER also has an effect when base==16; namely, that an 0x prefix, if present, is consumed. So parse_i(0xff, 16 | NUM_BASE_SPECIFIER, result, endptr) gives result==255 and endptr==s+4, whereas parse_i(0xff, 16, result, endptr) gives result==0 and entptr==s+1 (it treats the x as the end of the string). We could forgo that feature, effectively allowing a base specifier if and only if base==0. But I didn't want to rule out allowing an optional base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be dispensed with entirely. If you agree with that, then the remaining question is: which policy is less error-prone? My thinking was that forcing the caller to specify NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a reminder that the two features are intertwined. Because another imaginable policy (arguably more consistent with the policy for base!=0) would be that convert_i(s, 0, result) , because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base prefix, and therefore indirectly only allows base-10 numbers. But I don't feel strongly about this. + */ +/* + * Number parsing functions: + * + * The following functions parse a number (long, unsigned long, int, + * or unsigned int respectively) from the front of s, storing the + * value to *result and storing a pointer to the first character after + * the number to *endptr. flags specifies how the number should be + * parsed, including which base should be used. flags is a combination + * of the numerical base (2-36)
Re: [PATCH 01/14] numparse: new module for parsing integral numbers
On Tuesday, March 17, 2015, Michael Haggerty mhag...@alum.mit.edu wrote: Implement wrappers for strtol() and strtoul() that are safer and more convenient to use. Signed-off-by: Michael Haggerty mhag...@alum.mit.edu --- diff --git a/numparse.c b/numparse.c new file mode 100644 index 000..90b44ce --- /dev/null +++ b/numparse.c @@ -0,0 +1,180 @@ +int parse_l(const char *s, unsigned int flags, long *result, char **endptr) +{ + long l; + const char *end; + int err = 0; + + err = parse_precheck(s, flags); + if (err) + return err; + /* +* Now let strtol() do the heavy lifting: +*/ Perhaps use a /* one-line style comment */ to reduce vertical space consumption a bit, thus make it (very slightly) easier to run the eye over the code. + errno = 0; + l = strtol(s, (char **)end, flags NUM_BASE_MASK); + if (errno) { + if (errno == ERANGE) { + if (!(flags NUM_SATURATE)) + return -NUM_SATURATE; + } else { + return -NUM_OTHER_ERROR; + } + } Would it reduce cognitive load slightly (and reduce vertical space consumption) to rephrase the conditionals as: if (errno == ERANGE !(flags NUM_SATURATE)) return -NUM_SATURATE; if (errno errno != ERANGE) return -NUM_OTHER_ERROR; or something similar? More below. + if (end == s) + return -NUM_NO_DIGITS; + + if (*end !(flags NUM_TRAILING)) + return -NUM_TRAILING; + + /* Everything was OK */ + *result = l; + if (endptr) + *endptr = (char *)end; + return 0; +} diff --git a/numparse.h b/numparse.h new file mode 100644 index 000..4de5e10 --- /dev/null +++ b/numparse.h @@ -0,0 +1,207 @@ +#ifndef NUMPARSE_H +#define NUMPARSE_H + +/* + * Functions for parsing integral numbers. + * + * Examples: + * + * - Convert s into a signed int, interpreting prefix 0x to mean + * hexadecimal and 0 to mean octal. If the value doesn't fit in an + * unsigned int, set result to INT_MIN or INT_MAX. Did you mean s/unsigned int/signed int/ ? + * + * if (convert_i(s, NUM_SLOPPY, result)) + * die(...); + */ + +/* + * The lowest 6 bits of flags hold the numerical base that should be + * used to parse the number, 2 = base = 36. If base is set to 0, + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is + * detected automatically from the string's prefix. Does this restriction go against the goal of making these functions convenient, even while remaining strict? Is there a strong reason for not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would make it consistent with strto*l() without (I think) introducing any ambiguity. + */ +/* + * Number parsing functions: + * + * The following functions parse a number (long, unsigned long, int, + * or unsigned int respectively) from the front of s, storing the + * value to *result and storing a pointer to the first character after + * the number to *endptr. flags specifies how the number should be + * parsed, including which base should be used. flags is a combination + * of the numerical base (2-36) and the NUM_* constants above (see). (see) what? + * Return 0 on success or a negative value if there was an error. On + * failure, *result and *entptr are left unchanged. + * + * Please note that if NUM_TRAILING is not set, then it is + * nevertheless an error if there are any characters between the end + * of the number and the end of the string. Again, on the subject of convenience, why this restriction? The stated purpose of the parse_*() functions is to parse the number from the front of the string and return a pointer to the first non-numeric character following. As a reader of this API, I interpret that as meaning that NUM_TRAILING is implied. Is there a strong reason for not inferring NUM_TRAILING for the parse_*() functions at the API level? (I realize that the convert_*() functions are built atop parse_*(), but that's an implementation detail.) + */ + +int parse_l(const char *s, unsigned int flags, + long *result, char **endptr); Do we want to perpetuate the ugly (char **) convention for 'endptr' from strto*l()? Considering that the incoming string is const, it seems undesirable to return a non-const pointer to some place inside that string. +/* + * Number conversion functions: + * + * The following functions parse a string into a number. They are + * identical to the parse_*() functions above, except that the endptr + * is not returned. These are most useful when parsing a whole string + * into a number; i.e., when (flags NUM_TRAILING) is unset. I can formulate arguments for allowing or disallowing NUM_TRAILING with convert_*(), however, given their purpose of parsing the