Eduardo Habkost <ehabk...@redhat.com> writes: > On Fri, Jan 18, 2013 at 11:01:14AM +0100, Markus Armbruster wrote: >> Eduardo Habkost <ehabk...@redhat.com> writes: >> >> > There are lots of duplicate parsing code using strto*() in QEMU, and >> > most of that code is broken in one way or another. Even the visitors >> > code have duplicate integer parsing code[1]. This introduces functions >> > to help parsing unsigned int values: parse_uint() and parse_uint_full(). >> > >> > Parsing functions for signed ints and floats will be submitted later. >> > >> > parse_uint_full() has all the checks made by opts_type_uint64() at >> > opts-visitor.c: >> > >> > - Check for NULL (returns -EINVAL) >> > - Check for negative numbers (returns -ERANGE) >> > - Check for empty string (returns -EINVAL) >> > - Check for overflow or other errno values set by strtoll() (returns >> > -errno) >> > - Check for end of string (reject invalid characters after number) >> > (returns -EINVAL) >> > >> > parse_uint() does everything above except checking for the end of the >> > string, so callers can continue parsing the remainder of string after >> > the number. >> > >> > Unit tests included. >> > >> > [1] string-input-visitor.c:parse_int() could use the same parsing code >> > used by opts-visitor.c:opts_type_int(), instead of duplicating that >> > logic. >> [...] >> > diff --git a/util/cutils.c b/util/cutils.c >> > index 80bb1dc..7f09251 100644 >> > --- a/util/cutils.c >> > +++ b/util/cutils.c >> > @@ -270,6 +270,85 @@ int64_t strtosz(const char *nptr, char **end) >> > return strtosz_suffix(nptr, end, STRTOSZ_DEFSUFFIX_MB); >> > } >> > >> > +/* Try to parse an unsigned integer >> > + * >> > + * Error checks done by the function: >> > + * - NULL pointer will return -EINVAL. >> > + * - Empty strings will return -EINVAL. >> >> Same for strings containing only whitespace. > > Oh, you're right. > >> >> > + * - Overflow errors or other errno values set by strtoull() will >> >> Extra space before 'set'. > > Oops. > >> >> > + * return -errno (-ERANGE in case of overflow). >> > + * - Differently from strtoull(), values starting with a minus sign are >> > + * rejected (returning -ERANGE). >> > + * >> > + * Sets endptr to point to the first invalid character. >> >> Mention that unlike strtol() & friends, endptr must not be null? >> Probably not necessary. > > I believe it is implicit if I document it as "Sets *endptr". But worth > documenting as it is different from strtoull() behavior. > >> >> The two ERANGE cases treat endptr differently: it's either set to point >> just behind the out-of-range number, or to point to the beginning of the >> out-of-range number. Ugh. > > This should be fixed in the newer version I sent. > >> >> > + Callers may rely >> > + * on *value and *endptr to be always set by the function, even in case of >> > + * errors. >> >> You neglect to specify what gets assigned to *value in error cases. >> > > Undefined. :-) > > Anyway, I agree it not very useful to document that "*value and *endptr > will be always set" if it is undefined. I will try to reword it.
Yes, please. I'm not fond of undefined behavior, unless there's a good reason. If you don't want to assign a defined value, consider not assigning anything. >> Your list of error checks isn't quite complete. Here's my try: >> >> /** >> * Parse unsigned integer >> * >> * @s: String to parse >> * @value: Destination for parsed integer value >> * @endptr: Destination for pointer to first character not consumed >> * @base: integer base, between 2 and 36 inclusive, or 0 >> * >> * Parsed syntax is just like strol()'s: arbitrary whitespace, a single >> * optional '+' or '-', an optional "0x" if @base is 0 or 16, one or >> * more digits. > > The newer version has '-' as _not_ part of the syntax. > >> * >> * If @s is null, or @base is invalid, or @s doesn't start with an >> * integer in the syntax above, set *@value to 0, *@endptr to @s, and >> * return -EINVAL. > > This matches the behavior of the newer version. But I would like to keep > *value documented as undefined in case of errors. > >> * >> * Set @endptr to point right beyond the parsed integer. >> * >> * If the integer is negative, set *@value to 0, and return -ERANGE. > > This isn't the case in the newer version. > >> * If the integer overflows unsigned long long, set *@value to >> * ULLONG_MAX, and return -ERANGE. >> * Else, set *@value to the parsed integer, and return 0. >> */ > > Thanks for taking the time to write this. I hate having to look up > what's the right syntax to use in docstrings, so I often just write > plain comments before functions. :-) I don't particularly like GLib-style doc-strings, but it's what we use when we use anything, which is rarely. > I have a minor problem with the documentation above: as a developer, the > most important question I have is: what's the difference between using > parse_uint() and using strtoull() directly? But maybe it is a good > thing: instead of referencing the strtoull() specification (and an > implementation detail), now we have a well-defined and well-documented > behavior. I understand why you're asking for the difference to stroull(). Trouble is strtoull() is complex, and often misunderstood. >> > + * >> > + * The 'base' parameter has the same meaning of 'base' on strtoull(). >> > + * >> > + * Returns 0 on success, negative errno value on error. >> > + */ >> > +int parse_uint(const char *s, unsigned long long *value, char **endptr, >> > + int base) >> > +{ >> > + int r = 0; >> > + char *endp = (char *)s; >> > + unsigned long long val = 0; >> > + >> > + if (!s) { >> > + r = -EINVAL; >> > + goto out; >> > + } >> > + >> > + /* make sure we reject negative numbers: */ >> > + while (isspace((unsigned char)*s)) { >> > + ++s; endp++; >> >> Mixing pre- and post-increment like that is ugly. Recommend to stick to >> post-increment. > > Agreed. I blame the copy & paste I made from opts-visitor. Later I added > the postfix increment without noticing how ugly it looked like > > Anyway, this was changed in the newer patch version. > >> >> I'd set endp after the loop. Right before goto. > > Fixed in the newer version. > >> >> Or simply change the specification to set *endptr to point beyond the >> integer instead of to the '-'. I took the liberty to do exactly that in >> my proposed function comment. > > The newer version was changed to return -EINVAL and set endptr to the > beginning of the string. > >> >> > + } >> > + if (*s == '-') { >> > + r = -ERANGE; >> > + goto out; >> > + } >> >> This creates the special case that made me go "ugh" above. Suggest to >> drop the if here, and check right after strtoull() instead, as shown >> below. That way, we get the same endptr behavior for all out-of-range >> numbers. > > Is returning -EINVAL acceptable for you, as well? In your proposal we > consider "-1234" part of the language but out-of-range. Returning > -EINVAL, we consider "-1234" not part of the language, and invalid > input. The newer version returns -EINVAL. I prefer -ERANGE on underflow, for symmetry with parsing *signed* integers. A similar parsing function for signed integers would surely assign LLONG_MIN and return -ERANGE then. A secondary, weak argument: caller can more easily distinguish the failure modes: * -EINVAL: @s invalid, @base invalid (both programming errors), or @s doesn't start with an integer. I use "integer" in the mathematical sense here. * -ERANGE, *value == 0: integer underflows *value * -ERANGE, *value == ULLONG_MAX: integer overflows *value. If we lump underflow into the -EINVAL case, you have to check *endptr == '-' to find it. The argument is weak, because I don't have a caller ready that actually wants to find it. >> > + >> > + errno = 0; >> > + val = strtoull(s, &endp, base); >> >> if (*s = '-') { >> r = -ERANGE; >> val = 0; >> goto out; >> } > > This has another bug: makes endptr point to the middle of the string in > case we are parsing " xxx" or " -1234". It makes endptr point right behind the integer, doesn't it? When parsing " xxx", it points to the first 'x' (we skipped the spaces already). When parsing " -1234", it points behind '4'. >> > + if (errno) { >> > + r = -errno; >> > + goto out; >> > + } >> > + >> > + if (endp == s) { >> > + r = -EINVAL; >> > + goto out; >> > + } >> > + >> > +out: >> > + *value = val; >> > + *endptr = endp; >> > + return r; >> > +} >> > + >> > +/* Try to parse an unsigned integer, making sure the whole string is >> > parsed >> > + * >> > + * Have the same behavior of parse_uint(), but with an additional check >> > + * for additional data after the parsed number (in that case, the function >> > + * will return -EINVAL). >> >> What's assigned to *value then? > > Undefined. :-) > > >> >> > + */ >> >> Alternatively, you could make into parse_uint() do that check when >> endptr is null, and drop this function. Matter of taste. > > I considered doing that, but I believe it would be too subtle. > >> >> > +int parse_uint_full(const char *s, unsigned long long *value, int base) >> > +{ >> > + char *endp; >> > + int r; >> > + >> > + r = parse_uint(s, value, &endp, base); >> > + if (r < 0) { >> > + return r; >> > + } >> > + if (*endp) { >> >> Answering my own question: the parsed integer is assigned to *value then. > > I prefer to document it as undefined. If you want to use the parsed > integer even if there's additional data after it, you should use > parse_int() instead. Maybe. What I want is a proper function contract. That includes how *value is changed. *Especially* when it's an unspecified change. >> > + return -EINVAL; >> > + } >> > + >> > + return 0; >> > +} >> > + >> > int qemu_parse_fd(const char *param) >> > { >> > int fd;