These are some points prompted by reading about C history (they are related in how they would be implemented).


1) On some platforms

as.integer("0xA")
[1] 10

but not all (not on Solaris nor Windows). We do not define what is allowed, and rely on the OS's implementation of strtod (yes, not strtol). It seems that glibc does allow hex: C99 mandates it but C89 seems not to allow it.

I think that was a mistake, and strtol should have been used. C89 does
mandate that strtol handle hex constants, and also octal ones (when it is called with base 0). So changing to strtol would change the meaning of as.integer("011"), from 11 to 9.
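
To make the difference concrete, here is a minimal C sketch comparing the calls in question. The three wrapper names are hypothetical and exist only for illustration; they are not functions in the R sources.

```c
#include <stdlib.h>

/* Hypothetical wrappers, for side-by-side comparison only. */

/* Base 10: a leading 0 is just a digit, and "0x" is not a prefix. */
long parse_base10(const char *s) { return strtol(s, NULL, 10); }

/* Base 0: strtol infers the base, so "0x" means hex and a leading 0 means octal. */
long parse_base0(const char *s) { return strtol(s, NULL, 0); }

/* strtod: decimal always works; hex input is accepted only by C99-conforming libraries. */
double parse_strtod(const char *s) { return strtod(s, NULL); }
```

With these, "011" gives 11 via parse_base10 and parse_strtod but 9 via parse_base0, while "0xA" gives 10 via parse_base0 and 0 via parse_base10 (which stops at the 'x').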


Proposal: we handle this ourselves and define what values are acceptable,
namely for as.integer:

[+|-][0-9]+
NA
0[x|X][0-9A-Fa-f]+

in all cases such that the converted value is in-range. (This does mean as.integer("1e+05") would be invalid, but is it clear that is allowed now?)
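
A rough sketch of what such a hand-rolled converter might look like in C. The names parse_r_integer, int_parse and digit_value are hypothetical. The sketch assumes that no surrounding whitespace is allowed, that hex constants take no sign (as in the patterns above), and that the valid range is -INT_MAX..INT_MAX, since R reserves INT_MIN for NA_integer_.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum int_parse { INT_OK, INT_NA, INT_INVALID };

/* Value of an ASCII digit in the given base, or -1. Explicit comparisons
   are used so that no locale-specific digits can be accepted. */
static int digit_value(char c, int base)
{
    int d;
    if (c >= '0' && c <= '9') d = c - '0';
    else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
    else return -1;
    return d < base ? d : -1;
}

/* Accepts exactly: NA, [+|-][0-9]+, 0[x|X][0-9A-Fa-f]+ (unsigned hex). */
enum int_parse parse_r_integer(const char *s, int *out)
{
    const char *p = s;
    int neg = 0, base = 10;
    long long acc = 0;

    if (strcmp(s, "NA") == 0) return INT_NA;
    if (*p == '+' || *p == '-') { neg = (*p == '-'); p++; }
    /* A hex prefix is recognised only when there was no sign. */
    if (p == s && p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) {
        base = 16;
        p += 2;
    }
    if (*p == '\0') return INT_INVALID;          /* no digits at all */
    for (; *p; p++) {
        int d = digit_value(*p, base);
        if (d < 0) return INT_INVALID;           /* e.g. the 'e' in "1e+05" */
        acc = acc * base + d;
        if (acc > INT_MAX) return INT_INVALID;   /* out of range */
    }
    *out = (int)(neg ? -acc : acc);
    return INT_OK;
}
```

Under these assumptions "0xA" gives 10 and "011" gives 11 on every platform, while "1e+05" and "2147483648" are rejected.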

For as.numeric(), probably the C99 rules (which include NaN, Inf, Infinity, and we need to add NA).

Alternatively, make and document the semantics to be
as.integer(as.numeric(char_string)) (which is effectively what we have now, although it causes surprises).


[As a side point, some locales may accept non-Roman digits. I think we need to exclude those everywhere, not just some places like parsing.]


2) R does not have integer constants. It would be convenient if it did, and I can see no difficulty in allowing the same conversions when parsing as when coercing. This would have the side effect that 100 would be integer (although the coercion rules would come into play), whereas 200000000000000000 would be double. And x <- 0xce80 would be valid.
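
As an illustration of the rule just described, here is a small C sketch of how a numeric token might be classified. The names const_kind and classify_constant are hypothetical, the token is assumed to be unsigned (a sign would be a separate operator), and the fallback uses the C library's strtod, so it is only reliable with LC_NUMERIC=C.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical classification of a numeric token. */
enum const_kind { CONST_INTEGER, CONST_DOUBLE, CONST_INVALID };

enum const_kind classify_constant(const char *tok)
{
    size_t n = strlen(tok);
    int is_hex = n > 2 && tok[0] == '0' && (tok[1] == 'x' || tok[1] == 'X');
    const char *digits = is_hex ? "0123456789abcdefABCDEF" : "0123456789";
    const char *body = is_hex ? tok + 2 : tok;
    char *end;

    if (n == 0) return CONST_INVALID;

    /* All digits: integer if the value fits in an int, otherwise double. */
    if (strspn(body, digits) == strlen(body)) {
        long v;
        errno = 0;
        v = strtol(body, NULL, is_hex ? 16 : 10);
        return (errno == 0 && v <= INT_MAX) ? CONST_INTEGER : CONST_DOUBLE;
    }

    /* Anything else is a double if strtod consumes the whole token. */
    (void) strtod(tok, &end);
    return (end != tok && *end == '\0') ? CONST_DOUBLE : CONST_INVALID;
}
```

With this sketch 100 and 0xce80 are classified as integer, and 200000000000000000 and 3.12 as double.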



3) We do allow setting LC_NUMERIC, but it partially breaks R if the decimal point is not ".". (I know of no locale in which it is not "." or ",", and we cannot allow "," as part of numeric constants when parsing.) E.g.:


Sys.setlocale("LC_NUMERIC", "fr_FR")
[1] "fr_FR"
Warning message:
setting 'LC_NUMERIC' may cause R to function strangely in: setlocale(category, locale)
x <- 3.12
x
[1] 3
as.numeric("3,12")
[1] 3,12
as.numeric("3.12")
[1] NA
Warning message:
NAs introduced by coercion

We could do better by insisting that "." was the decimal point in all internal conversions _to_ numeric. Then the effect of setting LC_NUMERIC would primarily be on conversions _from_ numeric, especially printing and graphical output. (One issue would be what to do with scan(), which has a `dec' argument but is implemented assuming LC_NUMERIC=C. I would hope to continue to have `dec' but perhaps with a locale-dependent default.) The resulting asymmetry (R would not be able to parse its own output) would be unfortunate, but seems inevitable. (This could be implemented easily by having a `dec' arg to EncodeReal and EncodeComplex, and using LC_NUMERIC to control that rather than actually setting the locale category. For example, deparsing needs to be done in LC_NUMERIC=C.)
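
The `dec' idea for output can be sketched in C as follows. This is not R's EncodeReal: format_real_dec is a hypothetical helper, and it assumes the process itself stays in LC_NUMERIC=C so that snprintf always writes ".".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format x with `digits' significant digits and use
   `dec' as the decimal point. Returns the length written, or -1 if the
   buffer is too small. Assumes LC_NUMERIC=C, so snprintf emits '.'. */
int format_real_dec(char *buf, size_t size, double x, int digits, char dec)
{
    char *p;
    int n = snprintf(buf, size, "%.*g", digits, x);

    if (n < 0 || (size_t) n >= size) return -1;
    p = strchr(buf, '.');      /* at most one '.' in %g output */
    if (p != NULL) *p = dec;
    return n;
}
```

Printing would pass the decimal character wanted by the user's locale, while deparsing would always pass '.'.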


All of these could be implemented by customized versions of strtod/strtol.
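
For the numeric case, one possible shape of such a customized strtod is sketched below in C. The name parse_double_dot is hypothetical. The sketch assumes a single-byte locale decimal point and a short input, and it leaves out the NA handling mentioned earlier; it simply translates "." into whatever the current locale expects before calling the C library.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: accept s only if it is a number written with '.'
   as the decimal point, whatever LC_NUMERIC is set to.
   Returns 1 and stores the value in *out on success, 0 otherwise. */
int parse_double_dot(const char *s, double *out)
{
    char buf[128];
    char *end;
    double v;
    size_t i, n = strlen(s);
    const char *dp = localeconv()->decimal_point;

    if (n == 0 || n >= sizeof buf) return 0;
    if (strlen(dp) != 1) return 0;     /* assumption: single-byte decimal point */

    for (i = 0; i <= n; i++) {         /* copies the terminating NUL as well */
        char c = s[i];
        if (c == '.') c = dp[0];       /* what the C library expects here */
        else if (c == dp[0]) return 0; /* e.g. reject "3,12" under fr_FR */
        buf[i] = c;
    }

    v = strtod(buf, &end);
    if (end == buf || *end != '\0') return 0;   /* nothing parsed, or trailing junk */
    *out = v;
    return 1;
}
```

Under these assumptions "3.12" parses to 3.12 and "3,12" is rejected whether or not LC_NUMERIC has been changed.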



-- 
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595

______________________________________________
R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
