On Sat, 5 Jan 2008, hadley wickham wrote:

> On Jan 5, 2008 1:40 AM, Prof Brian Ripley <[EMAIL PROTECTED]> wrote:
>> I presume you want this only in a UTF-8 locale?
>
> Yes, although my assumption is that this will become an increasingly
> common locale as time goes by.

Probably, except on Windows.  UTF-8 support in commercial Unixen is often
poor (i.e. it is nominally there but does not work well).  The need for
other non-8-bit encodings is diminishing, e.g. the various shift encodings
for Japanese will likely die out, but over a long time.

>> Currently this is done by
>>
>> static int SkipSpace(void)
>> {
>>     int c;
>>     while ((c = xxgetc()) == ' ' || c == '\t' || c == '\f')
>>         /* nothing */;
>>     return c;
>> }
>>
>> in gram.c.  We could make use of isspace and its wide-char equivalent
>> iswspace.  However:
>>
>>
>> - there is the perennial debate over whether \v is whitespace.
>>
>>   R-lang says
>>
>>     Although not strictly tokens, stretches of whitespace characters
>>     (spaces and tabs) serve to delimit tokens in case of ambiguity,
>>
>>   which suggests it has a minimal view of whitespace.
>>
>>
>> - iswspace is often rather unreliable.  E.g. glibc says
>>
>>     The wide character class "space" always contains at least the space
>>     character and the control characters '\f', '\n', '\r', '\t', '\v'.
>>
>>   and I think it usually does not contain other forms of spaces.  More
>>   seriously
>>
>>     The behaviour of iswspace() depends on the LC_CTYPE category of the
>>     current locale.
>>
>>   so what is a space will depend on the encoding (hence my question about
>>   UTF-8).  And Ei-ji Nakama has replaced iswspace on MacOS X, because
>>   apparently it is wrongly implemented.
>>
>>
>> - it would complicate the parser as look-ahead would be needed (you would
>>   need to read the next mbcs, check if it were whitespace and pushback if
>>   needed).  We do that elsewhere, though.
>
> I had assumed the parser would be unicode/mb aware already and so
> would be an easy fix.

It's not, because it has to work on non-Unicode platforms (e.g. Windows 9x
until R 2.7.0), and even platforms in which wchar_t is not Unicode.  (More
precisely, we need to avoid making assumptions we can't verify on such
platforms.)
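
A minimal sketch of the look-ahead and pushback step mentioned in the quoted
list above, for illustration only and not the code in gram.c.  It assumes
UTF-8 input and handles just one multibyte blank, NBSP (bytes 0xC2 0xA0).
ByteStream, bs_getc, bs_ungetc and skip_space_utf8 are hypothetical names
standing in for the parser's own input routines.

```c
#include <assert.h>
#include <stddef.h>

#define END_OF_INPUT (-1)

/* Hypothetical byte stream standing in for the parser's input routines. */
typedef struct {
    const unsigned char *buf; /* input bytes, assumed to be UTF-8 */
    size_t len;               /* number of bytes in buf */
    size_t pos;               /* index of the next byte to read */
} ByteStream;

/* Return the next byte, or END_OF_INPUT when the buffer is exhausted. */
static int bs_getc(ByteStream *s)
{
    assert(s != NULL);
    return s->pos < s->len ? s->buf[s->pos++] : END_OF_INPUT;
}

/* Undo the most recent successful bs_getc(). */
static void bs_ungetc(ByteStream *s)
{
    assert(s != NULL && s->pos > 0);
    s->pos--;
}

/* Skip space, tab, form feed and UTF-8 encoded NBSP (0xC2 0xA0);
   return the first byte that is not part of such a blank. */
static int skip_space_utf8(ByteStream *s)
{
    for (;;) {
        int c = bs_getc(s);
        if (c == ' ' || c == '\t' || c == '\f')
            continue;
        if (c == 0xC2) {
            int next = bs_getc(s);   /* look ahead one byte */
            if (next == 0xA0)
                continue;            /* it was NBSP: keep skipping */
            if (next != END_OF_INPUT)
                bs_ungetc(s);        /* not NBSP: push the byte back */
        }
        return c;
    }
}
```

Given the bytes ' ', 0xC2, 0xA0, 'x', skip_space_utf8 returns 'x'; given
' ', 0xC2, 0xB5 it returns 0xC2 and leaves 0xB5 to be read next.  Hard-coding
the byte pattern avoids the locale dependence of iswspace, at the cost of
being valid only for UTF-8 input.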

There's a problem with 'whitespace': \n is whitespace but is a command
terminator -- so what should Unicode line and para separators map to?
I decided to use only blanks (in the sense of iswblank) as whitespace, and
further only to use the table that Ei-ji Nakama provided for us in
rlocale_data.h (adding NBSP).

So the new rules are that 'whitespace' in parsing is

- \t, \f (not \v, for historical reasons, I presume)
- NBSP in 8-bit Windows locales
- Unicode blanks in UTF-8 locales on internally Unicode machines (and I
  doubt UTF-8 locales exist anywhere else).

> The locale issues are clearly important and
> can't easily be swept under the rug.
>
>> The only one of these 'spaces' I have much sympathy for is NBSP (which is
>> also fairly easy to generate in CP1252).  It would be easy to add that.
>> Otherwise I am not convinced it is worth the work (and added uncertainty).
>
> That's reasonable.  Another related request would be treating curly
> quotes (single and double) the same way as normal quotes, but I'd
> imagine similar caveats would apply there.

And a bit more: they are directional so you would (I presume) only want to
match \u2019 to \u2018 etc.  That would be a lot of extra work.

> You could also imagine using unicode arrows in place of <- and ->, but
> that's probably heading too far down the apl/fortress road!

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595