On Thu, 17 May 2007, Petr Savicky wrote: >> strncasecmp is not standard C (not even C99), but R does have a substitute >> for it. Unfortunately strncasecmp is not usable with multibyte charsets: >> Linux systems have wcsncasecmp but that is not portable. In these days of >> widespread use of UTF-8 that is a blocking issue, I am afraid. > > What could help are the functions mbrtowc and towctrans and simple > long integer comparison. Are the functions mbrtowc and towctrans > available under Windows? mbrtowc seems to be available as Rmbrtowc > in src/gnuwin32/extra.c. > > I did not find towctrans defined in R sources, but it is in > gnuwin32/Rdll.hide
I don't see it in Rdll.hide. It is a C99 function (see your unix man page). > and used in do_tolower. Does this mean that tolower is not usable > with utf-8 under Windows? UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in so far as that makes sense: Chinese chars do not have 'case'). Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is not currently active. >> In the case of grep I think all you need is >> >> grep(tolower(pattern), tolower(x), fixed = TRUE) >> >> and similarly for regexpr. > > Yes. this is correct, but it has disadvantages. It needs more > space and, if value=TRUE, we would have to do something like > x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)] > This is hard to implement in src/library/base/R/grep.R, > where the call to .Internal(grep(pattern,...)) is the last command > and I think this should be preserved. > >>> Ignore case option is not meaningfull in gsub. >> >> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE) >> >> is different from 'ignore.case=FALSE', and I see the meaning as clear. >> So what did you mean? (Unfortunately the tolower trick does not work for >> [g]sub.) > > The meaning of ignore.case in [g]sub is problematic due to the following. > sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE) > produces > [1] "xyzD" "xyzd" > but the user may in fact need the following > [1] "XYZD" "xyzd" He may, but that is not what 'ignore case' means, more like 'case honouring'. > It is correct that "xyzD" "xyzd" is produced, but the user > should be aware of the fact that several substitutions like > x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE > sub("ABC", "XYZ", x) # ignore.case=FALSE > may be more useful. > > I have another question concerning the speed of grep. I expected that > fgrep_one function is slower than calling a library routine > for regular expressions. In particular, if the pattern has a lot of > long partial matches in the target string, I expected that it may be much > slower. A short example is > y <- "aaaaaaaaab" > x <- "aaaaaaaaaaaaaaaaaaab" > grep(y,x) > which requires 110 comparisons (10 comparisons for each of 11 possible > beginnings of y in x). In general, the complexity in the worst case is > O(m*n), where m,n are the lengths of y,x resp. I would expect that > the library function for matching regular expressions needs > time O(m+n) and so will be faster. However, the result obtained > on a larger example is > > > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "") > > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "") > > y <- paste(c(rep("a", times = 10000), x2), collapse = "") > > z <- rep(y, times = 100) > > > system.time(i <- grep(x1, z, fixed = T)) > [1] 1.970 0.000 1.985 0.000 0.000 > > > system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow > (*) > [1] 40.374 0.003 40.381 0.000 0.000 > > > system.time(i <- grep(x2, z, fixed = T)) > [1] 0.113 0.000 0.113 0.000 0.000 > > > system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than > fgrep_one > [1] 0.019 0.000 0.019 0.000 0.000 > > Do you have an explanation of these results, in particular (*)? Yes, there is a comment on the help page to that effect. But these are highly atypical uses. Try perl=TRUE, and be aware that the locale matters a lot in such tests (via the charset). No one is attempting to make R a fast string-processing language and so developers resources are spent on performance where it matters to more typical usage. (E.g. reducing duplication in as.double and friends speeds up just about every R session, and speeds up some numerical sessions dramatically.) -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel