On Jan 25, 2011, at 5:49 AM, Karl Ove Hufthammer wrote:

> Matthew Dowle wrote:
>
>> I'm not sure, but note the difference in locale between
>> Linux (UTF-8) and Windows (non UTF-8). As far as I
>> understand it R much prefers UTF-8, which Windows doesn't
>> natively support. Otherwise you could just change your
>> Windows locale to a UTF-8 locale to make R happier.
>>
> [...]
>>
>> If anybody knows a way to trick R on Linux into thinking it has
>> an encoding similar to Windows then I may be able to take a
>> look if I can reproduce the problem in Linux.
>
> Changing the locale to an ISO 8859-1 locale, i.e.:
>
>   export LC_ALL="en_US.ISO-8859-1"
>   export LANG="en_US.ISO-8859-1"
>
> I could *not* reproduce it; that is, ‘table’ is as fast on the non-ASCII
> factor as it is on the ASCII factor.
>

Strange - are you sure you got the right locale names? Make sure the locale is listed in the output of locale -a. The above works on my Mac, but on my Linux system I have to use LANG=en_US.iso88591, and it *is* replicable, albeit with a much smaller hit:

> benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii),
>            table(unclass(x.fac.nascii)), replications=20 )
                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20   1.028 2.269316     1.020    0.004          0         0
2           table(x.fac.ascii)           20   0.453 1.000000     0.452    0.004          0         0
3          table(x.fac.nascii)           20   2.683 5.922737     2.684    0.000          0         0
1                 table(x.num)           20   1.028 2.269316     1.020    0.008          0         0

The main reason is that table() calls factor(), which does as.character(), which means 10^5 character conversions - a bad idea in that case. Why the penalty is so much higher on Windows I can't answer at the moment, as I'm not on a machine with a Windows VM.

FWIW, if you care about speed you should use tabulate() instead - it's much faster and incurs no penalty:

> benchmark( tabulate(x.num), tabulate(x.fac.ascii), tabulate(x.fac.nascii),
>            tabulate(unclass(x.fac.nascii)), replications=20 )
                             test replications elapsed relative user.self sys.self user.child sys.child
4 tabulate(unclass(x.fac.nascii))           20   0.027 1.421053     0.024        0          0         0
2           tabulate(x.fac.ascii)           20   0.023 1.210526     0.024        0          0         0
3          tabulate(x.fac.nascii)           20   0.024 1.263158     0.020        0          0         0
1                 tabulate(x.num)           20   0.019 1.000000     0.020        0          0         0

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
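
For illustration, here is a minimal sketch of using tabulate() in place of table() on a factor with non-ASCII levels. The data are hypothetical (the x.fac.nascii vector from the thread is not shown in this message), and the variable names lev, x and counts are made up for the example. tabulate() works on the integer codes of the factor, so the level labels have to be attached by hand afterwards.

```r
# Hypothetical stand-in for x.fac.nascii: 10^5 values drawn from three
# non-ASCII levels (written as escapes so the script is encoding-safe).
set.seed(1)
lev <- c("\u00e6", "\u00f8", "\u00e5")
x <- factor(sample(lev, 1e5, replace = TRUE), levels = lev)

# tabulate() counts the integer codes directly; nbins = nlevels(x) makes
# sure levels that happen to be empty still get a (zero) count.
counts <- tabulate(x, nbins = nlevels(x))
names(counts) <- levels(x)

# The counts agree with table(), which goes through as.character().
stopifnot(all(counts == as.vector(table(x))))
```

Unlike table(), the result is a plain named integer vector rather than an object of class "table", so code that relies on table methods would need adapting.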