A few comments, though (I've been offline through much of this, and away from a Windows machine for almost all).

1) You could have narrowed down the cause by saving and restarting the session. In particular it would have shown that the issue was not in sub() as you reported, since saving the object after the sub() call and starting a new session caused the problem in the second session.
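
A minimal sketch of the procedure in point 1, under assumptions: the object name `xy`, the toy input and the file name `xy-after-sub.RData` are illustrative only and are not taken from the original report. The idea is to save the object immediately after the suspect call, restart R, and run the failing step on the reloaded object. If the second session still fails, the trigger is the object itself rather than the call that produced it.

```r
# Session 1: build the object, then save it right after the suspect call.
x  <- c("1 x 2", "3 x 4")                    # toy input (assumption)
xy <- sub("x", "\u00d7", x, fixed = TRUE)    # the call that was blamed
save(xy, file = "xy-after-sub.RData")        # hypothetical file name

# Session 2: meant to be run after restarting R in the same directory.
load("xy-after-sub.RData")                   # restores 'xy'
d <- data.frame(xy)                          # the step that misbehaved
```

Running both halves in one session works as well, but only the restart separates the state left behind by sub() from the contents of the saved object.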

2) Using gctorture() makes such things happen on much smaller problems and more reliably (if no faster). (The underlying cause was more than one missing PROTECT.)
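
As a rough illustration of point 2, the reported example can be rerun on a much smaller problem with gctorture() switched on. The size `n <- 200` is an arbitrary choice for this sketch, and on a fixed version of R the call simply completes (slowly) rather than crashing.

```r
n <- 200                                 # arbitrary; far below the 1e5 of the report
k <- 10
x <- sample(k, n, replace = TRUE)
y <- sample(k, n, replace = TRUE)
xy <- paste(x, y, sep = " \u00d7 ")      # non-ASCII separator, as in the report
z <- sample(n)

gctorture(TRUE)                          # provoke garbage collection on allocations
# The finally clause switches torture mode off even if the call fails.
d <- tryCatch(data.frame(xy, z), finally = gctorture(FALSE))
```

With torture mode on, an object that the C code forgot to PROTECT is likely to be collected at the first opportunity, so the failure no longer depends on the problem being large.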

3) In 2.10.x the difference between fixed = TRUE (which you should have used in the first place) and the extended and PCRE versions often lies in the encoding of the result: use Encoding() to find out. Not only is fixed = TRUE much faster, it also avoids repeated re-encodings.
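
A small sketch of the check suggested in point 3, with made-up input strings. It runs the same substitution in the three modes and then inspects the declared encoding of each result. What Encoding() reports depends on the R version and the locale, so no output is shown here.

```r
x    <- c("1 x 2", "10 x 3")                  # toy input (assumption)
repl <- "\u00d7"                              # multiplication sign, marked as UTF-8

r_fixed <- sub("x", repl, x, fixed = TRUE)    # literal match, no regexp engine
r_ext   <- sub("x", repl, x)                  # default regular expressions
r_pcre  <- sub("x", repl, x, perl = TRUE)     # PCRE

# The characters should agree; the declared encodings are what may differ.
all(r_fixed == r_ext, r_ext == r_pcre)
Encoding(r_fixed)
Encoding(r_ext)
Encoding(r_pcre)
```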

4) Using UTF-8 encoded strings in a non-UTF-8 locale (and in particular on Windows) is a convenience but has performance implications. Unless you need text not representable in the current locale, convert your strings to the current charset. If you are using non-ASCII text and an 8-bit locale (e.g. CP1252 on Windows) then regexp computations will work somewhat faster in R-devel since they are performed in bytes (whereas 2.10.x uses wchar_t and for [g]sub returns the result in UTF-8).
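
One possible way to do the conversion mentioned in point 4, sketched with invented example strings. enc2native() and iconv() are standard base R functions; the variable names are illustrative. iconv() returns NA for elements it cannot represent in the current charset, which identifies the strings that genuinely need to stay in UTF-8.

```r
x_utf8 <- enc2utf8(c("1 \u00d7 2", "caf\u00e9", "plain"))   # toy input (assumption)
Encoding(x_utf8)                   # ASCII-only elements report "unknown"

# Convert to the encoding of the current locale.
x_native <- enc2native(x_utf8)

# Or convert explicitly; to = "" means the current locale's charset.
x_conv <- iconv(x_utf8, from = "UTF-8", to = "")
not_representable <- is.na(x_conv) # TRUE where the locale cannot hold the text
```

In a CP1252 locale both non-ASCII strings above should be representable; in a locale with a smaller charset some elements of `not_representable` may be TRUE.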

5) These reports show yet again that people are not doing enough to help in the alpha/beta testing period of 2.x.0. The R developers are almost exclusively using ASCII data or UTF-8 locales, so people doing extensive text processing in other locales please do take note of requests to test new versions of R.


On Tue, 15 Dec 2009, g.russ...@eos-solutions.com wrote:

The new version of R-devel from yesterday morning seems to have fixed bug
14114! Thanks a lot for your help.

Duncan Murdoch <murd...@stats.uwo.ca> wrote on 14.12.2009 13:34:35:

On 10/12/2009 4:20 AM, k...@huftis.org wrote:
Full_Name: Karl Ove Hufthammer
Version: 2.10.0
OS: Windows XP
Submission from: (NULL) (93.124.134.66)


I have found a rather strange bug in R 2.10.0 on Windows, where the
choice of characters used in a string makes R crash (i.e., Windows shows
a dialogue saying that the application has a problem and must be
closed).

This was related to encoding changes.  It likely appeared
Windows-specific because Windows uses a different default encoding than
most Linux systems.  I believe it is fixed now in R-devel, and it will
soon make it into 2.10.1-patched, but it came too late to make it into
today's release.

I believe PR#14114 was the same issue and is also fixed, but I did less
testing of it.  I'd appreciate it if those who saw either bug in real
code test the patches.  They should be in today's tarball of R-devel,
and did make it into the Windows binary build of R-devel this morning.

Duncan


I can reproduce the bug on two separate systems running Windows XP, and
with both R 2.10.0 and the latest R 2.10.1 RC.

The following commands trigger the crash for me:

n=1e5
k=10
x=sample(k,n,replace=TRUE)
y=sample(k,n,replace=TRUE)
xy=paste(x,y,sep=" × ")
z=sample(n)
d=data.frame(xy,z)

The last step takes a very long time, and R crashes before it's
finished. Note that if I reduce n, the problem disappears. Also, if I
change the × (a multiplication symbol) to an x (a letter), the problem
also disappears (and the last command takes almost no time to run).

I originally discovered this (or a related?) bug while using 'unique' on
a data frame similar to the 'd' data frame defined above, where R would
often, but not always, crash.

sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252
[2] LC_CTYPE=Norwegian-Nynorsk_Norway.1252
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Norwegian-Nynorsk_Norway.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595