[R] Strange characters that block import

2009-10-14 Thread arnaud Mosnier

Dear useRs,

I am trying to import a text file that contains some strange characters,
which come from another program's misinterpretation of foreign-language
characters (see below).


Here is an example of the text, with a line containing the characters that
break the import:

name;number
zdsfbg;2
^Z^Z ^Z^Z;3
dtryjh;4



R does not import the lines after those strange characters (i.e. it imports
only the first two lines: the header and the first line of data).

I have already tried importing with other encodings, such as latin1 or
UTF-8, but that does not solve the problem.

Replacing those characters in a text editor before importing solves the
problem, but I do not want the users of my script to have to edit the text
before the analysis in R.

Any hints?

Thanks

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Strange characters that block import

2009-10-14 Thread Duncan Murdoch


On 10/14/2009 8:25 AM, arnaud Mosnier wrote:

> Dear useRs,
>
> I am trying to import a text file that contains some strange characters,
> which come from another program's misinterpretation of foreign-language
> characters (see below).
>
> Here is an example of the text, with a line containing the characters that
> break the import:
>
> name;number
> zdsfbg;2
> ^Z^Z ^Z^Z;3
> dtryjh;4
>
> R does not import the lines after those strange characters (i.e. it imports
> only the first two lines: the header and the first line of data).
>
> I have already tried importing with other encodings, such as latin1 or
> UTF-8, but that does not solve the problem.
>
> Replacing those characters in a text editor before importing solves the
> problem, but I do not want the users of my script to have to edit the text
> before the analysis in R.
>
> Any hints?


Those funny characters are octal 032, Ctrl-Z.  Years ago that was 
defined on DOS/Windows as an end of file marker, and I guess our code 
still honours that.


You can work around it by stating that you're reading from a binary 
file, not a text file:


    f <- file("text.txt", "rb")

Then read.csv2(f) fails, but readLines(f) succeeds, so this works:

    f <- file("c:/temp/test.txt", "rb")
    read.csv2(textConnection(readLines(f)))
                   name number
    1            zdsfbg      2
    2 \032\032 \032\032      3
    3            dtryjh      4

    close(f)

I don't know if there are any characters that would cause readLines to 
fail, but there might be, so I'd suggest replacing the buggy software 
that caused all the problems in the first place.


Duncan Murdoch
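
The workaround above can be wrapped so that users of a script never have to
edit the file by hand. The following is only a minimal sketch, not code from
the thread: the helper name read_csv2_no_ctrl_z and the temporary test file
are hypothetical, and it assumes that every stray character is a Ctrl-Z
(octal 032) that can simply be dropped.

```r
# Hypothetical helper: read a semicolon-separated file that may contain
# stray Ctrl-Z (octal 032) bytes, without editing the file beforehand.
read_csv2_no_ctrl_z <- function(path) {
  con <- file(path, "rb")              # binary mode, as suggested above
  on.exit(close(con))                  # always release the connection
  lines <- readLines(con, warn = FALSE)
  lines <- gsub("\032", "", lines, fixed = TRUE)  # drop the Ctrl-Z characters
  read.csv2(text = lines, stringsAsFactors = FALSE)
}

# Build a small test file resembling the one in the original post.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("name;number\nzdsfbg;2\n\032\032 \032\032;3\ndtryjh;4\n"), tmp)

dat <- read_csv2_no_ctrl_z(tmp)
print(dat)
```

Dropping the characters, rather than keeping them as in the output above, is
a design choice; whether it is acceptable depends on what the garbled field
was supposed to contain.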






Re: [R] Strange characters that block import

2009-10-14 Thread Prof Brian Ripley


On Wed, 14 Oct 2009, Duncan Murdoch wrote:


> On 10/14/2009 8:25 AM, arnaud Mosnier wrote:
>
>> Dear useRs,
>>
>> I am trying to import a text file that contains some strange characters,
>> which come from another program's misinterpretation of foreign-language
>> characters (see below).
>>
>> Here is an example of the text, with a line containing the characters that
>> break the import:
>>
>> name;number
>> zdsfbg;2
>> ^Z^Z ^Z^Z;3
>> dtryjh;4
>>
>> R does not import the lines after those strange characters (i.e. it imports
>> only the first two lines: the header and the first line of data).
>>
>> I have already tried importing with other encodings, such as latin1 or
>> UTF-8, but that does not solve the problem.


If these are control characters (that is ^Z is Ctrl-Z, but we've no 
real information) then those are the same in every encoding that uses 
bytes (or at least those known to iconv).
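
A quick way to check the point about encodings is to convert the character
and compare the bytes. This is just an illustrative sketch, assuming the
character really is Ctrl-Z and using latin1 and UTF-8 because those are the
two encodings named in the original post; the variable names are arbitrary.

```r
# Ctrl-Z is the single byte 0x1a (octal 032, decimal 26).
ctrl_z <- "\032"
print(charToRaw(ctrl_z))

# Convert between the two encodings mentioned in the original post.
utf8 <- iconv(ctrl_z, from = "latin1", to = "UTF-8")

# The byte is unchanged, so choosing another encoding cannot make it go away.
same <- identical(charToRaw(utf8), charToRaw(ctrl_z))
print(same)
```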



>> Replacing those characters in a text editor before importing solves the
>> problem, but I do not want the users of my script to have to edit the
>> text before the analysis in R.
>>
>> Any hints?
>
> Those funny characters are octal 032, Ctrl-Z.  Years ago that was defined
> on DOS/Windows as an end of file marker, and I guess our code still
> honours that.


More to the point, the Windows C run-time does (AFAIK Ctrl-Z is still 
current as EOF under Windows, and Wikipedia says so too), but nothing 
in the original posting mentioned this was on Windows, and ctrl-Z has 
no effect on the two other OSes I tried which read such a file 
successfully.


So without a single piece of the 'at a minimum' information requested 
in the posting guide, we are guessing (and I am guessing your example 
was done under Windows, too).


> You can work around it by stating that you're reading from a binary file,
> not a text file:
>
>     f <- file("text.txt", "rb")
>
> Then read.csv2(f) fails, but readLines(f) succeeds, so this works:
>
>     f <- file("c:/temp/test.txt", "rb")
>     read.csv2(textConnection(readLines(f)))
>                    name number
>     1            zdsfbg      2
>     2 \032\032 \032\032      3
>     3            dtryjh      4
>
>     close(f)
>
> I don't know if there are any characters that would cause readLines to
> fail, but there might be, so I'd suggest replacing the buggy software that
> caused all the problems in the first place.


This is all a function of the OS's C runtime: I suspect Ctrl-D (eot) 
is interpreted as end-of-file on some OSes.  Nul (\0) will terminate 
strings (that's standard in C, and enforced in recent versions of R).



> Duncan Murdoch


--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
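
Since the behaviour depends on the OS's C runtime, it can help to look at the
raw bytes of a file before deciding how to read it. The sketch below is an
assumption-laden illustration rather than anything proposed in the thread:
the helper name find_suspect_bytes is hypothetical, and the set of bytes it
looks for (NUL, Ctrl-D, Ctrl-Z) is simply the set of characters mentioned in
the messages above.

```r
# Hypothetical helper: report where bytes that may upset text-mode reading
# occur in a file, by reading the whole file as raw bytes.
find_suspect_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  codes <- as.integer(bytes)
  # 0 = NUL, 4 = Ctrl-D (EOT), 26 = Ctrl-Z (octal 032)
  pos <- which(codes %in% c(0L, 4L, 26L))
  data.frame(position = pos,
             byte = sprintf("0x%02x", codes[pos]),
             stringsAsFactors = FALSE)
}

# Build a small test file resembling the one in the original post.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("name;number\nzdsfbg;2\n\032\032 \032\032;3\ndtryjh;4\n"), tmp)

hits <- find_suspect_bytes(tmp)
print(hits)
```

An empty result suggests that the file can be read in text mode without this
particular problem; it says nothing about other kinds of damage.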
