Dear colleagues,
With the help of a colleague of mine here in Helsinki (Seppo Nyrkkö),
who looked at the innards of the R source code for the Mac, it turned
out that this was in the end an issue with the Mac locale settings,
and not with R itself.
Though we had tried this earlier by changing the LANG variable to
'fi_FI', we hadn't looked hard enough through the available encodings
(listed with locale -a) to pick exactly the right value, which is:
LANG=fi_FI.ISO8859-1;
export LANG;
With this configuration R was able to happily read in my original
table with the Scandinavian characters in the header, without any fuss.
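For anyone else running into this: a quick way to confirm from within R
that the locale change actually took effect is something like the
following (the file name is just a placeholder):

Sys.getlocale("LC_CTYPE")          # should now report "fi_FI.ISO8859-1"
data <- read.table("<filename>", header=TRUE)
names(data)                        # accented header names come through intact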
Thanks for your advice, and wishing all a good Midsummer,
-Antti Arppe
On Mon, 12 Jun 2006 [EMAIL PROTECTED] wrote:
1. Reading in a table originally with ISO-latin1 encoding
(Linux) (Jörg Beyer)
----------------------------------------------------------------------
Message: 1
Date: Sun, 11 Jun 2006 13:30:35 +0200
From: Jörg Beyer <[EMAIL PROTECTED]>
Subject: [R-SIG-Mac] Reading in a table originally with ISO-latin1
encoding (Linux)
To: <[email protected]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="US-ASCII"
Antti,
I think I can offer some help. I can add the following, observed with
R 2.1.1 w/ R.app 1.14
Mac OS X 10.4.6, PPC G4/400 (Oct. 1999)
If you are only interested in the solution, you can skip the following
report and jump to the last paragraph.
A tabulated data file with German umlauts in some column headers shows the
same behavior as yours, if I use your command
data <- read.table(file("<filename>", encoding="<encoding>"),
header=TRUE)
or these variations
data <- read.table(file("<filename>"), header=TRUE)
data <- read.table(file("<filename>"), header=FALSE)
In all these cases, the same strange behavior results
-- regardless of whether the file is encoded as "latin1", "utf-8" or the
generic "Mac Roman"
-- regardless of whether you choose UTF-8 with or without BOM
-- regardless of whether you choose Mac, DOS, or UNIX line feeds
-- regardless of whether you choose Apple's TextEdit, TextWrangler or BBEdit
for setting/changing the encoding (I prefer the latter for its fine tuning,
automation, and scripting features)
-- regardless of whether you try to read the file with R on the terminal, or
with R.app (the Mac GUI)
-- strangely enough, R *croaks about "incomplete lines"* even if there are no
accented characters (or multibyte characters) in your data file at all,
*just plain ASCII*... indicating that the problem may be located deeper in
the parsing process, not in the character set (a quick check for this is
sketched right after this list).
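A quick check, just as a sketch with a placeholder file name:
count.fields() reports how many fields R finds on each line, so uneven
counts point to the line/field splitting rather than to the encoding.

count.fields("<filename>", sep="")      # default splitting on white space
count.fields("<filename>", sep="\t")    # explicit tab separator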
At this point I read (again) the "read.table" help page and found it a bit
misleading -- the description of the default sep="" reads as if the file is
first read line by line (1st step), and then every line is split into columns
wherever a stretch of white space is found (2nd step).
I think this is not the case. If you modify your command and explicitly add
the separator option (tab, in this case)
data <- read.table(file("<filename>", encoding="<encoding>"), sep="\t",
header=TRUE)
my file reads in without any problems, be it Latin-1 or UTF-8 (not sure
how to handle Mac Roman files, at the moment).
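As a side note, read.delim() has sep="\t" and header=TRUE built in as
defaults, so an equivalent, slightly shorter call would be:

data <- read.delim(file("<filename>", encoding="<encoding>"))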
But keep in mind that multibyte characters, while possible in variable
names (or column headers), are not recommended there.
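If the headers do give trouble, one possible workaround (just a sketch --
iconv's transliteration support depends on the platform) is to keep the
names exactly as read and transliterate them to plain ASCII afterwards:

data <- read.table(file("<filename>", encoding="latin1"), sep="\t",
                   header=TRUE, check.names=FALSE)
# replace the accented characters with ASCII approximations
names(data) <- iconv(names(data), to="ASCII//TRANSLIT")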
Hope this helps.
Cheers
Joerg
------------------------------
_______________________________________________
R-SIG-Mac mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-mac