Re: [R] The behaviour of read.csv().

2010-12-05 Thread Duncan Murdoch

On 03/12/2010 7:08 AM, Duncan Murdoch wrote:

On 02/12/2010 9:59 PM, Rolf Turner wrote:


On 3/12/2010, at 3:48 PM, David Scott wrote:


   On 03/12/10 14:33, Duncan Murdoch wrote:


SNIP


I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
The comment in the NEWS file suggests it was in response to some strange
csv file coming out of Excel.

The real problem with the CSV format is that there really isn't a well
defined standard for it.  The first RFC about it was published in 2005,
and it doesn't claim to be authoritative.  Excel is kind of a standard,
but it does some very weird things.  (For example:  enter the string 01
into a field.  To keep the leading 0, you need to type it as '01.  Save
the file, read it back:  goodbye 0.  At least that's what a website I
was just on says about Excel, and what OpenOffice does.)

I've been burned so many times by storing data in .csv files, that I
just avoid them whenever I can.

Absolutely agree with this Duncan. Playing around with .csv files is
like playing with some sort of unstable explosive. I also avoid them as
much as possible.


Where I work, everybody but me uses (yeuuccchhh!!!) Excel or SPSS.  If
we are to share data sets, *.csv files seem to be the most efficacious,
if not the only, way to go.


I was going to suggest using DIF rather than CSV.  It contains more
internal information about the file (including the type of each entry),
but has the disadvantage of being less readable, even though it is ascii.

However, in putting together a little demo, I found a couple of bugs in
the R implementation of read.DIF, and it looks as though it ignores the
internal type information.  Sigh.


As of r53778, the bugs I noticed should be fixed.  read.DIF now respects 
the internal type information, so it will keep character strings like 
001 as type character (unless you ask it to change the type).


Duncan Murdoch



Duncan Murdoch




So far, we've had very few problems.  The one that started off this thread
is the only one I can think of that related to the *.csv format.

At least *.csv files have the virtue of being ASCII files, whence if things
go wrong it is at least possible to dig into them with a text editor and
figure out just what the problem is.

cheers,

Rolf




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] The behaviour of read.csv().

2010-12-05 Thread Rolf Turner

On 6/12/2010, at 3:00 AM, Duncan Murdoch wrote:


 I was going to suggest using DIF rather than CSV.  It contains more
 internal information about the file (including the type of each entry),
 but has the disadvantage of being less readable, even though it is ascii.

I don't think DIF is really the answer. My colleagues are familiar
with the *.csv concept; they have never heard of ``DIF''.

As I have said, we have had but few problems using *.csv.  Better the
devil you know ...

Furthermore I have to deal with data provided by various sources
``external'' to the research project that I work for. I have to use
the data that these sources provide, in the format in which they
provide it.  If they give me *.csv files I count myself lucky.

Finally, there seems to be no ``write.DIF'' function, i.e. there is no
way to produce *.DIF output, as far as I can tell.  Hence it would not
seem practical to use *.DIF as a data exchange standard.
 
 However, in putting together a little demo, I found a couple of bugs in
 the R implementation of read.DIF, and it looks as though it ignores the
 internal type information.  Sigh.
 
 As of r53778, the bugs I noticed should be fixed.  read.DIF now respects 
 the internal type information, so it will keep character strings like 
 001 as type character (unless you ask it to change the type).


What does ``r53778'' mean?

cheers,

Rolf


Re: [R] The behaviour of read.csv().

2010-12-05 Thread David Winsemius


On Dec 5, 2010, at 2:14 PM, Rolf Turner wrote:



On 6/12/2010, at 3:00 AM, Duncan Murdoch wrote:

As of r53778, the bugs I noticed should be fixed.  read.DIF now respects
the internal type information, so it will keep character strings like
001 as type character (unless you ask it to change the type).



What does ``r53778'' mean?


I assumed it was a version sequence number:

http://cran.r-project.org/src/base-prerelease/



cheers,

Rolf



David Winsemius, MD
West Hartford, CT



Re: [R] The behaviour of read.csv().

2010-12-05 Thread Duncan Murdoch

On 05/12/2010 2:14 PM, Rolf Turner wrote:


On 6/12/2010, at 3:00 AM, Duncan Murdoch wrote:



I was going to suggest using DIF rather than CSV.  It contains more
internal information about the file (including the type of each entry),
but has the disadvantage of being less readable, even though it is ascii.


I don't think DIF is really the answer. My colleagues are familiar
with the *.csv concept; they have never heard of ``DIF''.

As I have said, we have had but few problems using *.csv.  Better the
devil you know ...

Furthermore I have to deal with data provided by various sources
``external'' to the research project that I work for. I have to use
the data that these sources provide, in the format in which they
provide it.  If they give me *.csv files I count myself lucky.

Finally, there seems to be no ``write.DIF'' function, i.e. there is no
way to produce *.DIF output, as far as I can tell.  Hence it would not
seem practical to use *.DIF as a data exchange standard.


Sure, those are good points.



However, in putting together a little demo, I found a couple of bugs in
the R implementation of read.DIF, and it looks as though it ignores the
internal type information.  Sigh.


As of r53778, the bugs I noticed should be fixed.  read.DIF now respects
the internal type information, so it will keep character strings like
001 as type character (unless you ask it to change the type).



What does ``r53778'' mean?


Revision 53778 from the version control system.  When you start 
R-patched or R-devel it will print this in the startup message, e.g.


R version 2.13.0 Under development (unstable) (2010-12-05 r53775)
                                                          ^^^^^^

(from just before I saved the changes).
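
For what it's worth, the same number can also be read from inside a
running session.  A minimal sketch, assuming only the standard
R.version list in base R (the name svn_rev is just illustrative):

```r
# R.version is a base-R list describing the running build; its
# "svn rev" element holds the Subversion revision as a string.
svn_rev <- R.version[["svn rev"]]

# R.version.string is the line shown in the startup banner.
cat(R.version.string, "\n")
cat("svn revision:", svn_rev, "\n")
```

On a released version this gives the revision the release was built
from.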

Duncan Murdoch



cheers,

Rolf




Re: [R] The behaviour of read.csv().

2010-12-03 Thread Duncan Murdoch

On 02/12/2010 9:59 PM, Rolf Turner wrote:


On 3/12/2010, at 3:48 PM, David Scott wrote:


  On 03/12/10 14:33, Duncan Murdoch wrote:


SNIP


I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
The comment in the NEWS file suggests it was in response to some strange
csv file coming out of Excel.

The real problem with the CSV format is that there really isn't a well
defined standard for it.  The first RFC about it was published in 2005,
and it doesn't claim to be authoritative.  Excel is kind of a standard,
but it does some very weird things.  (For example:  enter the string 01
into a field.  To keep the leading 0, you need to type it as '01.  Save
the file, read it back:  goodbye 0.  At least that's what a website I
was just on says about Excel, and what OpenOffice does.)

I've been burned so many times by storing data in .csv files, that I
just avoid them whenever I can.

Absolutely agree with this Duncan. Playing around with .csv files is
like playing with some sort of unstable explosive. I also avoid them as
much as possible.


Where I work, everybody but me uses (yeuuccchhh!!!) Excel or SPSS.  If
we are to share data sets, *.csv files seem to be the most efficacious,
if not the only, way to go.


I was going to suggest using DIF rather than CSV.  It contains more 
internal information about the file (including the type of each entry), 
but has the disadvantage of being less readable, even though it is ascii.


However, in putting together a little demo, I found a couple of bugs in 
the R implementation of read.DIF, and it looks as though it ignores the 
internal type information.  Sigh.


Duncan Murdoch




So far, we've had very few problems.  The one that started off this thread
is the only one I can think of that related to the *.csv format.

At least *.csv files have the virtue of being ASCII files, whence if things
go wrong it is at least possible to dig into them with a text editor and
figure out just what the problem is.

cheers,

Rolf




[R] The behaviour of read.csv().

2010-12-02 Thread Rolf Turner

I have recently been bitten by an aspect of the behaviour of
the read.csv() function.

Some lines in a (fairly large) *.csv file that I read in had
too many entries.  I would have hoped that this would cause
read.csv() to throw an error, or at least issue a warning,
but it read the file without complaint, putting the extra
entries into an additional line.

This behaviour is illustrated by the toy example in the
attached file ``junk.csv''.  Just do

junk <- read.csv("junk.csv",header=TRUE)
junk

to see the problem.

If the offending over-long line were in the fourth line of data
or earlier, an error would be thrown, but if it is in the fifth line
of data or later no error is given.

This is in a way compatible with what the help on read.csv()
says:

The number of data columns is determined by looking at
the first five lines of input (or the whole file if it
has less than five lines), or from the length of col.names
if it is specified and is longer.

However, the help for read.table() says the same thing.  And yet if
one does

gorp <- read.table("junk.csv",sep=",",header=TRUE)

one gets an error, whereas read.csv() gives none.

Am I correct in saying that is inappropriate behaviour on
the part of read.csv(), or am I missing something?
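
A self-contained sketch of the behaviour described above, assuming a
made-up file in place of the attached junk.csv (the contents and the
names tmp, filled and strict are illustrative, not the original
attachment).  The over-long line is the sixth line of data, i.e. beyond
the first five lines of input:

```r
# Hypothetical stand-in for junk.csv: 3 columns, but the 6th data
# line carries a 4th entry.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3",
             "4,5,6",
             "7,8,9",
             "10,11,12",
             "13,14,15",
             "16,17,18,19",   # over-long line
             "20,21,22"), tmp)

# read.csv() has fill=TRUE by default: no error, no warning, and the
# extra entry is wrapped into an additional row.
filled <- read.csv(tmp, header = TRUE)
print(filled)

# read.table() leaves fill at FALSE here, so the same file is rejected.
strict <- tryCatch(read.table(tmp, sep = ",", header = TRUE),
                   error = function(e) e)
print(conditionMessage(strict))

unlink(tmp)
```

The file has seven lines of data, yet the data frame from read.csv()
has more rows than that, with the stray entry sitting in column a of a
row of its own.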

cheers,

Rolf Turner



P. S.:
 sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/C/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] datasets  utils stats graphics  grDevices methods   base 

other attached packages:
[1] misc_0.0-13     gtools_2.6.2    spatstat_1.21-2 deldir_0.0-13
[5] mgcv_1.6-2      fortunes_1.4-0  MASS_7.3-8

loaded via a namespace (and not attached):
[1] grid_2.12.0        lattice_0.19-13    Matrix_0.999375-44 nlme_3.1-97
[5] tools_2.12.0



Re: [R] The behaviour of read.csv().

2010-12-02 Thread Phil Spector

Rolf -
   I'd suggest using

junk <- read.csv("junk.csv",header=TRUE,fill=FALSE)

if you don't want the behaviour you're seeing.

- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu


On Fri, 3 Dec 2010, Rolf Turner wrote:



I have recently been bitten by an aspect of the behaviour of
the read.csv() function.

Some lines in a (fairly large) *.csv file that I read in had
too many entries.  I would have hoped that this would cause
read.csv() to throw an error, or at least issue a warning,
but it read the file without complaint, putting the extra
entries into an additional line.

This behaviour is illustrated by the toy example in the
attached file ``junk.csv''.  Just do

junk <- read.csv("junk.csv",header=TRUE)
junk

to see the problem.

If the offending over-long line were in the fourth line of data
or earlier, an error would be thrown, but if it is in the fifth line
of data or later no error is given.

This is in a way compatible with what the help on read.csv()
says:

The number of data columns is determined by looking at
the first five lines of input (or the whole file if it
has less than five lines), or from the length of col.names
if it is specified and is longer.

However, the help for read.table() says the same thing.  And yet if
one does

gorp <- read.table("junk.csv",sep=",",header=TRUE)

one gets an error, whereas read.csv() gives none.

Am I correct in saying that is inappropriate behaviour on
the part of read.csv(), or am I missing something?

cheers,

Rolf Turner






Re: [R] The behaviour of read.csv().

2010-12-02 Thread Rolf Turner

On 3/12/2010, at 1:08 PM, Phil Spector wrote:

 Rolf -
I'd suggest using
 
 junk <- read.csv("junk.csv",header=TRUE,fill=FALSE)
 
 if you don't want the behaviour you're seeing.


The point is not that I don't want this kind of behaviour.
The point is that it seems to me to be unexpected and dangerous.

I can indeed take precautions against it, now that I know about it,
by specifying fill=FALSE.  Given that I remember to do so.

Now that you've pointed it out I can see that this is the reason
for the different behaviour between read.table() and read.csv();
in read.table() fill=FALSE is effectively the default.

Having fill=TRUE being the default in read.csv() strikes me as
being counter-intuitive and dangerous.
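
One way around having to remember is a small wrapper that fixes the
argument once.  This is only a sketch; read_csv_strict is a made-up
name, not something in base R, and the file is invented for the check:

```r
# Hypothetical helper: read.csv() with fill switched off, so a line
# with too many (or too few) entries stops the read with an error.
read_csv_strict <- function(file, ...) {
  read.csv(file, ..., fill = FALSE)
}

# Illustrative check on a made-up file whose 6th data line is too long.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6", "7,8,9", "10,11,12",
             "13,14,15", "16,17,18,19", "20,21,22"), tmp)
res <- tryCatch(read_csv_strict(tmp), error = function(e) e)
print(conditionMessage(res))
unlink(tmp)
```

Passing fill= to the wrapper as well would clash with the fixed value,
which is arguably the point.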

cheers,

Rolf



Re: [R] The behaviour of read.csv().

2010-12-02 Thread Peter Ehlers

On 2010-12-02 16:26, Rolf Turner wrote:


On 3/12/2010, at 1:08 PM, Phil Spector wrote:


Rolf -
I'd suggest using

 junk <- read.csv("junk.csv",header=TRUE,fill=FALSE)

if you don't want the behaviour you're seeing.



The point is not that I don't want this kind of behaviour.
The point is that it seems to me to be unexpected and dangerous.

I can indeed take precautions against it, now that I know about it,
by specifying fill=FALSE.  Given that I remember to do so.

Now that you've pointed it out I can see that this is the reason
for the different behaviour between read.table() and read.csv();
in read.table() fill=FALSE is effectively the default.

Having fill=TRUE being the default in read.csv() strikes me as
being counter-intuitive and dangerous.



Rolf,
This is not to argue with your point re counter-intuitive,
but I always run a count.fields() first if I haven't seen
(or can't easily see) the file in my editor. I must have
learned that the hard way a long time ago.

Peter Ehlers


cheers,

Rolf




Re: [R] The behaviour of read.csv().

2010-12-02 Thread Duncan Murdoch

On 02/12/2010 8:04 PM, Peter Ehlers wrote:

On 2010-12-02 16:26, Rolf Turner wrote:


On 3/12/2010, at 1:08 PM, Phil Spector wrote:


Rolf -
 I'd suggest using

  junk <- read.csv("junk.csv",header=TRUE,fill=FALSE)

if you don't want the behaviour you're seeing.



The point is not that I don't want this kind of behaviour.
The point is that it seems to me to be unexpected and dangerous.

I can indeed take precautions against it, now that I know about it,
by specifying fill=FALSE.  Given that I remember to do so.

Now that you've pointed it out I can see that this is the reason
for the different behaviour between read.table() and read.csv();
in read.table() fill=FALSE is effectively the default.

Having fill=TRUE being the default in read.csv() strikes me as
being counter-intuitive and dangerous.



Rolf,
This is not to argue with your point re counter-intuitive,
but I always run a count.fields() first if I haven't seen
(or can't easily see) the file in my editor. I must have
learned that the hard way a long time ago.


I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0. 
The comment in the NEWS file suggests it was in response to some strange 
csv file coming out of Excel.


The real problem with the CSV format is that there really isn't a well 
defined standard for it.  The first RFC about it was published in 2005, 
and it doesn't claim to be authoritative.  Excel is kind of a standard, 
but it does some very weird things.  (For example:  enter the string 01 
into a field.  To keep the leading 0, you need to type it as '01.  Save 
the file, read it back:  goodbye 0.  At least that's what a website I 
was just on says about Excel, and what OpenOffice does.)


I've been burned so many times by storing data in .csv files, that I 
just avoid them whenever I can.


Duncan Murdoch



Re: [R] The behaviour of read.csv().

2010-12-02 Thread Rolf Turner

On 3/12/2010, at 2:04 PM, Peter Ehlers wrote:

SNIP

 Rolf,
 This is not to argue with your point re counter-intuitive,
 but I always run a count.fields() first if I haven't seen
 (or can't easily see) the file in my editor. I must have
 learned that the hard way a long time ago.


Sound advice!  Thanks.  I'd just like to point out however
that it might be an idea to set quote="\"" in the call to
count.fields() --- to make its idea of how many fields there
are consistent with that of read.csv().  In count.fields()
quote defaults to "\"'" whereas in read.csv() it defaults
to "\"".
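
As a sketch of that check (the file contents are made up for
illustration), with sep and quote set to match read.csv().
comment.char is set to "" as well, since read.csv() uses "" there
while count.fields() defaults to "#":

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6", "7,8,9", "10,11,12",
             "13,14,15", "16,17,18,19", "20,21,22"), tmp)

# Number of fields on each line, counted the way read.csv() would.
n <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")
print(table(n))

# Lines (counting the header as line 1) whose field count differs
# from that of the header line.
bad <- which(n != n[1])
print(bad)

unlink(tmp)
```

Anything other than a single value in table(n) is a reason to look at
the file before reading it in.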

cheers,

Rolf



Re: [R] The behaviour of read.csv().

2010-12-02 Thread David Winsemius


On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:

snipped


I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.  
The comment in the NEWS file suggests it was in response to some  
strange csv file coming out of Excel.


The real problem with the CSV format is that there really isn't a  
well defined standard for it.  The first RFC about it was published  
in 2005, and it doesn't claim to be authoritative.  Excel is kind of  
a standard, but it does some very weird things.  (For example:   
enter the string 01 into a field.  To keep the leading 0, you need  
to type it as '01.  Save the file, read it back:  goodbye 0.  At  
least that's what a website I was just on says about Excel, and what  
OpenOffice does.)


In both Excel and in OO.org you can select a column (or any other
range) and set its format to text. (The default is numeric, not that
different from read.table()'s default behavior.) Once a format has
been set, you then do not need leading quotes. I just created a small
example in OO.org Calc, entering values with leading 0s and without
leading quotes, and this code runs as desired after copying the three
cells to the clipboard:


read.table(pipe("pbpaste"), colClasses="character")
    V1
1   01
2  004
3 0005

The same applies to date fields in both OO.org and Excel. In this
regard, it is simply a matter of understanding what is the defined  
behavior of your software and how one can manipulate it. This is no  
different than learning R's classes, coercing them to your ends, and  
dealing with other formatting issues.




I've been burned so many times by storing data in .csv files, that I  
just avoid them whenever I can.


No argument there. I know one physician whose weapon of choice is  
Stata who always uses | as his separator, but that's perhaps because  
he works entirely in Windows. I imagine that might not be the most  
uncommon character in *NIXen.


--

David Winsemius, MD
West Hartford, CT



Re: [R] The behaviour of read.csv().

2010-12-02 Thread Duncan Murdoch

On 02/12/2010 9:18 PM, David Winsemius wrote:


On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:

snipped


I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
The comment in the NEWS file suggests it was in response to some
strange csv file coming out of Excel.

The real problem with the CSV format is that there really isn't a
well defined standard for it.  The first RFC about it was published
in 2005, and it doesn't claim to be authoritative.  Excel is kind of
a standard, but it does some very weird things.  (For example:
enter the string 01 into a field.  To keep the leading 0, you need
to type it as '01.  Save the file, read it back:  goodbye 0.  At
least that's what a website I was just on says about Excel, and what
OpenOffice does.)


In both Excel and in OO.org you can select a column (or any other
range) and set its format to text. (The default is numeric, not that
different from read.table()'s default behavior.) Once a format has
been set, you then do not need leading quotes. I just created a small
example in OO.org Calc, entering values with leading 0s and without
leading quotes, and this code runs as desired after copying the three
cells to the clipboard:

read.table(pipe("pbpaste"), colClasses="character")
    V1
1   01
2  004
3 0005

The same applies to date fields in both OO.org and Excel. In this
regard, it is simply a matter of understanding what is the defined
behavior of your software and how one can manipulate it. This is no
different than learning R's classes, coercing them to your ends, and
dealing with other formatting issues.


You're right, I shouldn't have picked on Excel particularly here, but it 
really is a bizarre format that says the default way to read a file 
containing


V1
01
004
0005

is to assume that the column contains numeric values.  (Yes, read.csv() 
makes this same assumption.)  My main complaint is with the format.
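
To make the point concrete, a small sketch (the temporary file stands
in for the four-line file above; the names as_number and as_text are
illustrative).  By default the column is read as numbers and the
leading zeros are lost; asking for character via colClasses keeps them:

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("V1", "01", "004", "0005"), tmp)

# Default: the column is converted to a numeric (integer) type.
as_number <- read.csv(tmp)
str(as_number)

# colClasses = "character" suppresses the conversion.
as_text <- read.csv(tmp, colClasses = "character")
str(as_text)

unlink(tmp)
```

Nothing in the file itself says which of the two readings was
intended, which is the complaint about the format.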


Duncan Murdoch






I've been burned so many times by storing data in .csv files, that I
just avoid them whenever I can.


No argument there. I know one physician whose weapon of choice is
Stata who always uses | as his separator, but that's perhaps because
he works entirely in Windows. I imagine that might not be the most
uncommon character in *NIXen.

--

David Winsemius, MD
West Hartford, CT





Re: [R] The behaviour of read.csv().

2010-12-02 Thread David Scott

 On 03/12/10 14:33, Duncan Murdoch wrote:

On 02/12/2010 8:04 PM, Peter Ehlers wrote:

On 2010-12-02 16:26, Rolf Turner wrote:

On 3/12/2010, at 1:08 PM, Phil Spector wrote:


Rolf -
  I'd suggest using

   junk <- read.csv("junk.csv",header=TRUE,fill=FALSE)

if you don't want the behaviour you're seeing.


The point is not that I don't want this kind of behaviour.
The point is that it seems to me to be unexpected and dangerous.

I can indeed take precautions against it, now that I know about it,
by specifying fill=FALSE.  Given that I remember to do so.

Now that you've pointed it out I can see that this is the reason
for the different behaviour between read.table() and read.csv();
in read.table() fill=FALSE is effectively the default.

Having fill=TRUE being the default in read.csv() strikes me as
being counter-intuitive and dangerous.


Rolf,
This is not to argue with your point re counter-intuitive,
but I always run a count.fields() first if I haven't seen
(or can't easily see) the file in my editor. I must have
learned that the hard way a long time ago.

I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
The comment in the NEWS file suggests it was in response to some strange
csv file coming out of Excel.

The real problem with the CSV format is that there really isn't a well
defined standard for it.  The first RFC about it was published in 2005,
and it doesn't claim to be authoritative.  Excel is kind of a standard,
but it does some very weird things.  (For example:  enter the string 01
into a field.  To keep the leading 0, you need to type it as '01.  Save
the file, read it back:  goodbye 0.  At least that's what a website I
was just on says about Excel, and what OpenOffice does.)

I've been burned so many times by storing data in .csv files, that I
just avoid them whenever I can.

Absolutely agree with this, Duncan. Playing around with .csv files is
like playing with some sort of unstable explosive. I also avoid them as 
much as possible.


David Scott



Duncan Murdoch




--
_
David Scott Department of Statistics
The University of Auckland, PB 92019
Auckland 1142,NEW ZEALAND
Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
Email:  d.sc...@auckland.ac.nz,  Fax: +64 9 373 7018

Director of Consulting, Department of Statistics



Re: [R] The behaviour of read.csv().

2010-12-02 Thread Rolf Turner

On 3/12/2010, at 3:48 PM, David Scott wrote:

  On 03/12/10 14:33, Duncan Murdoch wrote:

SNIP

 I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
 The comment in the NEWS file suggests it was in response to some strange
 csv file coming out of Excel.
 
 The real problem with the CSV format is that there really isn't a well
 defined standard for it.  The first RFC about it was published in 2005,
 and it doesn't claim to be authoritative.  Excel is kind of a standard,
 but it does some very weird things.  (For example:  enter the string 01
 into a field.  To keep the leading 0, you need to type it as '01.  Save
 the file, read it back:  goodbye 0.  At least that's what a website I
 was just on says about Excel, and what OpenOffice does.)
 
 I've been burned so many times by storing data in .csv files, that I
 just avoid them whenever I can.
 Absolutely agree with this Duncan. Playing around with .csv files is 
 like playing with some sort of unstable explosive. I also avoid them as 
 much as possible.

Where I work, everybody but me uses (yeuuccchhh!!!) Excel or SPSS.  If
we are to share data sets, *.csv files seem to be the most efficacious,
if not the only, way to go.

So far, we've had very few problems.  The one that started off this thread
is the only one I can think of that related to the *.csv format.

At least *.csv files have the virtue of being ASCII files, whence if things
go wrong it is at least possible to dig into them with a text editor and
figure out just what the problem is.

cheers,

Rolf



Re: [R] The behaviour of read.csv().

2010-12-02 Thread David Winsemius


On Dec 2, 2010, at 9:33 PM, Duncan Murdoch wrote:


On 02/12/2010 9:18 PM, David Winsemius wrote:


On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:

snipped


I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
The comment in the NEWS file suggests it was in response to some
strange csv file coming out of Excel.

The real problem with the CSV format is that there really isn't a
well defined standard for it.  The first RFC about it was published
in 2005, and it doesn't claim to be authoritative.  Excel is kind of
a standard, but it does some very weird things.  (For example:
enter the string 01 into a field.  To keep the leading 0, you need
to type it as '01.  Save the file, read it back:  goodbye 0.  At
least that's what a website I was just on says about Excel, and what
OpenOffice does.)


In both Excel and in OO.org you can select a column (or any other
range) and set its format to text. (The default is numeric, not that
different from read.table()'s default behavior.) Once a format has
been set, you then do not need leading quotes. I just created a small
example in OO.org Calc, entering values with leading 0s and without
leading quotes, and this code runs as desired after copying the three
cells to the clipboard:

read.table(pipe("pbpaste"), colClasses="character")
    V1
1   01
2  004
3 0005

The same applies to date fields in both OO.org and Excel. In this
regard, it is simply a matter of understanding what is the defined
behavior of your software and how one can manipulate it. This is no
different than learning R's classes, coercing them to your ends, and
dealing with other formatting issues.


You're right, I shouldn't have picked on Excel particularly here,  
but it really is a bizarre format that says the default way to read  
a file containing


V1  # minor quibble. The V1 was added by read.table()
01
004
0005

is to assume that the column contains numeric values.


I'm a bit puzzled. Or maybe not. If you are criticizing the default
behavior of R's read.table then I do understand (though my reading of
the FM has taught me that one should expect numeric iff all of the
first n entries _are_ coercible to numeric without NA generation).
Excel is offering text exactly in the instances where it has been
told that the cell format is text.



 (Yes, read.csv() makes this same assumption.)  My main complaint is  
with the format.


Meaning the defaults chosen for read.csv()?

--
David.




Duncan Murdoch






I've been burned so many times by storing data in .csv files, that I
just avoid them whenever I can.


No argument there. I know one physician whose weapon of choice is
Stata who always uses | as his separator, but that's perhaps because
he works entirely in Windows. I imagine that might not be the most
uncommon character in *NIXen.

--

David Winsemius, MD
West Hartford, CT





David Winsemius, MD
West Hartford, CT
