-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/11/2011 03:37 PM, Laurent Gatto wrote:
> On 11 February 2011 19:39, Ben Bolker <bbol...@gmail.com> wrote:
>>
> [snip]
>>
>> What is dangerous/confusing is that R silently **wraps** longer lines if
>> fill=TRUE (which is the default for read.csv).  I encountered this when
>> working with a colleague on a long, messy CSV file that had some phantom
>> extra fields in some rows, which then turned into empty lines in the
>> data frame.
>>
>
> As a matter of fact, this is exactly what happened to a colleague of
> mine yesterday and caused her quite a bit of trouble. On the other
> hand, it could also be considered a 'bug' in the csv file. Although
> no formal specification exists for the csv format, RFC 4180 [1]
> indicates that 'each line should contain the same number of fields
> throughout the file'.
>
> [1] http://tools.ietf.org/html/rfc4180
>
> Best wishes,
>
> Laurent

  Asserting that the bug is in the CSV file is logically consistent, but
if this is true then the "fill=TRUE" argument (which is only needed when
the lines contain different numbers of fields) should not be allowed.

  I had never seen RFC4180 before -- interesting!  I note especially
points 5-7, which define the handling of double quotation marks (but say
nothing about single quotes or using backslashes as escape characters).

  Dealing with read.[table|csv] seems a bit of an Augean task
<http://en.wikipedia.org/wiki/Augeas> (hmmm, maybe I should write a
parallel document to Burns's _Inferno_ ...)

  cheers
    Ben

>
>> Here is an example and a workaround that runs count.fields on the
>> whole file to find the maximum number of fields in any line and set
>> col.names accordingly.  (It assumes you don't already have a file
>> named "test.csv" in your working directory ...)
>>
>> I haven't dug in to try to write a patch for this -- I wanted to test
>> the waters and see what people thought first, and I realize that
>> read.table() is a very complicated piece of code that embodies a lot of
>> tradeoffs, so there could be lots of different approaches to trying to
>> mitigate this problem. I appreciate very much how hard it is to write a
>> robust and general function to read data files, but I also think it's
>> really important to minimize the number of traps in read.table(), which
>> will often be the first part of R that new users encounter ...
>>
>> A quick fix for this might be to allow the number of lines analyzed
>> for length to be settable by the user, or to allow a settable 'maxcols'
>> parameter, although those would only help in the case where the user
>> already knows there is a problem.
>>
>>  cheers
>>    Ben Bolker
>>
>> ===============
>> ## passing the file name (rather than file("test.csv")) avoids
>> ## leaving an unused connection behind
>> writeLines(c("A,B,C,D",
>>              "1,a,b,c",
>>              "2,f,g,c",
>>              "3,a,i,j",
>>              "4,a,b,c",
>>              "5,d,e,f",
>>              "6,g,h,i,j,k,l,m,n"),
>>            con="test.csv")
>>
>>
>> read.csv("test.csv")
>> try(read.csv("test.csv",fill=FALSE))
>>
>> ## assumes header=TRUE, fill=TRUE; should be a little more careful
>> ## with comment, quote arguments (possibly explicit)
>> ## ... contains information about quote, comment.char
>> Read.csv <- function(fn,sep=",",...) {
>>   ## read the header line to get the declared column names
>>   colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>>   ncolnames <- length(colnames)
>>   ## largest number of fields found on any line of the file
>>   maxcols <- max(count.fields(fn,sep=sep,...))
>>   if (maxcols>ncolnames) {
>>     colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>>   }
>>   ## assumes you don't have any other columns labeled "V[large number]"
>>   ## sep is passed on explicitly so a non-default separator is honoured
>>   read.csv(fn,sep=sep,...,col.names=colnames)
>> }
>>
>> Read.csv("test.csv")
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1VsX4ACgkQc5UpGjwzenPwsgCfTtGo0kJSXhUTPcY+p7cgaiuq
zHAAnikRORUhqLP9O+6M5SwyZcFEW9uT
=Rb2R
-----END PGP SIGNATURE-----

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
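
A rough sketch of the check that underlies the count.fields workaround
quoted in this thread: run count.fields() over the whole file and report
the lines whose field count differs from the header's, before read.csv()
has a chance to wrap them.  The helper name check_fields and the
temporary file are made up for illustration and are not part of base R.
Note that count.fields() has its own defaults for quote and comment.char,
which differ from those of read.csv(), so a real file may need them set
explicitly through "...".

```r
## Hypothetical helper (not in base R): list the lines whose number of
## fields differs from that of the first (header) line.
## Line numbers refer to non-blank lines, since count.fields() skips
## blank lines by default.
check_fields <- function(fn, sep = ",", ...) {
  n <- count.fields(fn, sep = sep, ...)
  bad <- which(n != n[1])
  data.frame(line = bad,
             fields = n[bad],
             expected = rep(n[1], length(bad)))
}

## Small ragged file, in the spirit of the test.csv example
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "6,g,h,i,j,k,l,m,n"),
           con = tmp)

res <- check_fields(tmp)
print(res)
unlink(tmp)
```

A data frame with zero rows means every line has as many fields as the
header; anything else is a candidate for the silent wrapping described
above and is probably worth inspecting by hand before reading the file.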