Bump. It's been a week since I posted this to r-devel. Any thoughts/discussion? Would R-core be irritated if I submitted a bug report?
  cheers
   Ben

-------- Original Message --------
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker <bbol...@gmail.com>
To: r-de...@stat.math.ethz.ch <r-de...@stat.math.ethz.ch>, David Earn <e...@math.mcmaster.ca>

  This is not specifically a bug, but an (implicitly/obscurely) documented
behavior of read.csv (or read.table with fill=TRUE) that can be quite
dangerous/confusing for users. I would love to hear some discussion from
other users and/or R-core about this ... As always, I apologize if I have
missed some obvious workaround or some reason that this is actually the
desired behavior ...

  In a nutshell: when fill=TRUE, R guesses the number of columns from the
first 5 rows of the data set. That's fine, and ?read.table documents this:

     The number of data columns is determined by looking at the first
     five lines of input (or the whole file if it has less than five
     lines), or from the length of ‘col.names’ if it is specified and
     is longer. This could conceivably be wrong if ‘fill’ or
     ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary.

  What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv). I encountered this when
working with a colleague on a long, messy CSV file that had some phantom
extra fields in some rows, which then turned into spurious (mostly empty)
extra rows in the data frame.

  Here is an example, along with a workaround that runs count.fields on the
whole file to find the maximum number of columns and sets col.names
accordingly. (It assumes you don't already have a file named "test.csv" in
your working directory ...)

  I haven't dug in to try to write a patch for this -- I wanted to test the
waters and see what people thought first, and I realize that read.table() is
a very complicated piece of code that embodies a lot of tradeoffs, so there
could be lots of different approaches to mitigating this problem. I
appreciate very much how hard it is to write a robust and general function
for reading data files, but I also think it's really important to minimize
the number of traps in read.table(), which will often be the first part of R
that new users encounter ...

  A quick fix might be to let the user set the number of lines analyzed to
determine the column count, or to add a settable 'maxcols' parameter,
although either would only help when the user already knows there is a
problem.

  cheers
    Ben Bolker

===============

writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           con = file("test.csv"))

read.csv("test.csv")
try(read.csv("test.csv", fill = FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
## with the comment and quote arguments (possibly make them explicit)
## ... carries the quote, comment.char, sep information
Read.csv <- function(fn, sep = ",", ...) {
  ## read the header line to get the declared column names
  colnames <- scan(fn, nlines = 1, what = "character", sep = sep, ...)
  ncolnames <- length(colnames)
  ## count fields on the *whole* file, not just the first five lines
  maxcols <- max(count.fields(fn, sep = sep, ...))
  if (maxcols > ncolnames) {
    colnames <- c(colnames, paste("V", (ncolnames + 1):maxcols, sep = ""))
  }
  ## assumes you don't have any other columns labeled "V[large number]"
  read.csv(fn, ..., col.names = colnames)
}

Read.csv("test.csv")
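
A minimal sketch of the same idea taken the other way: instead of widening
col.names, run count.fields up front and refuse to read a ragged file at all.
This assumes the "test.csv" written above; read.csv_strict is only a
hypothetical helper name, not an existing function, and the comments about
the extra rows describe the wrapping behavior discussed in the post (a
9-field line split across three 4-column rows).

===============

d <- read.csv("test.csv")
nrow(d)  ## 8 rows rather than the 6 data lines, if the long line is wrapped
d$A      ## the wrapped fields "j" and "n" surface in the first column

## fail loudly instead of wrapping: compare field counts across the whole file
read.csv_strict <- function(fn, sep = ",", ...) {
  nf <- count.fields(fn, sep = sep)
  if (length(unique(nf)) > 1) {
    stop("ragged rows: line(s) ",
         paste(which(nf != nf[1]), collapse = ", "),
         " do not have ", nf[1], " fields")
  }
  read.csv(fn, sep = sep, ...)
}

try(read.csv_strict("test.csv"))  ## stops on line 7 instead of wrapping it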