Hi Duncan,
Please allow me to add a bit more context, which I probably should have added to my original message.

We actually did see this in an R 3.1 beta which was pulled in by an apt-get, and we thought it had been released accidentally. From my user perspective, the parsing of a string like 1.2345678901234567890 into a factor was so surprising that I assumed it was just a really bad bug that would be fixed before the "real" release. I didn't bother reporting it, since I assumed beta users would be heavily impacted and there was no way it wouldn't be fixed. Apologies for that mistake on my part.

After discovering that this new behavior really did go out in the GA release, I went searching to see what was going on. I found this bug, which states "If you wish to express your opinion about the new behavior, please do so on the R-devel mailing list."

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751

So I'm sharing my opinion, as suggested. Thanks to all for the time spent reading it.

Let me also say, we are huge fans of R; many of our customers use R, and we greatly appreciate the efforts of the R core team. We are in the process of contributing an H2O package back to the R community, and thanks to the CRAN moderators as well for their assistance in this process. CRAN is a fantastic resource.

I would like to share a little more insight on how this behavior affects us in particular. These merits have probably already been debated, but let me state them here again to provide the appropriate context.

1. When dealing with larger and larger data, things become cumbersome. Your comment that specifying column types would work is true. But when there are thousands of columns, specifying them one by one becomes more and more of a burden, and it becomes easier to make a mistake. (A sketch of a recycled-colClasses workaround appears after the example below.) And when you do make a mistake, you can imagine a tool writer choosing to just do what it's told and swallowing the mistake. (Trying not to be smarter than the user.)

2. When working with datasets that have more and more rows, sometimes there is a bad row. Big data is messy. Having one bad value in one bad row contaminate the entire dataset can be undesirable for some. When you have millions of rows or more, each row becomes less precious. Many people would rather just ignore the effects of the bad row than try to fix it, especially in this case, when "bad" means a bit of extra precision that likely won't have a negative impact on the result. (In our case, this extra precision was the output of Java's Double.toString().)

Our users want to use R as a driver language and a reference tool. Being able to interchange data easily (even just snippets) between tools is very valuable.

Thanks,
Tom


Below is an example of how you can create a million-row dataset which works fine (parses as a numeric), but then adding just one bad row (which still *looks* numeric!) flips the entire column to a factor. Finding that one row out of a million+ can be quite a challenge.

# Script to generate the dataset.
$ cat genDataset.py
#!/usr/bin/env python

for x in range(0, 1000000):
    print (str(x) + ".1")

# Generate the dataset.
$ ./genDataset.py > million.csv

# R 3.1 thinks it's a numeric.
$ R
> df = read.csv("million.csv")
> str(df)
'data.frame':   999999 obs. of  1 variable:
 $ X0.1: num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...

# Add one more over-precision row.
$ echo "1.2345678901234567890" >> million.csv

# Now R 3.1 thinks it's a factor.
$ R
> df2 = read.csv("million.csv")
> str(df2)
'data.frame':   1000000 obs. of  1 variable:
 $ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1 111113 222224 333335 444446 555557 666668 777779 888890 3 ...
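(For anyone hitting the same thing, here is a rough sketch of the two workarounds I know of, using the million.csv file from the example above. The recycled colClasses trick avoids listing thousands of column types by hand; the round-trip check is only a heuristic for locating the over-precision row, and it would also flag harmless differences such as trailing zeros.)

# An unnamed colClasses vector is recycled across all columns, so every
# column can be forced to numeric without enumerating them one by one.
df <- read.csv("million.csv", colClasses = "numeric")

# Heuristic hunt for the offending row: read the column as plain text,
# then flag values whose text does not survive a numeric round-trip
# (i.e. values carrying more precision than a double can represent).
raw <- read.csv("million.csv", colClasses = "character")
x <- raw[[1]]
suspect <- which(x != as.character(as.numeric(x)))
x[suspect]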
On Apr 26, 2014, at 4:28 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
>>
>> Hi,
>>
>> We at 0xdata use Java and R together, and the new behavior for read.csv has
>> made R unable to read the output of Java's Double.toString().
>
> It may be less convenient, but it's certainly not "unable".  Use colClasses.
>
>>
>> This, needless to say, is disruptive for us.  (Actually, it was downright
>> shocking.)
>
> It wouldn't have been a shock if you had tested pre-release versions.
> Commercial users of R should be contributing to its development, and that's a
> really easy way to do so.
>
> Duncan Murdoch
>
>>
>> +1 for restoring old behavior.