Hi Duncan,

Please allow me to add a bit more context, which I probably should have
included in my original message.

We actually did see this in an R 3.1 beta that was pulled in by an apt-get,
and thought it had been released accidentally.  From my user perspective, the
parsing of a string like “1.2345678901234567890” into a factor was so
surprising that I assumed it was just a really bad bug that would be fixed
before the “real” release.  I didn’t bother reporting it, since I assumed beta
users would be heavily impacted and there was no way it wouldn’t be fixed.
Apologies for that mistake on my part.

After discovering that this new behavior really did ship in the GA release, I
went searching to see what was going on.  I found this bug, which states “If
you wish to express your opinion about the new behavior, please do so on the
R-devel mailing list.”

    https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751

So I’m sharing my opinion, as suggested.  Thanks to all for the time spent
reading it.


Let me also say, we are huge fans of R; many of our customers use R, and we
greatly appreciate the efforts of the R core team.  We are in the process of
contributing an H2O package back to the R community, and we thank the CRAN
moderators as well for their assistance in this process.  CRAN is a fantastic
resource.


I would like to share a little more insight on how this behavior affects us
in particular.  These points have probably already been debated, but let me
state them here again to provide the appropriate context.

1.  When dealing with larger and larger data, things become cumbersome.  Your
comment that specifying column types would work is true.  But when there are
thousands of columns or more, specifying them one by one becomes more and more
of a burden, and it becomes easier to make a mistake.  And when you do make a
mistake, you can imagine a tool writer choosing to just “do what it’s told”
and swallowing the mistake.  (Trying not to be smarter than the user.)

2.  When working with datasets that have more and more rows, sometimes there
is a bad row.  Big data is messy.  Having one bad value in one bad row
contaminate the entire dataset can be undesirable for some.  When you have
millions of rows or more, each row becomes less precious.  Many people would
rather just ignore the effects of the bad row than try to fix it, especially
in this case, where “bad” means a bit of extra precision that likely won’t
have a negative impact on the result.  (In our case, this extra precision was
the output of Java’s Double.toString().)

Our users want to use R as a driver language and a reference tool.  Being able
to interchange data easily (even just snippets) between tools is very
valuable.


Thanks,
Tom


Below is an example of how you can create a million-row dataset that works
fine (parses as a numeric), but where adding just one bad row (which still
*looks* numeric!) flips the entire column to a factor.  Finding that one row
out of a million+ can be quite a challenge.


# Script to generate dataset.
$ cat genDataset.py 
#!/usr/bin/env python

for x in range(0, 1000000):
    print (str(x) + ".1")

# Generate the dataset.
$ ./genDataset.py > million.csv

# R 3.1 thinks it’s a numeric.
$ R
> df = read.csv("million.csv")
> str(df)
'data.frame':   999999 obs. of  1 variable:
 $ X0.1: num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...

# Add one more over-precision row.
$ echo "1.2345678901234567890" >> million.csv 

# Now R 3.1 thinks it’s a factor.
$ R
> df2 = read.csv("million.csv")
> str(df2)
'data.frame':   1000000 obs. of  1 variable:
 $ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1 111113 222224 333335 444446 555557 666668 777779 888890 3 ...
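
One way to hunt for such a row outside of R is sketched below.  This is only
a minimal sketch, not a description of what R does internally: it assumes
that 15 significant digits is a reasonable cutoff (roughly what a double can
round-trip), which may not match the exact rule R 3.1 applies.  The names
significant_digits and find_over_precise are made up for illustration.

```python
#!/usr/bin/env python
"""Flag numeric-looking values with more significant digits than a cutoff."""
import re

# Assumption: 15 significant digits is used as the cutoff.  This is a
# heuristic for "more precision than a double reliably keeps", not
# necessarily the exact test R 3.1 performs.
MAX_SIG_DIGITS = 15

# Plain decimal literals such as 12, 1.5, .5, -3.25 or 1.5e10.
_DECIMAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def significant_digits(text):
    """Return the digit count of a decimal literal, or None if not numeric.

    Leading zeros are ignored; trailing zeros are counted.
    """
    text = text.strip()
    if not _DECIMAL.match(text):
        return None
    # Drop the exponent and the sign, then the decimal point.
    mantissa = re.split(r"[eE]", text)[0].lstrip("+-")
    return len(mantissa.replace(".", "").lstrip("0"))


def find_over_precise(lines, limit=MAX_SIG_DIGITS):
    """Yield (line_number, value) for values exceeding `limit` digits."""
    for number, line in enumerate(lines, start=1):
        value = line.strip()
        count = significant_digits(value)
        if count is not None and count > limit:
            yield number, value
```

For the single-column file above, something like
list(find_over_precise(open("million.csv"))) would report the line number and
text of the over-precise value.  A file with several columns would need to be
split into fields first.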





On Apr 26, 2014, at 4:28 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
>> 
>> Hi,
>> 
>> We at 0xdata use Java and R together, and the new behavior for read.csv has
>> made R unable to read the output of Java’s Double.toString().
> 
> It may be less convenient, but it's certainly not "unable".  Use colClasses.
> 
> 
>> 
>> This, needless to say, is disruptive for us.  (Actually, it was downright 
>> shocking.)
> 
> It wouldn't have been a shock if you had tested pre-release versions. 
> Commercial users of R should be contributing to its development, and that's a 
> really easy way to do so.
> 
> Duncan Murdoch
> 
>> 
>> +1 for restoring old behavior.
> 
> 
> 



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
