On Sat, 15 Mar 2014, peter dalgaard wrote:

On 15 Mar 2014, at 20:54 , Mike Miller <mbmille...@gmail.com> wrote:

$ cat data1.txt
0.005
0.00499999999999989

I don't know why it shows 17 digits and doesn't round to 15, but it is showing 
that the numbers are different, for some reason.


Aiding my weakening eyesight a little:

0.004 999 999 999 999 89

Notice that that makes 15 _significant_ digits.

OK, now I feel really stupid. Of course it's 15 mantissa digits, not 15 %f digits, or whatever that should be called. Sorry about that.


Do you understand why there is a difference between 1-0.995 and 2-1.995 in their internal representations?

Let's see,  that'll be like

1 - 2/3 vs. 10 - 29/3

on a decimal computer if someone is perverse enough to give input in base 3 (i.e., 1.0 - 0.2 ternary vs. 101.0 - 100.2 ternary). Assuming that the computer is floating point with 3 significant digits (and possibly taking some liberties compared to what real computers really do), we have

   1 = 1.000 * 10^0
  10 = 1.000 * 10^1
 2/3 = 0.667 * 10^0
29/3 = 0.967 * 10^1

 1 - 2/3  = 0.333 * 10^0
10 - 29/3 = 0.033 * 10^1 = 0.330 * 10^0

So, yes, I think I do understand how these things can happen.
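
The toy decimal machine above can be imitated in R, as a rough sketch. The helper fl3() is hypothetical (it is not part of the thread) and simply rounds to 3 significant decimal digits with signif(); R itself still computes in binary doubles, so this only illustrates the idea.

```r
# Toy model of a decimal machine that keeps 3 significant digits.
# fl3() is a hypothetical helper introduced only for this illustration.
fl3 <- function(x) signif(x, 3)

a <- fl3(1)  - fl3(2/3)    # 1  - 0.667
b <- fl3(10) - fl3(29/3)   # 10 - 9.67

# The two differences should both be "1/3", but the second one has lost a digit.
print(c(a, b))
print(isTRUE(all.equal(a, b)))
```

The larger operands spend one of their three digits on the integer part, so less precision is left for the fractional part that survives the subtraction.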

Yes, and that's a nice explanation, but you had me at "_significant_". I don't know why I didn't get that in the first place. So the difference in my example is that 0.995 is 9.950e-1, where the 5 is the third significant digit, whereas in 1.995 the 5 is the fourth significant digit. That is why 1-0.995 provides a more precise representation of 0.005 than 2-1.995 does.
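
A short sketch of that difference in R itself. It assumes, as the write.table() help page describes, that numeric values are written with the internal equivalent of 15 significant digits; sprintf("%.15g", ...) is used here only to mimic that, and "%.17g" to show the extra digits a double carries.

```r
# The two ways of computing "0.005" from the example above.
x <- c(1 - 0.995, 2 - 1.995)

s15 <- sprintf("%.15g", x)   # 15 significant digits
s17 <- sprintf("%.17g", x)   # 17 significant digits, enough to tell any two doubles apart

print(s15)
print(s17)
print(x[1] == x[2])          # the two doubles are not the same number
```

At 15 significant digits the first value collapses to 0.005 while the second shows up as 0.00499999999999989, which is the string that ended up in data1.txt.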

I always knew there was some numerical reason why I was getting very long stretches of 9s or 0s in the write.table() output, but my concern is really with how to prevent that from happening. So the question still is, how do I avoid getting 0.00499999999999989 in my output file when I want 0.005? I'm sure I'm not alone in this. It looks like the standard answer is to use format(). For example, I could do this:

write.table(format(data, digits=13, trim=TRUE), file="data.txt",
            row.names=FALSE, col.names=FALSE, quote=FALSE)

That does fix the long numbers -- all of them are reduced to three decimal places (0.005). The one thing that concerns me is that when format() is called, isn't it making an object that could take up a lot of memory if the data frame is large? The data frame created by format() might use a lot more memory than the original data frame if it is converting a lot of doubles (8 bytes each) to a lot of strings of possibly 16 bytes each. For example, -10/81 takes up 8 bytes as a double, but converted by format() with digits=13 it uses 16 bytes, because the sign, the leading zero and the decimal point are added to the 13 digits (plus a delimiter when there are many values per line of output):

write.table(format(-10/81, digits=13), row.names=FALSE, col.names=FALSE, quote=FALSE)
-0.1234567901235
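
One rough way to check the memory concern is to compare object.size() for a numeric vector and for its format()ted version. This is only a sketch: the vector x is made-up test data, and object.size() gives an estimate, so the exact numbers will vary between R versions and platforms.

```r
# Made-up test data: 10,000 random doubles (an assumption for illustration).
set.seed(1)
x  <- runif(1e4)
xs <- format(x, digits = 13)   # character vector, one string per value

size_num <- object.size(x)     # estimated size of the doubles
size_chr <- object.size(xs)    # estimated size of the strings

print(size_num, units = "Kb")
print(size_chr, units = "Kb")
```

In R each string also carries its own header, so the character version is expected to come out larger than a simple count of characters would suggest.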

I'm assuming that write.table() is streaming the data into a file (or stdout) and not creating a complete representation of the output in memory before it does that. It looks like format() creates a data frame where all variables are converted to character type. Thus, it wouldn't be just for convenience that one might want digits=N to be an option in the write.table() function. It would be very useful with large data frames, making it possible to write out things that would be too large to handle using format(). When files are already super-large, we really want to avoid expanding the number of digits per value in the output.
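
One possible workaround, offered only as a sketch and not something proposed in this thread: round the numeric columns with signif() before calling write.table(), so that they stay doubles instead of becoming strings. The data frame below is a toy stand-in with hypothetical values, and the choice of 13 digits simply mirrors the format() call above.

```r
# Toy stand-in for the real data frame (hypothetical values).
data <- data.frame(id = c("a", "b"), x = c(1 - 0.995, 2 - 1.995))

# Round only the double columns to 13 significant digits; they remain doubles.
is_dbl <- vapply(data, is.double, logical(1))
data[is_dbl] <- lapply(data[is_dbl], signif, digits = 13)

# Write to stdout (file = "") and capture the lines so they can be inspected.
out <- capture.output(
  write.table(data, file = "", row.names = FALSE, col.names = FALSE, quote = FALSE)
)
print(out)
```

The rounding still makes a temporary copy of each numeric column, but the copy holds 8-byte doubles rather than strings, and the stray runs of 9s disappear because the rounded values print cleanly at 15 significant digits.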

Mike

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
