On Thu, 1 Jan 2015, Duncan Murdoch wrote:

On 01/01/2015 1:21 PM, Mike Miller wrote:
On Thu, 1 Jan 2015, Duncan Murdoch wrote:

On 31/12/2014 8:44 PM, David Winsemius wrote:

On Dec 31, 2014, at 3:24 PM, Mike Miller wrote:

This is probably a FAQ, and I don't really have a question about it, but I just 
ran across this in something I was working on:

as.integer(1000*1.003)
[1] 1002

I didn't expect it, but maybe I should have.  I guess it's about the machine 
precision added to the fact that as.integer always rounds down:


as.integer(1000*1.003 + 255 * .Machine$double.eps)
[1] 1002

as.integer(1000*1.003 + 256 * .Machine$double.eps)
[1] 1003


This does it right...

as.integer( round( 1000*1.003 ) )
[1] 1003

...but this seems to always give the same answer and it is a little faster in 
my application:

as.integer( 1000*1.003 + .1 )
[1] 1003


FYI - I'm reading in a long vector of numbers from a text file with no more 
than three digits to the right of the decimal.  I'm converting them to integers 
and saving them in binary format.


So just add 0.0001 or even .0000001 to all of them and coerce to integer.

I don't think the original problem was stated clearly, so I'm not sure
whether this is a solution, but it looks wrong to me.  If you want to
round to the nearest integer, why not use round() (without the
as.integer afterwards)?  Or if you really do want an integer, why add
0.1 or 0.0001, why not add 0.5 before calling as.integer()?  This is the
classical way to implement round().

To state the problem clearly, I'd like to know what result is expected
for any real number x.  Since R's numeric type only approximates the
real numbers we might not be able to get a perfect match, but at least
we could quantify how close we get.  Or is the input really character
data?  The original post mentioned reading numbers from a text file.


Maybe you'd like to know what I'm really doing.  I have 1600 text files
each with up to 16,000 lines with 3100 numbers per line, delimited by a
single space.  The numbers are between 0 and 2, inclusive, and they have
up to three digits to the right of the decimal.  Every possible value in
that range will occur in the data.  Some examples numbers: 0 1 2 0.325
1.12 1.9.  I want to multiply by 1000 and store them as 16-bit integers
(uint16).

I've been reading in the data like so:

data <- scan( file=FILE, what=double(), nmax=3100*16000)

At first I tried making the integers like so:

ptm <- proc.time() ; ints <- as.integer( 1000 * data ) ; proc.time()-ptm
    user  system elapsed
   0.187   0.387   0.574

I decided I should compare with the result I got using round():

ptm <- proc.time() ; ints2 <- as.integer( round( 1000 * data ) ) ; 
proc.time()-ptm
    user  system elapsed
   1.595   0.757   2.352

It is a curious fact that only a few of the values from 0 to 2000 disagree
between the two methods:

table( ints2[ ints2 != ints ] )

  1001  1003  1005  1007  1009  1011  1013  1015  1017  1019  1021  1023
35651 27020 15993 11505  8967  7549  6885  6064  5512  4828  4533  4112

I understand that it's all about the problem of representing digital
numbers in binary, but I still find some of the results a little
surprising, like that list of numbers from the table() output.  For
another example:

1000+3 - 1000*(1+3/1000)
[1] 1.136868e-13

3 - 1000*(0+3/1000)
[1] 0

2000+3 - 1000*(2+3/1000)
[1] 0

See what I mean?  So there is something special about the numbers around
1000.

I think it's really that there is something special about the numbers
near 1, and you're multiplying that by 1000.

Numbers from 1 to just below 2 are stored as their fractional part, with
52 bit precision.  Some intermediate calculations will store them with
64 bit precision.  52 bits gives about 15 or 16 decimal places.

If your number x is close to 3/1000, it is stored as the fractional part
of 2^9 * x.  This gives it an extra 2 or 3 decimal digits of precision,
so that's why these values are accurate.

If your number x is close to 2.003, it is stored as the fractional part
of x/2, i.e. with errors like 1.0015 would have.  So I would have
guessed that 2.006 would have the same problems as 1.003, but I thought
you didn't see that.  So I tried it myself, and I do see that:

1000+3 - 1000*(1+3/1000)
[1] 1.136868e-13
2000+6 - 1000*(2+6/1000)
[1] 2.273737e-13

Reading more closely, I see that you didn't test this particular case,
so there's no contradiction here.

The one thing I couldn't think of an explanation for is why other
numbers between 1 and 2 don't have the same sorts of problems.  So I
tried the following:

# Set data to 1.000 thru 1.999
data <- 1 + 0:999/1000

# Find the errors
errors <- 1000 + 0:999 - 1000*data

# Plot them
plot(data, errors)

The plot doesn't show a uniform distribution, but much more uniform than yours: so I think your data doesn't really cover all possible values from 0.000 to 1.999. (I get a similar plot if I look at cases where ints != ints2 with my data.)

No, my data do cover all of the values in the range (I checked that by listing all of them in a file:

tr ' ' '\n' < FILE | uniq | sort -n | uniq >| uniq.txt

They are definitely all there -- all 2001 of them. The thing is, your way of making the numbers is a little different from reading them from a file. You construct the number through an arithmetic operation and you get the error:

as.integer( 1000 * (1+118/1000) )
[1] 1117

(Your first error-producing value was 1.118.) But when the number is read froma file, it starts as character and is translated to numeric. So I start with the arithmetic, convert to character, back to numeric, and then I do not see the error:

as.integer( 1000 * as.numeric( as.character( 1+118/1000 ) ) )
[1] 1118

That reflects what was happening in my work. To see what I'm seeing, you have to do this:

data <- 1 + 0:999/1000
errors <- 1000 + 0:999 - 1000 * as.numeric( as.character( data ) )
plot(data, errors)

Or you could cover he full range from 0 to 2:

data <- 0:2000/1000
errors <- 0:2000 - 1000 * as.numeric( as.character( data ) )
plot(data, errors)

Mike

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to