I suspect plenty of data.table users use UPCs and other large integer-like doubles as identifiers in their data. Storing UPCs as character data takes roughly an order of magnitude more space than storing them as doubles, which is not really an acceptable alternative for a 1.5 billion row table: the UPCs alone take about 11 GiB of RAM as doubles (*crosses fingers for long vector support*).

However, the newest data.table (1.9.2) breaks this use case: grouping on such a column now merges distinct IDs (see example below). The developers are aware of this, but I guess speed for imprecise numbers is a higher priority than proper results for people using data with large IDs.

In any case, I thought people should be more aware of this, and maybe someone would have a suggested workaround. I'm currently stuck at SVN r1129 because I was hitting some crashing bugs in 1.8.10.

For the interested, you can track the feature request at:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978

The relevant NEWS item:

    Numeric data is still joined and grouped within tolerance as before
    but instead of tolerance being sqrt(.Machine$double.eps) ==
    1.490116e-08 (the same as base::all.equal's default) the significand
    is now rounded to the last 2 bytes, apx 11 s.f. This is more
    appropriate for large (1.23e20) and small (1.23e-20) numerics and is
    faster via a simple bit twiddle. A few functions provided a
    'tolerance' argument but this wasn't being passed through so has
    been removed. We aim to add a global option (e.g. 2, 1 or 0 byte
    rounding) in a future release.


library(data.table)
DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
                         314775802940, 314775803490, 314775803491,
                         314775815510, 314775815511, 314933000171,
                         314933000172), d=rnorm(10), key='upc')

DT[, list(length=length(d)), keyby=upc]

Output with 1.9.2 is:
> DT[, list(length=length(d)), keyby=upc]
            upc length
1: 301426027592      2
2: 314775802939      2
3: 314775803490      2
4: 314775815510      2
5: 314933000171      2

Instead of:
> DT[, list(length=length(d)), keyby=upc]
             upc length
 1: 301426027592      1
 2: 301426027593      1
 3: 314775802939      1
 4: 314775802940      1
 5: 314775803490      1
 6: 314775803491      1
 7: 314775815510      1
 8: 314775815511      1
 9: 314933000171      1
10: 314933000172      1
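
For anyone who wants to see why the pairs collapse without installing 1.9.2, here is a rough base-R sketch of what "rounding the significand to the last 2 bytes" does to these values. The helper round_significand() is my own hypothetical name, not a data.table function, and I am assuming round-half-up on the low 16 bits of the 52-bit significand; the actual bit twiddle inside data.table may differ in detail.

```r
# Hypothetical helper (not part of data.table): emulate rounding away the
# lowest `bytes` bytes of a double's 52-bit significand, rounding half up.
# Assumes x is finite and non-zero.
round_significand <- function(x, bytes = 2L) {
  e    <- floor(log2(abs(x)))       # binary exponent of each value
  step <- 2^(e - 52 + 8 * bytes)    # spacing between values that survive rounding
  floor(x / step + 0.5) * step      # snap each value to the nearest multiple of step
}

upc <- c(301426027592, 301426027593, 314775802939, 314775802940,
         314775803490, 314775803491, 314775815510, 314775815511,
         314933000171, 314933000172)

rounded <- round_significand(upc)

length(unique(upc))      # 10 distinct identifiers before rounding
length(unique(rounded))  # 5 after rounding, one per collapsed pair
```

Under that assumption only 37 significant bits are left, which covers integers exactly up to 2^37 (about 1.37e11). These UPCs sit between 2^38 and 2^39, so they get snapped to multiples of 4, and each adjacent pair above lands on the same multiple.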
