On 04/03/14 23:49, James Sams wrote:
I suspect there are plenty of data.table users who use UPCs and other large integer-like doubles as identifiers in their data. Storing UPCs as character data takes up an order of magnitude more space than storing them as doubles, which is not really an acceptable alternative for a 1.5-billion-row table that already needs about 10 GiB of RAM just for the UPCs as doubles (*crosses fingers for long vector support*).
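
As a rough, illustrative check of that size claim on a smaller vector (the 3e11 range here is a stand-in for real UPCs, not the original data):

n <- 1e6
upc_num <- floor(runif(n, 3.0e11, 3.2e11))   # 12-digit codes as doubles
upc_chr <- as.character(upc_num)             # the same codes as strings
object.size(upc_num)   # 8 bytes per element, ~8 MB
object.size(upc_chr)   # roughly an order of magnitude more: a pointer plus a cached string per element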

However, the newest data.table breaks that (see example below). The developers are aware of this, but I guess speed for imprecise numbers is a higher priority than proper results for people using data with large IDs.

I knew of such ids, but I hadn't fully connected that numeric was currently being used for them, relying on the old tolerance value. In my mind, such ids are what we've been working on integer64 for, and that is what the sweeping changes to sorting have been leading up to. The new radix sort for integer can now be applied to integer64, which seems to be the right type for UPCs. Yike is having a look at that. I'll see if I can quickly add the option to do full 8-byte radix passes (it isn't just a single number somewhere, otherwise the option would have been trivial).
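
For illustration only, a minimal sketch of that integer64 route, assuming the bit64 package and a data.table build in which the new radix sort accepts integer64 keys (the work described above):

library(bit64)
library(data.table)

# each UPC is stored exactly in 8 bytes, so equality (and therefore
# grouping and joining) is exact rather than within-tolerance
DT64 <- data.table(upc = as.integer64(c("301426027592", "301426027593")),
                   d = rnorm(2))
DT64[, .N, by = upc]   # two groups of 1; no tolerance involved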

Matt


In any case, I thought people should be more aware of this, and maybe someone has a suggested workaround. I'm currently stuck at SVN r1129 because I was hitting crashing bugs in 1.8.10.

For the interested, you can track the feature request at:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978

The relevant NEWS item:
Numeric data is still joined and grouped within tolerance as before, but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default), the significand is now rounded to the last 2 bytes, approximately 11 significant figures. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument, but it wasn't being passed through, so it has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.
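
To see what 2-byte rounding means in practice, here is a rough R sketch. It is a simplified truncation, not data.table's actual rounding code, and it assumes little-endian doubles; clearing the 16 least-significant significand bits is enough to make consecutive 12-digit UPCs indistinguishable:

round2bytes <- function(x) {
  b <- writeBin(x, raw(), endian = "little")
  # zero the 16 least-significant significand bits of each 8-byte double
  idx <- rep(seq(0L, length(b) - 8L, by = 8L), each = 2) + 1:2
  b[idx] <- as.raw(0)
  readBin(b, "double", n = length(x), endian = "little")
}
round2bytes(301426027592) == round2bytes(301426027593)   # TRUE: the two ids collide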


library(data.table)
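
# ten distinct 12-digit UPCs stored as doubles, keyed on upc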
DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
                         314775802940, 314775803490, 314775803491,
                         314775815510, 314775815511, 314933000171,
                         314933000172), d=rnorm(10), key='upc')

DT[, list(length=length(d)), keyby=upc]

Output with 1.9.2 is:
> DT[, list(length=length(d)), keyby=upc]
            upc length
1: 301426027592      2
2: 314775802939      2
3: 314775803490      2
4: 314775815510      2
5: 314933000171      2

Instead of:
> DT[, list(length=length(d)), keyby=upc]
             upc length
 1: 301426027592      1
 2: 301426027593      1
 3: 314775802939      1
 4: 314775802940      1
 5: 314775803490      1
 6: 314775803491      1
 7: 314775815510      1
 8: 314775815511      1
 9: 314933000171      1
10: 314933000172      1

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

