On 04/03/14 23:49, James Sams wrote:
I suspect there are plenty of data.table users who use UPCs and other
large integer-like doubles as identifiers in their data. Storing UPCs
as character data takes up an order of magnitude more space than
storing them as doubles; that is not an acceptable alternative for a
1.5 billion row table, i.e. roughly 11 GiB of RAM just for the UPCs as
doubles (*crosses fingers for long vector support*).
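To put rough numbers on the memory claim, here is a small base-R sketch (my illustration, not from the thread; exact sizes vary by R version and platform, and the one-million-element vectors stand in for a much larger table):

```r
## Compare memory for one million UPC-like IDs stored as double vs character.
## The absolute sizes are illustrative; the ratio is what matters.
n <- 1e6
upc_num <- 301426027592 + seq_len(n)  # doubles: 8 bytes per element
upc_chr <- as.character(upc_num)      # pointer per element plus one string each

print(object.size(upc_num))  # about 8 MB
print(object.size(upc_chr))  # several times larger
```

Each distinct string carries its own heap allocation on top of the pointer vector, which is where the order-of-magnitude blowup comes from.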
However, the newest data.table release breaks that (see the example
below). The developers are aware of this, but I guess speed for
imprecise numbers is a higher priority than correct results for people
using data with large IDs.
I knew of such IDs, but I hadn't fully connected that numeric was
currently being used for them and relied on the old tolerance value. In
my mind, such IDs are what we've been working on integer64 for, which
is what the sweeping changes to sorting have been leading up to. The
new radix sort for integer can now be applied to integer64, which seems
to be the right type for UPCs. Yike is having a look at that. I'll see
if I can quickly add the option to do full 8-byte radix passes
optionally (it isn't just a single number somewhere, otherwise the
option would have been trivial).
Matt
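A base-R aside on why integer64 is the natural target (my illustration, not from the thread): doubles represent every integer exactly up to 2^53, so 12-digit UPCs are stored exactly in memory; it is the within-tolerance grouping, not the storage, that merges adjacent IDs, and a true 64-bit integer type sidesteps tolerance altogether.

```r
## 12-digit UPCs are well inside the 2^53 range where doubles are exact,
## so adjacent IDs really are distinct values in memory...
a <- 301426027592
stopifnot(a + 1 != a)        # exact at UPC magnitudes

## ...but double precision for integers runs out at 2^53, which is why a
## 64-bit integer type (e.g. bit64::integer64) suits such identifiers.
print(2^53 + 1 == 2^53)      # TRUE in double arithmetic
```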
In any case, I thought people should be more aware of this, and maybe
someone would have a suggested workaround. I'm currently stuck at SVN
r1129 because I was hitting some crashing bugs in 1.8.10.
For the interested, you can track the feature request at:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978
The relevant NEWS item:
Numeric data is still joined and grouped within tolerance as before,
but instead of the tolerance being sqrt(.Machine$double.eps) ==
1.490116e-08 (the same as base::all.equal's default), the significand
is now rounded to the last 2 bytes, apx 11 s.f. This is more
appropriate for large (1.23e20) and small (1.23e-20) numerics and is
faster via a simple bit twiddle. A few functions provided a
'tolerance' argument but this wasn't being passed through, so it has
been removed. We aim to add a global option (e.g. 2, 1 or 0 byte
rounding) in a future release.
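To see why rounding the significand to about 11 significant figures merges adjacent UPCs while the old tolerance did not, here is a base-R approximation (my sketch: signif() stands in for the exact 2-byte bit twiddle, and the old sqrt(.Machine$double.eps) tolerance is taken as an absolute difference):

```r
## Two adjacent UPCs from the reproducer below.
a <- 301426027592
b <- 301426027593

## Rounding to ~11 s.f. lands both on the same value, so they group together:
print(signif(a, 11) == signif(b, 11))    # TRUE

## Under the old tolerance the difference of 1 kept them distinct:
print(abs(a - b) > sqrt(.Machine$double.eps))  # TRUE (1 > 1.490116e-08)
```

A 12-digit ID simply has more significant figures than the new rounding preserves, so consecutive IDs collapse onto one representative.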
library(data.table)
## Ten distinct 12-digit UPCs, in five adjacent pairs differing by 1.
DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
                         314775802940, 314775803490, 314775803491,
                         314775815510, 314775815511, 314933000171,
                         314933000172), d = rnorm(10), key = 'upc')
DT[, list(length = length(d)), keyby = upc]
Output with 1.9.2 is:
> DT[, list(length=length(d)), keyby=upc]
upc length
1: 301426027592 2
2: 314775802939 2
3: 314775803490 2
4: 314775815510 2
5: 314933000171 2
Instead of:
> DT[, list(length=length(d)), keyby=upc]
upc length
1: 301426027592 1
2: 301426027593 1
3: 314775802939 1
4: 314775802940 1
5: 314775803490 1
6: 314775803491 1
7: 314775815510 1
8: 314775815511 1
9: 314933000171 1
10: 314933000172 1
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help