On 04/03/14 23:49, James Sams wrote:
I suspect there are plenty of data.table users who use UPCs and other large integer-like doubles as identifiers in their data. Storing UPCs as character data takes up an order of magnitude more space than storing them as doubles, which is not really an acceptable alternative for a 1.5-billion-row table that already needs about 10 GiB of RAM just for the UPCs as doubles (*crosses fingers for long vector support*).
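
As a rough, illustrative check of that size claim on a smaller vector (the 3e11 range here is a stand-in for real UPCs, not the original data):

n <- 1e6
upc_num <- floor(runif(n, 3.0e11, 3.2e11))   # 12-digit codes as doubles
upc_chr <- as.character(upc_num)             # the same codes as strings
object.size(upc_num)   # 8 bytes per element, ~8 MB
object.size(upc_chr)   # roughly an order of magnitude more: a pointer plus a cached string per element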

However, the newest data.table breaks that (see example below). The developers are aware of this, but I guess speed for imprecise numbers is a higher priority than proper results for people using data with large IDs.

I knew of such ids, but I hadn't fully connected that numeric was currently being used for them, relying on the old tolerance value. In my mind, such ids are what we've been working on integer64 for, and that is what the sweeping changes to sorting have been leading up to. The new radix sort for integer can now be applied to integer64, which seems to be the right type for UPCs. Yike is having a look at that. I'll see if I can quickly add the option to do full 8-byte radix passes (it isn't just a single number somewhere, otherwise the option would have been trivial).
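
For illustration only, a minimal sketch of that integer64 route, assuming the bit64 package and a data.table build in which the new radix sort accepts integer64 keys (the work described above):

library(bit64)
library(data.table)

# each UPC is stored exactly in 8 bytes, so equality (and therefore
# grouping and joining) is exact rather than within-tolerance
DT64 <- data.table(upc = as.integer64(c("301426027592", "301426027593")),
                   d = rnorm(2))
DT64[, .N, by = upc]   # two groups of 1; no tolerance involved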

Matt


In any case, I thought people should be more aware of this, and maybe someone has a suggested workaround. I'm currently stuck at SVN r1129 because I was hitting crashing bugs in 1.8.10.

For the interested, you can track the feature request at:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978

The relevant NEWS item:
Numeric data is still joined and grouped within tolerance as before, but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default), the significand is now rounded to the last 2 bytes, approximately 11 significant figures. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument, but it wasn't being passed through, so it has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.
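
To see what 2-byte rounding means in practice, here is a rough R sketch. It is a simplified truncation, not data.table's actual rounding code, and it assumes little-endian doubles; clearing the 16 least-significant significand bits is enough to make consecutive 12-digit UPCs indistinguishable:

round2bytes <- function(x) {
  b <- writeBin(x, raw(), endian = "little")
  # zero the 16 least-significant significand bits of each 8-byte double
  idx <- rep(seq(0L, length(b) - 8L, by = 8L), each = 2) + 1:2
  b[idx] <- as.raw(0)
  readBin(b, "double", n = length(x), endian = "little")
}
round2bytes(301426027592) == round2bytes(301426027593)   # TRUE: the two ids collide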


library(data.table)
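
# ten distinct 12-digit UPCs stored as doubles, keyed on upc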
DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
                         314775802940, 314775803490, 314775803491,
                         314775815510, 314775815511, 314933000171,
                         314933000172), d=rnorm(10), key='upc')

DT[, list(length=length(d)), keyby=upc]

Output with 1.9.2 is:
> DT[, list(length=length(d)), keyby=upc]
            upc length
1: 301426027592      2
2: 314775802939      2
3: 314775803490      2
4: 314775815510      2
5: 314933000171      2

Instead of:
> DT[, list(length=length(d)), keyby=upc]
             upc length
 1: 301426027592      1
 2: 301426027593      1
 3: 314775802939      1
 4: 314775802940      1
 5: 314775803490      1
 6: 314775803491      1
 7: 314775815510      1
 8: 314775815511      1
 9: 314933000171      1
10: 314933000172      1

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

