Re: [Rd] sum() returns NA on a long logical vector when nb of TRUE values exceeds 2^31

Martin Maechler Mon, 05 Feb 2018 04:43:51 -0800

>>>>> Martin Maechler <maech...@stat.math.ethz.ch>
>>>>>     on Thu, 1 Feb 2018 16:34:04 +0100 writes:


> >>>>> Hervé Pagès <hpa...@fredhutch.org>
> >>>>>     on Tue, 30 Jan 2018 13:30:18 -0800 writes:
> 
>     > Hi Martin, Henrik,
>     > Thanks for the follow up.
> 
>     > @Martin: I vote for 2) without *any* hesitation :-)
> 
>     > (and uniformity could be restored at some point in the
>     > future by having prod(), rowSums(), colSums(), and others
>     > align with the behavior of length() and sum())
> 
> As a matter of fact, I had procrastinated and worked at
> implementing '2)' already a bit on the weekend and made it work
> - more or less.  It needs a bit more work, and I had also been considering
> replacing the numbers in the current overflow check
> 
>       if (ii++ > 1000) {                                              \
>           ii = 0;                                                     \
>           if (s > 9000000000000000L || s < -9000000000000000L) {      \
>               if(!updated) updated = TRUE;                            \
>               *value = NA_INTEGER;                                    \
>               warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
>               return updated;                                         \
>           }                                                           \
>       }                                                               \
> 
> i.e. I thought about tweaking the '1000' and '9000000000000000L',
> but decided to leave them as they are, for the moment, and to add
> comments there explaining why.
> They may look arbitrary, but are not at all: if you multiply
> them (which looks correct, given that we check the sum 's' only every
> 1000-th time ... though I am still not sure they *are* correct) you
> get 9*10^18, which is only slightly smaller than 2^63 - 1, which may
> be the maximal "LONG_INT" integer we have.
> 
> So, in the end, at least for now, we do not quite go all the way
> but overflow a bit earlier,... but do potentially gain a bit of
> speed, notably with the ITERATE_BY_REGION(..) macros
> (which I did not show above).
> 
> Will hopefully become available in R-devel real soon now.
>
> Martin
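
The periodic overflow check quoted above can be sketched as a standalone C function. This is only an illustration under stated assumptions, not R's actual source: the names checked_isum, CHECK_EVERY and SUM_LIMIT are hypothetical, the limit is passed as a parameter so the behaviour can be tried on small inputs, and a plain return code stands in for NA_INTEGER plus the warning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; the two values mirror the
 * '1000' and '9000000000000000L' in the quoted macro. */
#define CHECK_EVERY 1000
#define SUM_LIMIT   INT64_C(9000000000000000)

/* Sum n 32-bit integers into a 64-bit accumulator, testing the running
 * total against +/- limit only about once per CHECK_EVERY elements.
 * Returns 1 and stores the sum in *out on success; returns 0 (leaving
 * *out untouched) if a periodic check finds the total beyond the limit. */
int checked_isum(const int32_t *x, size_t n, int64_t limit, int64_t *out)
{
    assert(x != NULL || n == 0);
    int64_t s = 0;
    int ii = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (ii++ > CHECK_EVERY) {   /* same counter idiom as the macro */
            ii = 0;
            if (s > limit || s < -limit)
                return 0;           /* caller would report "integer overflow" */
        }
    }
    *out = s;
    return 1;
}
```

A caller would pass SUM_LIMIT as the limit. Since each element changes the total by at most 2^31 in magnitude, the total cannot move far between two checks, which is what makes the infrequent test cheap but still safe in this sketch.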

After finishing that... I challenged myself to do better, namely to
have "no overflow" at all (from large or many integer/logical values),
and so introduced  irsum()  which uses a double precision accumulator
for integer/logical  ... but which would really only be used when the
64-bit int accumulator gets close to overflow.
The resulting code is not really beautiful, and also contains a
comment     " (a waste, rare; FIXME ?) "
If anybody feels like finding a more elegant version without the
"waste" case, go ahead and be our guest ! 
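
The idea of falling back to a double precision accumulator only when the 64-bit integer one gets close to overflow might look roughly like the following. This is a hedged sketch, not the irsum() code in R-devel: the function name isum_with_fallback, the CHECK_EVERY constant and the limit parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHECK_EVERY 1000   /* hypothetical; mirrors the '1000' above */

/* Sum n 32-bit integers exactly in a 64-bit accumulator for as long as
 * the running total stays within +/- limit at the periodic checks.
 * Once a check finds it beyond the limit, finish the remaining elements
 * in a double accumulator instead of giving up. */
double isum_with_fallback(const int32_t *x, size_t n, int64_t limit)
{
    assert(x != NULL || n == 0);
    int64_t s = 0;
    int ii = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (ii++ > CHECK_EVERY) {
            ii = 0;
            if (s > limit || s < -limit) {
                /* Rare path: carry the integer total over to a double
                 * and accumulate the rest of the vector there. */
                double d = (double) s;
                for (size_t j = i + 1; j < n; j++)
                    d += x[j];
                return d;
            }
        }
    }
    return (double) s;
}
```

In this sketch the common case never touches the double accumulator. When the fallback does trigger, the result can no longer be guaranteed exact, since a double represents every integer exactly only up to 2^53.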

Testing the code does need access to a platform with enough GB
RAM, say 32 (and I have run the checks only on servers with >
100 GB RAM). This concerns the new checks at the (current) end
of <R-devel_R>/tests/reg-large.R

In R-devel svn rev >= 74208  for a few minutes now.

Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
