Thanks; I've committed Ivan'x fix to R-devel in r89248.

Best,

luke

On Sat, 27 Dec 2025, Benjamin Schwendinger wrote:

Hello! data.table diverged over 7 years ago from this code when Matt
rewrote forder.

This code was originally contributed by data.table. I believe Michael
Lawrence handled the integration at the time. There were a number of
issues like this early on that were resolved on the R side and I
believe contributed back to data.table. If you have the energy it
might be good to compare the two now and see if there are things that
should be ported from one to the other.

The original code had the following note which is also mentioned
multiple times but not explicitly part of radixsort.c

Note on challenges implementing `nalast=NA` so that things are under the same 
perspective when looked at later. To go to the point, there are four challenges 
in implementing `nalast=NA` with the current form of forder.c (which is 
excellent!):

  1) Handling the i|d|csorted routines for nalast=NA:
     The current implementation of nalast=NA in *sorted routine is that if all 
values are NAs, then it returns -2, a different value, so that all values of 
o/osub can be replacedw with 0. If any values are NA, then we assume that this 
vector is unsorted and return 0 to be taken care of by the corresponding sort 
routine. Now, this may very well be questioned/improved. For ex: we could 
detect if the vector is sorted and also if the first value is NA, and if so, 
get the group size of these NAs and replace that many indices with 0. This is 
not yet done, but could very well be implemented later. If there are no NAs in 
the vector, then things are the same.

  2) Handling cases where n==1 and n==2:
     The current sorting routines don't take care of n<=2, and rightly so, as they can be dealt with 
directly. But it's not the case for nalast=NA because, when n==2 and all are NAs, point 1 will handle it. But 
when n==1 and it is NA and the column is not the first column to order upon, with the current implementation, 
"thisgrpn" is checked for "== 1" and if so, just pushes that group size and continues.. But 
we've to replace it with 0 in our case. So, nalast==0 is checked for inside that if statement and if so, if the 
value is NA is checked for all types and if so, replaced with 0. For n==2 and "any" is NA, we've to 
introduce a special case for this in each sort routine to take care of and not result in an error (which tells 
this should be already taken care of in *sorted).

  3) o[0] and newo[0] = -1:
     Since nalast already populates NAs with 0 values in o, we can't use the 
old value of 0 to check if it's already populated or not. Therefore this has 
been changed to -1.

  4) default cases are untouched as much as possible:
     In order to not compromise in speed for default case (`setkey`), iinsert and 
dinsert are not touched at all, which means, we've to make sure that they are not 
run when n<200 and nalast==0, as they don't replace o[x=NA] with 0. And, 
'icount', 'iradix', 'dradix' are modified to take care of nalast=NA, to my 
knowledge.

However, case 2) does not cater for Ivans identified corner case
example of order(NA_character_, "c", method = "radix", na.last = NA),
which is not tied to n==1 or n==2 but rather to thisgrpn==1 and
x=c(NA_character_, another_string).

Hence, the missing case is indeed to cater for o[i] = 0 with Ivans
1-liner or more explicitly via

--- src/main/radixsort.c
@@ -1766,7 +1766,12 @@
                    // this edge case had to be taken care of
                    // here.. (see the bottom of this file for
                    // more explanation)
-                   switch (TYPEOF(x)) {
+            if (o[i] == 0) {
+                       isSorted = false;
+                       i++;
+                       push(1);
+                       continue;
+                   }               switch (TYPEOF(x)) {
                    case INTSXP:
                        if (INTEGER(x)[o[i] - 1] == NA_INTEGER) {
                            isSorted = false;

Best, Ben

        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   [email protected]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu/

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to