Hi all,

sample() has some well-documented undesirable behaviour.

sample(1:6, 1)
sample(2:6, 1)
...
sample(5:6, 1)

do what you expect, but

sample(6:6, 1)
sample(1:6, 1)

do the same thing.

This behaviour is documented:

     If 'x' has length 1, is numeric (in the sense of 'is.numeric') and
     'x >= 1', sampling _via_ 'sample' takes place from '1:x'.  _Note_
     that this convenience feature may lead to undesired behaviour when
     'x' is of varying length 'sample(x)'.  See the 'resample()'
     example below.
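
To see the effect concretely, here is a small sketch that only uses the existing sample().  The seed and the number of draws are arbitrary choices of mine for illustration:

```r
# Draw many single samples and look at which values can appear.
set.seed(1)  # arbitrary seed, only so the run is repeatable

draws_two <- replicate(1000, sample(5:6, 1))  # x has length 2
draws_one <- replicate(1000, sample(6:6, 1))  # x has length 1

sort(unique(draws_two))  # only values from the set {5, 6}
sort(unique(draws_one))  # values from 1:6, not just 6
```

The second call never sees a "set" at all: 6:6 is a single positive number, so it is taken to be a set size.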

My proposal is to add an extra parameter is.set to sample() to control
this behaviour.  If the parameter is unspecified, then we keep the old
behaviour for compatibility.  If it is TRUE, then we treat the first
parameter x as a set.  If it is FALSE, then we treat it as a set size.
This means that

sample(6:6, 1, is.set=TRUE)

would return 6 with probability 1.
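
For anyone who wants to try the semantics without rebuilding R, here is a rough stand-alone sketch.  The name sample_set is hypothetical and this is not the attached patch: it is a user-level wrapper around the existing sample.int(), and it assumes that "unspecified" falls back to the test quoted from the documentation above (length 1, numeric, x >= 1).

```r
# Hypothetical helper, not part of base R and not the attached patch:
# mimics the proposed is.set argument on top of sample.int().
sample_set <- function(x, size, replace = FALSE, prob = NULL, is.set = NULL) {
    # The documented test for "x is a set size".
    is.natural <- function(x) length(x) == 1L && is.numeric(x) && x >= 1
    # Unspecified: fall back to the old heuristic for compatibility.
    if (is.null(is.set)) is.set <- !is.natural(x)
    if (!is.set) {
        # x is a set size: sample from 1:x.
        stopifnot(is.natural(x))
        if (missing(size)) size <- x
        sample.int(x, size, replace, prob)
    } else {
        # x is a set: sample indices, then pick the elements.
        if (missing(size)) size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}

sample_set(6:6, 1, is.set = TRUE)   # always 6
sample_set(6, 1, is.set = FALSE)    # one value from 1:6
sample_set(6:6, 1)                  # unspecified: old behaviour, one value from 1:6
```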

I have attached a patch to implement this new option.

Cheers,
Andrew
diff --git a/src/library/base/R/sample.R b/src/library/base/R/sample.R
index 8d22469..ddf7cf0 100644
--- a/src/library/base/R/sample.R
+++ b/src/library/base/R/sample.R
@@ -14,13 +14,17 @@
 #  A copy of the GNU General Public License is available at
 #  http://www.r-project.org/Licenses/
 
-sample <- function(x, size, replace=FALSE, prob=NULL)
+sample <- function(x, size, replace=FALSE, prob=NULL, is.set=NULL)
 {
-    if(length(x) == 1L && is.numeric(x) && x >= 1) {
+    is.natural <- function(x) length(x) == 1L && is.numeric(x) && x >= 1
+    if(is.null(is.set)) is.set <- !is.natural(x)
+    if(!is.set) {
+	stopifnot(is.natural(x))
 	if(missing(size)) size <- x
 	.Internal(sample(x, size, replace, prob))
     }
     else {
+	stopifnot(length(x) >= 1)
 	if(missing(size)) size <- length(x)
 	x[.Internal(sample(length(x), size, replace, prob))]
     }
diff --git a/src/library/base/man/sample.Rd b/src/library/base/man/sample.Rd
index 3929ff2..811fed2 100644
--- a/src/library/base/man/sample.Rd
+++ b/src/library/base/man/sample.Rd
@@ -12,26 +12,31 @@
   of \code{x} using either with or without replacement.
 }
 \usage{
-sample(x, size, replace = FALSE, prob = NULL)
+sample(x, size, replace = FALSE, prob = NULL, is.set = NULL)
 
 sample.int(n, size, replace = FALSE, prob = NULL)
 }
 \arguments{
   \item{x}{Either a (numeric, complex, character or logical) vector of
-    more than one element from which to choose, or a positive integer.}
+    elements from which to choose, or a positive integer.  The interpretation
+    depends on \code{is.set}, or on the heuristic described below.}
   \item{n}{a non-negative integer, the number of items to choose from.}
   \item{size}{positive integer giving the number of items to choose.}
   \item{replace}{Should sampling be with replacement?}
   \item{prob}{A vector of probability weights for obtaining the elements
     of the vector being sampled.}
+  \item{is.set}{Logical or \code{NULL}: is \code{x} a set of items to sample
+    from (\code{TRUE}) or a set size (\code{FALSE})?  See \sQuote{Details}.}
 }
 \details{
-  If \code{x} has length 1, is numeric (in the sense of
-  \code{\link{is.numeric}}) and \code{x >= 1}, sampling \emph{via}
-  \code{sample} takes place from
-  \code{1:x}.  \emph{Note} that this convenience feature may lead to
-  undesired behaviour when \code{x} is of varying length
-  \code{sample(x)}.  See the \code{resample()} example below.
+  The \code{is.set} parameter controls whether \code{x} is interpreted as a
+  set of items to sample from (when \code{is.set} is \code{TRUE}), or as the
+  size of that set (when \code{is.set} is \code{FALSE}), in which case
+  the sample set is \code{1:x}.  If \code{is.set} is unspecified, then
+  \code{is.set} is set to \code{FALSE} when \code{x} has length 1, is numeric
+  (in the sense of \code{\link{is.numeric}}) and \code{x >= 1}, and to
+  \code{TRUE} otherwise.  \emph{Note} that when \code{x} is a vector of varying
+  length, leaving \code{is.set} unspecified can lead to undesirable behaviour.
 
   By default \code{size} is equal to \code{length(x)}
   so that \code{sample(x)} generates a random permutation
@@ -93,13 +98,9 @@ x <- 1:10
     sample(x[x >  9]) # oops -- length 10!
 try(sample(x[x > 10]))# error!
 
-## This is safer, but only for sampling without replacement
-resample <- function(x, size, ...)
-  if(length(x) <= 1) { if(!missing(size) && size == 0) x[FALSE] else x
-  } else sample(x, size, ...)
-
-resample(x[x >  8])# length 2
-resample(x[x >  9])# length 1
-resample(x[x > 10])# length 0
+## This is safer
+sample(x[x >  8], is.set=TRUE)# length 2
+sample(x[x >  9], is.set=TRUE)# length 1
+sample(x[x > 10], is.set=TRUE)# length 0
 }
 \keyword{distribution}
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
