On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:


aggregate() is indeed a useful function in this case, but it only returns the
columns by which it was grouped.  Is there a way I can use this while
simultaneously retaining all the other column values in the dataframe?

e.g. add superfluous (yet pertinent for later) column containing any
information at all and retain it in the final output

I had exactly this kind of need many times, and I have finally created a function for it, which I hope to include soon in an upcoming package. Here is a run of it (I added an extra "A" column containing just the numbers 1:8):

> DF
  id      day diff A
1  1 01-01-09  0.5 1
2  1 01-01-09  0.7 2
3  2 01-01-09  0.2 3
4  2 01-01-09  0.4 4
5  1 01-02-09  0.1 5
6  1 01-02-09  0.3 6
7  2 01-02-09  0.3 7
8  2 01-02-09  0.4 8
> byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
  diff A id      day
1  0.5 1  1 01-01-09
2  0.2 3  2 01-01-09
3  0.1 5  1 01-02-09
4  0.3 7  2 01-02-09

Would that do what you want?

I've appended the function byDataFrame, and its prerequisite, a function parseIndexList. I'm not quite set on the names yet, but anyway. Hope this helps. I haven't really tested it on large sets, it might perform poorly. Any suggestions on speeding the code / corrections are welcome.

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College



parseIndexList <- function(indexList) {
  # browser()
  if (!is.list(indexList))
    indexList <- as.list(indexList)
  nI <- length(indexList)
  namelist <- vector("list", nI)
  names(namelist) <- names(indexList)
  extent <- integer(nI)
  nx <- length(indexList[[1]])
  one <- as.integer(1)
  group <- rep.int(one, nx)
  ngroup <- one
  for (i in seq.int(indexList)) {
      index <- as.factor(indexList[[i]])
      if (length(index) != nx)
          stop("arguments must have same length")
      namelist[[i]] <- sort(unique(indexList[[i]]))
      extent[i] <- length(namelist[[i]])
      group <- group + ngroup * (as.integer(index) - one)
      ngroup <- ngroup * nlevels(index)
  }
  nms <- do.call(expand.grid, namelist)
  ind <- unique(sort(group))
  res <- data.frame(index=ind, nms[ind, , drop=FALSE])
  return(list(cases=group, groups=res))
}

byDataFrame <- function (data, INDEX, FUN, newnames, omit.index.cols=TRUE, ...) {
# # Part of the code shamelessly stolen from tapply
  IND <- eval(substitute(INDEX), data)
  nms <- as.character(as.list(substitute(INDEX)))
  if (!is.list(IND)) {
    IND <- list(IND)
    names(IND) <- nms
  } else {
    names(IND) <- nms[-1]
  }
  funname <- paste(as.character(substitute(FUN)), collapse=".")
  indexInfo <- parseIndexList(IND)
  FUNx <- if (omit.index.cols) {
    omit.cols <- match(names(indexInfo$groups)[-1], names(data))
    function(x, ...) FUN(data[x, -omit.cols], ...)
  } else {
    function(x, ...) FUN(data[x, ], ...)
  }
  ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
  index <- as.numeric(names(ans))
  if (!is.data.frame(ans[[1]])) {
    ans <- lapply(ans, function(x) {
      dframe <- as.data.frame(t(x))
      if (is.null(names(x)))
        names(dframe) <- funname
      dframe
    })
  }
  lengths <- sapply(ans, nrow)
  ans <- do.call(rbind, ans)
  if (!missing(newnames))
    names(ans) <- newnames
  nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
  res <- cbind(ans, nms)
  res
}

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to