You might try the following function. It first identifies the last element of
each run, then computes the length of each run, then calls sequence() to
generate the within-run sequence numbers. my.sequence is a version of sequence
that uses less time and less memory than sequence when there are lots of short
runs: sequence() calls lapply, which builds a memory-consuming list and then
unlists it, whereas my.sequence avoids the big intermediate list.
For your data, f(data) produces the same thing as data$conditional_time.
f <- function(data, use.my.sequence=FALSE) {
    n <- nrow(data)
    lastInRun <- with(data, eif | c(id[-1] != id[-n], TRUE))
    runLengths <- diff(c(0L, which(lastInRun)))
    if (use.my.sequence) {
        my.sequence <- function(nvec)
            seq_len(sum(nvec)) - rep.int(c(0L, cumsum(nvec[-length(nvec)])), nvec)
        my.sequence(runLengths)
    } else {
        sequence(runLengths)
    }
}
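As a check, the steps inside f can be traced on the example table from the original post. This is only a sketch: the data frame `test` is rebuilt by hand from that table, and the intermediate names simply mirror the ones used in f.

```r
# Rebuild the example table from the original post (28 rows, two ids).
test <- data.frame(
  year = rep(1990:2003, times = 2),
  id   = rep(c(1010, 2010), each = 14),
  eif  = 0
)
test$eif[c(3, 8, 26)] <- 1   # events: id 1010 in 1992 and 1997, id 2010 in 2001

n <- nrow(test)
# A run ends on an event row, or on the last row of each id.
lastInRun  <- with(test, eif | c(id[-1] != id[-n], TRUE))
# Distance between consecutive run ends gives the run lengths.
runLengths <- diff(c(0L, which(lastInRun)))
runLengths                       # 3 5 6 12 2

# Count 1, 2, ... within each run.
conditional_time <- sequence(runLengths)

# The vectorised alternative to sequence() gives the same numbers.
my.sequence <- function(nvec)
  seq_len(sum(nvec)) - rep.int(c(0L, cumsum(nvec[-length(nvec)])), nvec)
identical(as.integer(conditional_time), as.integer(my.sequence(runLengths)))
```

Like f itself, this assumes the rows are already sorted by id and then by year.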
Bill Dunlap, Spotfire Division, TIBCO Software Inc.
----------------------------------------
Hi everyone,
Please forgive me if my question is simple and my code terrible; I'm new to
R. I am not looking for a ready-made answer, but I would really appreciate
it if someone could share conceptual hints for programming, or point me
toward an R function/package that could speed up my processing time.
Thanks a lot for your help!
##
My dataframe includes the variables 'year', 'id', and 'eif' and has roughly
1.9 million id-year observations.
I would like to do 2 things:
-1- I want to create a 'conditional_time' variable, which increases in
increments of 1 every year, but which resets during year(t) if event 'eif'
occurred for this 'id' at year(t-1). It should also reset when we switch to a
new 'id'. For example:
dataframe = test
year  id    eif  conditional_time
1990  1010  0    1
1991  1010  0    2
1992  1010  1    3
1993  1010  0    1
1994  1010  0    2
1995  1010  0    3
1996  1010  0    4
1997  1010  1    5
1998  1010  0    1
1999  1010  0    2
2000  1010  0    3
2001  1010  0    4
2002  1010  0    5
2003  1010  0    6
1990  2010  0    1
1991  2010  0    2
1992  2010  0    3
1993  2010  0    4
1994  2010  0    5
1995  2010  0    6
1996  2010  0    7
1997  2010  0    8
1998  2010  0    9
1999  2010  0    10
2000  2010  0    11
2001  2010  1    12
2002  2010  0    1
2003  2010  0    2
-2- In a copy of the original dataframe, drop all id-year rows that
correspond to years after a given id has experienced its first 'eif' event.
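One way -2- might be approached is sketched here on the example table; it is not taken from the thread. It assumes that the event year itself is kept and only later years are dropped, and the names `priorEvents` and `first_spell` are made up for illustration.

```r
# Rebuild the example table (28 rows, two ids).
test <- data.frame(
  year = rep(1990:2003, times = 2),
  id   = rep(c(1010, 2010), each = 14),
  eif  = 0
)
test$eif[c(3, 8, 26)] <- 1   # events: id 1010 in 1992 and 1997, id 2010 in 2001

# Sort first so that "earlier rows" really are earlier years within each id.
test <- test[order(test$id, test$year), ]

# For each row, count the events in strictly earlier rows of the same id.
priorEvents <- ave(test$eif, test$id, FUN = function(x) cumsum(x) - x)

# Keep rows up to and including the first event year (an assumption).
first_spell <- test[priorEvents == 0, ]
table(first_spell$id)
```

If the event year should be dropped as well, replacing `cumsum(x) - x` with `cumsum(x)` would do that under the same assumptions.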
I have written the code below to take care of -1-, but it is incredibly
inefficient. Given the size of my database, and considering how slow my
computer is, I don't think it's practical to use it. Also, it depends on
correct sorting of the dataframe, which might generate errors.
##
for (i in 1:nrow(test)) {
    if (i == 1) {                               # If first id-year
        cond_time <- 1
        test[i, 4] <- cond_time
    } else if (test[i-1, 2] != test[i, 2]) {    # If new id (column 2 is 'id')
        cond_time <- 1
        test[i, 4] <- cond_time
    } else {                                    # Same id as previous row
        if (test[i, 3] == 0) {                  # No event this year
            test[i, 4] <- sum(cond_time, 1)
            cond_time <- test[i, 4]             # Carry the counter forward
        } else {                                # Event: next year restarts at 1
            test[i, 4] <- sum(cond_time, 1)
            cond_time <- 0
        }
    }
}
--
Vincent Arel
M.A. Student, McGill University
[[alternative HTML version deleted]]
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.