Hi Ravi,

Did you try fixing the problem yourself? What did you try, and what went wrong?

The answer is probably

A <- as.data.table(A)
A[ , g15 := cumsum(ifelse(is.na(Time_Diff > 12), 0, Time_Diff > 12))]
A[ , flag_1 := 1:.N, by = c("customer", "g15")]
A[ , g15 := NULL]

but you would have learned more if you had at least tried getting
there yourself.

Best,
Ista

On Sun, Sep 20, 2015 at 6:19 AM, Ravi Teja <raviteja2...@gmail.com> wrote:
> Hi Ista.
>
> Thanks a ton for the response and your assumptions were right.
>
> If the Time_Diff is missing then flag_1 value should be 1
> if the Time_Diff is > 12 then flag_1 value should be 1
> if the Time_Diff is < 12 the flag_1 value should be (if the current row is i
> then flag_1 value should be (flag_1[i-1] + 1) )
>
> When I tried to apply the logic you had shared, the results are deviating
> from the expected results.
>
> I think the logic you had shared will not function if there are two
> successive rows with Time_Diff values > 12
>
> I have attached a sample of my original data set and the expected flag_1
> column to this mail.
>
> Please help in tweaking your code to generate the attached result.
>
> Awaiting your reply
>
> Thanks,
> Ravi
>
> On Sun, Sep 20, 2015 at 8:18 AM, Ista Zahn <istaz...@gmail.com> wrote:
>>
>> This assumes that the data are sorted by customer, and that only the
>> first value of Time_Diff is missing for each customer (and that the
>> first value is always missing for each customer).
>> If those assumptions hold you can do something like
>>
>> A <- read.table(text = "customer Time_Diff flag_1
>> 1 NA 1
>> 1 10 2
>> 1 8 3
>> 1 15 1
>> 1 9 2
>> 1 10 3
>> 2 NA 1
>> 2 2 2
>> 2 5 3",
>> header = TRUE)
>>
>> A$flag_1 <- NULL
>>
>> library(data.table)
>>
>> A <- as.data.table(A)
>> A[ , g15 := cumsum(c(0, ifelse(is.na(diff(Time_Diff > 12)), 0,
>>                                diff(Time_Diff > 12) > 0)))]
>> ## I'm not proud of the previous line, probably there is a cleaner way
>> A[ , flag_1 := 1:.N, by = c("customer", "g15")]
>> A[ , g15 := NULL]
>>
>> Best,
>> Ista
>>
>> On Sat, Sep 19, 2015 at 5:09 PM, Ravi Teja <raviteja2...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am trying to apply the below logic to generate flag_1 column on a data
>> > set consisting of ~1.2 million records in R.
>> >
>> > Code :
>> >
>> > for(i in 1: nrows)
>> > {
>> >   if(A$customer[i]==A$customer[i+1])
>> >   {
>> >
>> >     if(is.na(A$Time_Diff[i]))
>> >       A$flag_1[i] <- 1
>> >     else if (A$Time_Diff[i] > 12)
>> >       A$flag_1[i] <- 1
>> >     else
>> >       A$flag_1[i] <- A$flag_1[i-1]+1
>> >
>> >   }
>> >
>> >   else
>> >   {
>> >
>> >     if(is.na(A$Time_Diff[i]))
>> >       A$flag_1[i] <- 1
>> >     else if (A$Time_Diff[i] > 12)
>> >       A$flag_1[i] <- 1
>> >     else
>> >       A$flag_1[i] <- A$flag_1[i-1]+1
>> >
>> >   }
>> > }
>> >
>> >
>> > Resultant dataset should look like
>> >
>> > Customer Time_diff flag_1
>> > 1 NA 1
>> > 1 10 2
>> > 1 8 3
>> > 1 15 1
>> > 1 9 2
>> > 1 10 3
>> > 2 NA 1
>> > 2 2 2
>> > 2 5 3
>> >
>> > The above logic will take approximately 60 hours to generate the flag_1
>> > column on a dataset consisting of ~1.2 million records. Is there a more
>> > efficient way to implement this logic in R?
>> >
>> > Appreciate your help.
>> >
>> > Thanks,
>> > Ravi
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> raviteja
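
For readers following the thread: the rules Ravi lists (flag_1 restarts at 1 when Time_Diff is missing or greater than 12, otherwise it increments) can also be sketched in base R without a loop. This is a minimal sketch, not code posted in the thread. The function name flag_runs and its gap argument are hypothetical. It assumes the rows are already sorted by customer, that there is at least one row, and that a Time_Diff of exactly 12 increments the counter (the thread does not say what should happen in that case).

```r
# Hypothetical helper (not from the thread): restart a counter at 1 whenever
# the customer changes, Time_Diff is missing, or Time_Diff exceeds `gap`.
# Assumes at least one row and that rows are already sorted by customer.
flag_runs <- function(customer, time_diff, gap = 12) {
  n <- length(customer)
  # TRUE on the first row of each customer
  customer_changed <- c(TRUE, customer[-1] != customer[-n])
  # is.na() comes before the comparison, so a missing Time_Diff yields TRUE
  # rather than NA (TRUE | NA is TRUE in R)
  new_run <- customer_changed | is.na(time_diff) | time_diff > gap
  # Every TRUE opens a new run, so successive large gaps get separate runs
  run_id <- cumsum(new_run)
  # Count 1, 2, 3, ... within each run
  sequence(rle(run_id)$lengths)
}

# The sample data from the original question
A <- data.frame(customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
                Time_Diff = c(NA, 10, 8, 15, 9, 10, NA, 2, 5))
A$flag_1 <- flag_runs(A$customer, A$Time_Diff)
print(A$flag_1)
# 1 2 3 1 2 3 1 2 3
```

Because every row that opens a run increments the run counter, two successive rows with Time_Diff > 12 each receive flag_1 = 1, which is the case Ravi reported as failing with the diff()-based version.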