Hi,

On Nov 21, 2013, at 10:42 AM, "MacQueen, Don" <macque...@llnl.gov> wrote:

> I have some processes where I do the same thing, iterate over subsets of a
> data frame.
> My data frame has ~250,000 rows, 30 variables, and the subsets are such
> that there are about 6000 of them.
>
> Performing a which() statement like yours seems quite fast.
>
> For example, wrapping unix.time() around the which() expression, I get
>
>    user  system elapsed
>   0.008   0.000   0.008
>
> It's hard for me to imagine the single task of getting the indexes is slow
> enough to be a bottleneck.
>
> On the other hand, if the variable being used to identify subsets is a
> factor with many levels (~6000 in my case), it is noticeably slower.
>
>    user  system elapsed
>   0.024   0.002   0.026
>
> I haven't tested it, and have no real expectation that it will make a
> difference, but perhaps sorting by the index variable before iterating
> will help (if you haven't already). Since these are not true indexes in
> the sense used by relational database systems, maybe it will make a
> difference.
>

You might also want to check this out…

http://adv-r.had.co.nz/Performance.html

Cheers,
Ben

> --
> Don MacQueen
>
> Lawrence Livermore National Laboratory
> 7000 East Ave., L-627
> Livermore, CA 94550
> 925-423-1062
>
> On 11/20/13 12:16 PM, "Noah Silverman" <noahsilver...@g.ucla.edu> wrote:
>
>> Hello,
>>
>> I have a fairly large data.frame. (About 150,000 rows of 100
>> variables.) There are case IDs, and multiple entries for each ID, with a
>> date stamp. (i.e. records of people's activity.)
>>
>> I need to iterate over each person (record ID) in the data set, and then
>> process their data for each date. The processing part is fast, the date
>> part is fast. Locating the records is slow. I've even tried using
>> data.table, with ID set as the index, and it is still slow.
>>
>> The line with the slow process (according to Rprof) is:
>>
>> j <- which( d$id == person )
>>
>> (I then process all the records indexed by j, which seems fast enough.)
>>
>> where d is my data.frame or data.table
>>
>> I thought that using the data.table indexing would speed things up, but
>> not in this case.
>>
>> Any ideas on how to speed this up?
>>
>> Thanks!
>>
>> --
>> Noah Silverman, M.S., C.Phil
>> UCLA Department of Statistics
>> 8117 Math Sciences Building
>> Los Angeles, CA 90095
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org
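
A single which( d$id == person ) call is cheap, as the timings quoted above suggest, but inside a loop it rescans the whole id column once per person. One possible workaround, which nobody in this thread proposed and which is offered only as a minimal sketch, is to compute every person's row numbers in one pass with split(). The names d, id, and person come from the original post; the toy data, its sizes, the value column, and the name rows_by_id are assumptions made up for illustration.

```r
# Toy data standing in for the poster's data.frame.
# Sizes and the 'value' column are assumptions, not the real data.
set.seed(1)
d <- data.frame(id    = sample(6000, 150000, replace = TRUE),
                value = rnorm(150000))

# Approach from the original post: one full scan of d$id per person.
person <- d$id[1]
j_slow <- which(d$id == person)

# Alternative sketch: group the row numbers by id once, up front.
# split() returns a list named by the (character) id values.
rows_by_id <- split(seq_len(nrow(d)), d$id)

# Inside the loop, a lookup by name replaces the scan.
j_fast <- rows_by_id[[as.character(person)]]

# Both approaches select the same rows, in the same order.
stopifnot(identical(j_slow, j_fast))
```

To check whether this helps on real data, wrap each variant of the full loop in system.time() (unix.time(), used above, was an alias for it) and compare the elapsed times on your own machine.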