On Wed, 2011-04-13 at 04:04 -0700, Karl Ove Hufthammer wrote:
> Thank you for your detailed reply. Comments below:
>
> On Wed, 13 Apr 2011 10:05:56 +0100, Matthew Dowle <[email protected]>
> wrote:
> >>> options(stringsAsFactors=FALSE)
> >>> dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
> >>> dat
> >>> A <- B <- data.table(dat)
> >>> key(A)="x"
> >>> key(B)="y"
> >>>
> >>> A[B["a"][,x]][,y]
> >>>
> >>> The problem is performance (my real-life data.table is *much*
> >>> larger), since B["a"][,x] outputs a character vector.
> >
> > Not character, B["a"][,x] returns a factor for me.
>
> Did you remember to run the line ‘options(stringsAsFactors=FALSE)’?
> The ‘x’ column in ‘B’ is a character vector when the data.table is
> created from a data.frame with ‘x’ as a character vector (but a
> factor if I create the data.table directly).

I just created the data.table directly. Ok, I'm with you now.

> I have about 150,000 levels in one of the keys and 30,000 in the other.

Thanks. Might be the known issue, but looking at your code it's likely
something more basic.

> I’ve tried to come up with a similar generated dataset and example code.
> The result is much faster than similar code on my real data set, but
> still shows (using ‘Rprof’) that ‘levels<-’ is the main bottle-neck, as
> about one third of the time is spent there. Here’s the code. I’ve
> included both a tiny example to show how the function is supposed to
> work, and a larger and slower example (which is still pretty fast, about
> thirty seconds on my computer):
>
> ------------------------------------------------------------------------

I've been looking at the code for 40 minutes. It generates data and runs,
but I can't grasp the big picture. If I were doing iterative connectedness
I'd just iterate bulk joins until no new connections turned up. I don't
see why there is a processed vector as long as the table, or why it
appears to do just the first y on the first iteration (why not all the
unique y in one go on the first step?), or what the group column added at
the end represents.

In terms of the levels<-, could I ask for a simpler example isolating
that, please?

Other thoughts:

Why can't all columns of x2y and y2x be factor?

More to the point, why store the x integers as character? Can't they be
kept as integers, so that the levels<- thing goes away?

Is it at all possible you didn't know that x2y[J(c(1,6,8))] joins using
the integer values and doesn't refer to the row numbers?

Matthew

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
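
The distinction raised in the reply, that `J()` joins on key values while a bare numeric vector selects row numbers, can be illustrated with a minimal sketch. It assumes a current data.table release (where `setkey()` replaces the older `key(A)="x"` form used in the thread). The table contents are hypothetical and are not the data from the original post; only the name `x2y` is borrowed from it.

```r
library(data.table)

# Hypothetical stand-in for x2y: integer x values, two rows per value.
x2y <- data.table(
  x = c(1L, 1L, 6L, 6L, 8L, 8L, 9L, 9L),
  y = letters[1:8]
)
setkey(x2y, x)  # sort by x and mark it as the key

# J() builds a one-column table that is joined against the key,
# so this returns every row whose x VALUE is 1, 6 or 8.
by_value <- x2y[J(c(1L, 6L, 8L))]

# A bare numeric vector in i is treated as ROW NUMBERS instead,
# so this returns the 1st, 6th and 8th rows of the table.
by_row <- x2y[c(1, 6, 8)]

print(by_value)
print(by_row)
```

Here `x` is stored as integer rather than character, which is the representation the reply suggests trying.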
