Hi Karl, For a while the "1" and "2" (integers stored as character) confused me. I thought you wanted a,b because that was the value of A[1:2,x]. Now I'm over that, comments inserted below ...
On Tue, 2011-04-12 at 03:53 -0700, Short, Tom wrote: > > -----Original Message----- > > From: [email protected] > > [mailto:[email protected]] On > > Behalf Of Karl Ove Hufthammer > > > > I have found one way of achieving this, creating two > > identical data.tables with different keys: > > > > options(stringsAsFactors=FALSE) > > dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c")) > > dat > > A <- B <- data.table(dat) > > key(A)="x" > > key(B)="y" > > > > A[B["a"][,x]][,y] > > > > The problem is performance (my real-life data.table is *much* > > larger), since B["a"][,x] outputs a character vector. Not character, B["a"][,x] returns a factor for me. > > When > > this is used in A[...], the character is converted to a factor > > with appropriate levels, and it turns out (shown using > > 'Rprof') that the majority of the time running the function > > is taken up by 'levels<-', i.e., creating this factor / > > attaching the levels. Interesting. I doubt it's converting character to factor (because it's already factor), rather matching the levels of column in the i table to the levels of the column in the x table (in this case B). It then changes the levels in a working copy of the i column so the binary search can work on 2 integer columns where those integers mean the same level. There are some performance issues nestled in there which only come to light when you have very many levels (e.g. 100,000+ levels). I can't remember where it starts to bite. How many levels do you have in the real data? I've done quite a bit of work on allowing keys to be character. The need to match levels to levels then goes away and saves some copies too. Sorting character is very slow (which is why data.table likes integers of course) but to solve that we now have a counting sort for character. That C code is committed but not hooked up yet. Short answer then ... you might be hitting a known performance issue. If you have some way to generate the large dataset with large number of levels it would be great to have some more test cases to play with. More comments below ... > > > > I believe one potential solution would be to have both 'x' > > and 'y' being factors, so that there is no conversion to/from > > characters. This would eliminate both the conversion '"a" to > > factor' and 'B["a"][,x] to factor'. > > However, 'data.table' doesn't accept 'i' being a factor (and > > if I convert it to the internal numeric codes, it thinks I > > mean row numbers). Yes, that means row numbers, don't do that! :-) > > > > Any suggestions on how to solve this? > > To answer part of your inquiry, you can use factors by enclosing i with > J() as follows: > > options(stringsAsFactors=TRUE) > dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c")) > A <- B <- data.table(dat) > key(A)="x" > key(B)="y" > A[J(B["a"][,x])][,y] Only slight improvement on that for speed I can think of is : A[J(B["a"]$x)]$y but that would only make a difference if this is being repeated many times. Karl's example seems like a very rare but valid reason to iterate the query. >> I also wonder if it is possible for a data.table to only >> return unique values of a column? For the above example >> I would like the output y=a,b. >> Note that for instance >> A[B["a"][,x],mult="first"][,y] >> does not work, as this returns the first value of ‘y’ for >> each x (here y=a,a). How about : A[J(B["a"][,x])][,unique(y)] >> >> My last question is why >> >> A[B["a"][,x], y, mult="all"] >> >> returns a two-column data.table/data.frame, while >> >> A[B["a"][,x], y, mult="first"] >> >> returns a character vector. I would expect both of >> them to return a character vector. Is this a feature >> or simply a bug? Try FAQs (latest on homepage): 2.2 and 2.17 Also ... It seems strange perhaps because "a" is a single group. It may have been a variable containing "a" or c("a","b") and in data.table we prefer consistent return types regardless of the data passed in. [Background: I dislike having to remember to write drop=FALSE in [.data.frame and believe that is the source of many bugs in user code.] In general the rule is that the type of j is the type of the result. When j is a vector the result will be vector, and if you want a data.table, then wrap with list(). [As per FAQ 1.3]. *Except* when grouping (either by i or by by) since you are unlikely to want to lose which items in the result correspond to which groups. That is tightly coupled to .SD (now) excluding the grouping columns. With all the above in mind, mult="first" and mult="last" is not considered grouping. In those cases the table is subset by i, then j is run once on the result. Thats when j=y vs j=list(y) makes a difference. mult="all" is different because it has a dual meaning: subset and grouping. All that said, it has been tricky to get right. If anyone can see a better way to do it then please do suggest. Matthew > > - Tom > _______________________________________________ > datatable-help mailing list > [email protected] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
