On Mon, Mar 7, 2011 at 10:18 PM, Matthew Dowle <[email protected]> wrote:
> Maybe. The slowdown would be fairly significant, perhaps. Although the
> levels vector is contiguous in memory, the global character hash (the
> memory where the character pointers point to) isn't. It's not the string
> cmp as such, it's the page fetches. Also, it might potentially do this
> check over and over again for the same levels vectors (very wasteful).
> Remember that [.data.table is recursive in places, although once only I
> think.

Good point.

Well -- there is a place I've identified in merge.data.table that throws
an esoteric error due to this problem. Perhaps I'll just catch the error
there and wrap it with a test to see if the levels of the factor are
sorted -- it'll only hit that once, and if it does happen, maybe I (or
whoever) will be able to remember how it got messed up to begin with.

> Did you find out what created the out-of-order levels? This check won't
> help you find out where that occurred, or will it?

Unfortunately, I haven't investigated much further.

I have a sneaking suspicion that it has to do with the type of values
that are in the key'd column (entrez.id). It is originally of type
character when the data.table is constructed -- then after it is key'd,
it turns into a factor.

When I was making this particular data.table, I saved the table to a
text file, and reloaded it into another R session via read.table. The
thing with that entrez.id column is that it can successfully be parsed
as an integer, so it was likely read in as such. Somehow after it was
keyed it was turned into a factor -- not sure how.

The ordering of the "broken" levels is consistent with ordering an
integer:

R> head(levels(m2$entrez.id))
[1] "9"  "10" "12" "13" "18" "20"

And after fixing the data.table, the levels are reordered as a character
vector should be:

R> m2$entrez.id <- factor(as.character(m2$entrez.id))
[1] "10"        "10009"     "100093630" "10010"     "100113384" "100113407"

I'm thinking all signs are pointing to me having done something
boneheaded ... I haven't had time to really try too many different
things, but the one obvious thing I tried, reconstructing my m2 table
from my raw (text) data file, isn't turning my entrez.id column into a
screwed-up factor column.

Anyway -- I guess there isn't much to do just yet. As I said, I'll just
add a check for an error in the appropriate place in merge.data.table
and keep my eye out to see if it happens again.

-steve

>
>
> On Mon, 2011-03-07 at 21:39 -0500, Steve Lianoglou wrote:
>> On Mon, Mar 7, 2011 at 8:50 PM, Matthew Dowle <[email protected]> wrote:
>> > Btw :
>> >
>> >> a small utility at the C level to scan through the levels() of
>> >> factor-keys to test for them being in order and
>> >> breaking/short-circuiting as soon as it finds one level that's out of
>> >> order?
>> >
>> > That's base::is.unsorted(), which is done in C.
>>
>> Aww -- was looking forward to writing some C code ...
>>
>> It looks like you were right, though -- the problematic data.table has
>> a (factor) key where `is.unsorted(levels(the_key_column))` is TRUE.
>>
>> So I guess we're talking about having something like
>> options(datatable.check.factor.levels=TRUE) check at the top of the
>> [.data.table function that fires a warning() when the levels are
>> unsorted, yeah?
>>
>> -steve
>>
>
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
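
For anyone following along, the symptom discussed in this thread can be
reproduced in base R with a small sketch. Everything here is illustrative:
the `ids` values are made-up stand-ins for the entrez.id column, the
integer-ordered levels are built by hand (the thread never established how
they actually arose), and `check_factor_levels()` is a hypothetical helper.
The option name is the one floated in the thread, not an existing
data.table option.

```r
# Made-up IDs standing in for the entrez.id column (assumption).
ids <- c("9", "10", "12", "10009", "100093630")

# Build levels in integer order by hand, mimicking the "broken" factor.
int_levels <- as.character(sort(as.integer(ids)))
f_broken <- factor(ids, levels = int_levels)

# The fix from the thread: rebuild the factor from character values,
# so factor() sorts the levels as character strings.
f_fixed <- factor(as.character(f_broken))

is.unsorted(levels(f_broken))  # integer-ordered levels are out of order as strings
is.unsorted(levels(f_fixed))   # rebuilt levels are in character order

# Hypothetical helper: warn when a factor's levels are unsorted, but only
# if the (proposed, not existing) option is switched on.
check_factor_levels <- function(x) {
  if (isTRUE(getOption("datatable.check.factor.levels", FALSE)) &&
      is.factor(x) && is.unsorted(levels(x))) {
    warning("factor levels are not sorted")
    return(FALSE)
  }
  TRUE
}
```

With the option left unset the helper does no work beyond the getOption()
lookup, which is the point of gating it: the levels scan Matthew is
worried about only runs when someone asks for it.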
