An option to turn on a check like that might be good. Probably at the start of [.data.table.
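
For illustration, here is a minimal sketch of such a check in plain R. The helper names levels_sorted and resort_levels are hypothetical and not part of data.table, and the sketch assumes that "sorted" means the order sort() gives in the current locale, which may not be exactly the order the binary search relies on:

```r
# Hypothetical helper (not part of data.table): TRUE if the levels of a
# factor are already in sorted order, FALSE otherwise.
levels_sorted <- function(f) {
  stopifnot(is.factor(f))
  !is.unsorted(levels(f))
}

# Hypothetical repair: rebuild the factor with sorted levels while keeping
# the same values in the same positions.
resort_levels <- function(f) {
  stopifnot(is.factor(f))
  factor(as.character(f), levels = sort(levels(f)))
}

# factor() sorts the levels itself, so this one is fine.
good <- factor(c("9", "10", "12"))

# Levels given by hand in numeric order; as strings "9" sorts after "10".
bad <- factor(c("9", "10", "12"), levels = c("9", "10", "12"))

levels_sorted(good)                # TRUE
levels_sorted(bad)                 # FALSE
levels_sorted(resort_levels(bad))  # TRUE
```

A check like this only looks at levels(), so its cost depends on the number of levels rather than the number of rows.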
When I've seen this issue before, it's been when I have been constructing data.tables 'manually'. As in other places in R, nothing stops you from creating invalid objects directly. For example (in base R):

> DF = list(1:10,1:5)
> DF
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] 1 2 3 4 5

> class(DF)="data.frame"
> sapply(DF,length)
[1] 10  5
> DF
NULL
<0 rows> (or 0-length row.names)
>
> attr(DF,"row.names")=letters[1:3]
> DF[6,]
   NA NA
NA  6 NA
>

On Mon, 2011-03-07 at 11:30 -0500, Steve Lianoglou wrote:
> Hi,
>
> On Sat, Mar 5, 2011 at 4:06 PM, Matthew Dowle <[email protected]> wrote:
> > Hi,
> > Seems consistent with out of order factor levels. The binary search
> > relies on levels being sorted. If that's it then please track down the
> > earlier point where the out-of-order factor levels were introduced and
> > maybe a fix is needed there. Everything else here is correct behaviour.
>
> I know it sounds lame, but I'm having problems tracking down how my
> key/factor column arrived at having out of order levels.
>
> While I try to smoke that out, do you think it would be a good idea to
> write a small utility at the C level to scan through the levels() of
> factor-keys to test for them being in order and
> breaking/short-circuiting as soon as it finds one level that's out of
> order? This way we can fire off a warning when this problem is
> detected so the user would be warned to expect "weird" behavior (and
> also know how to fix(?))
>
> I'm not sure exactly where/when we would invoke that test -- maybe
> after calls to setkey ... and optionally under merge-like operations.
>
> I can take a crack at doing that if it seems like a good idea.
>
> -steve
>
> > Matthew
> >
> > On Fri, 2011-03-04 at 21:43 -0500, Steve Lianoglou wrote:
> >> Hi Mel,
> >>
> >> On Fri, Mar 4, 2011 at 8:15 PM, Bacou, Melanie <[email protected]> wrote:
> >> > Steve,
> >> >
> >> > Try instead:
> >> >
> >> > R> m2[J(9)]
> >> >
> >> > It seems your original entrez.id key is integer not character
> >>
> >> It's actually a factor:
> >>
> >> R> is(m2$entrez.id)
> >> [1] "factor" "integer" "oldClass" "numeric" "vector"
> >>
> >> and moreover:
> >>
> >> R> '9' %in% levels(m2$entrez.id)
> >> [1] TRUE
> >>
> >> and the integer J() maneuver is a no go:
> >>
> >> R> Error in `[.data.table`(m2, J(9)) :
> >> x.entrez.id is a factor but joining to i.V1 which is not a factor.
> >> Factors must join to factors.
> >>
> >> > -- but to be honest I'm not sure why:
> >> >
> >> > R> m2[9]
> >> >
> >> > doesn't work either...
> >>
> >> That works, in that it does something, but it just gets the 9th row of
> >> m2, not the row whose key is '9'
> >>
> >> Seems like something strange is afoot here ...
> >>
> >> -steve
> >>
> >> > --Mel.
> >> >
> >> > -----Original Message-----
> >> > From: [email protected]
> >> > [mailto:[email protected]] On Behalf Of Steve
> >> > Lianoglou
> >> > Sent: Friday, March 04, 2011 5:46 PM
> >> > To: [email protected]
> >> > Subject: [datatable-help] Something seems funky. I think with
> >> > character-to-factor conversion for keys (?)
> >> >
> >> > I'll have to apologize in advance because I can't create a
> >> > reproducible example for this behavior, but I'll keep trying .. please
> >> > bear with me.
> >> >
> >> > Somehow I've ended up with a data.table `m2` that looks like this:
> >> >
> >> > R> m2
> >> >       entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> >  [1,]         9               27         0            0         0
> >> >  [2,]        10              347         0            0         0
> >> >  [3,]        12             5076         0           17         0
> >> >  [4,]        13             2445         0            0         0
> >> >  [5,]        18             2076         0            0         0
> >> >  [6,]        20               15         0            0         0
> >> >  [7,]        25               62         0            0         0
> >> >  [8,]        32              320         0            0         0
> >> >  [9,]        34             1377         0            0         0
> >> > [10,]        35              757         0            0         0
> >> > First 10 rows of 5236 printed.
> >> >
> >> > R> key(m2)
> >> > [1] "entrez.id"
> >> >
> >> > R> any(duplicated(m2$entrez.id))
> >> > [1] FALSE
> >> >
> >> > So far so good -- I stumbled on the following problem when `merge`-ing
> >> > two large data tables which was giving me a stranger error. In the
> >> > process of trying to smoke out the problem, I notice this unexpected
> >> > behavior:
> >> >
> >> > ## This is expected
> >> > R> subset(m2, entrez.id == '9')
> >> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> > [1,]         9               27         0            0         0
> >> >
> >> > ## This isn't
> >> > R> m2['9']
> >> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> > [1,]         9               NA        NA           NA        NA
> >> >
> >> > Woops! Isn't that supposed to return the same as above?
> >> >
> >> > I can fix `m2` by manipulating the key column:
> >> >
> >> > R> key(m2) <- NULL ## probably not necessary
> >> > R> m2$entrez.id <- as.character(m2$entrez.id)
> >> > R> key(m2) <- 'entrez.id'
> >> > R> m2['9']
> >> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> > [1,]         9               27         0            0         0
> >> >
> >> > (side note: the bug I mentioned when I try to `merge` this w/ another
> >> > data.table is gone after I did the above fix).
> >> >
> >> > So -- I guess my point is that I'm not exactly sure how I got `m2` to
> >> > have a funky key, but the fact that it got messed up like this somehow
> >> > I think is undesired behavior, no?
> >> >
> >> > Does this point to something (maybe obvious) that happened on the way
> >> > to building up `m2`?
> >> >
> >> > Thanks,
> >> > -steve
> >> >
> >> > --
> >> > Steve Lianoglou
> >> > Graduate Student: Computational Systems Biology
> >> > | Memorial Sloan-Kettering Cancer Center
> >> > | Weill Medical College of Cornell University
> >> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >> > _______________________________________________
> >> > datatable-help mailing list
> >> > [email protected]
> >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
