Excellent, thanks for confirming. Thinking about it now, with fresh eyes, new feature request raised :
FR#2456 rbindlist should choose the highest type per column, not the first
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2456&group_id=240&atid=978

where 'highest' means in this hierarchy:

  LGLSXP < INTSXP < REALSXP < CPLXSXP < STRSXP

That would be easy and wouldn't hurt performance at all.

On 04.01.2013 23:18, patricknic wrote:
> Some output:
>
> ## NAs in bound data
>> dt
> Warning messages:
> 1: In rbindlist(dtlist) : NAs introduced by coercion
> 2: In rbindlist(dtlist) : NAs introduced by coercion
> 3: In rbindlist(dtlist) : NAs introduced by coercion
> 4: In rbindlist(dtlist) : NAs introduced by coercion
> 5: In rbindlist(dtlist) : NAs introduced by coercion
> 6: In rbindlist(dtlist) : NAs introduced by coercion
> ## No NAs in list of data.tables
>> sapply(dtlist, function(x) sum(is.na(x)))
>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> [32] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ## Summary of bound data.table
>> summary(dt)
>   blockfips           land_area           water_area
>  Length:11083767    Min.   :0.000e+00   Min.   :0.000e+00
>  Class :character   1st Qu.:8.098e+03   1st Qu.:0.000e+00
>  Mode  :character   Median :2.478e+04   Median :0.000e+00
>                     Mean   :7.470e+05   Mean   :5.782e+04
>                     3rd Qu.:1.788e+05   3rd Qu.:0.000e+00
>                     Max.   :2.133e+09   Max.   :2.112e+09
>                     NA's   :183         NA's   :14
>       long               lat
>  Min.   :-179.13   Min.   :18.91
>  1st Qu.: -99.74   1st Qu.:34.18
>  Median : -90.09   Median :38.64
>  Mean   : -93.01   Mean   :38.11
>  3rd Qu.: -82.07   3rd Qu.:41.73
>  Max.   : 179.75   Max.   :71.40
>
>> Many thanks. I'll take a look. If you can find a way to narrow
>> down the problem then it might be quicker to resolve. Does it
>> happen with the first 2 items passed to rbindlist, the first
>> 10, which one causes the NA? If each item is chopped to the
>> first 2 rows, does it still happen?
>
>> lapply(seq_along(dtlist), function(x) dtlist[[x]][, tab := x])
>> dt2
> Warning messages:
> 1: In rbindlist(dtlist) : NAs introduced by coercion
> 2: In rbindlist(dtlist) : NAs introduced by coercion
> 3: In rbindlist(dtlist) : NAs introduced by coercion
> 4: In rbindlist(dtlist) : NAs introduced by coercion
> 5: In rbindlist(dtlist) : NAs introduced by coercion
> 6: In rbindlist(dtlist) : NAs introduced by coercion
>> dt2[which(apply(is.na(dt2), 1, any)), table(tab)]
> tab
>   2  13  23  45  50
> 183   1  10   1   2
> So, for the most part it's coming from the second list data.table.
>> dtlist.first2
>
>> dtlist.first10
>> dtlist.first100
>> dtlist.first1000
>> dt.first2
>> dt.first10
>> dt.first100
> Warning message:
> In rbindlist(dtlist.first100) : NAs introduced by coercion
>> dt.first1000
> Warning messages:
> 1: In rbindlist(dtlist.first1000) : NAs introduced by coercion
> 2: In rbindlist(dtlist.first1000) : NAs introduced by coercion
> And NAs start getting introduced somewhere between 10 and 100 row data.tables, which seems really low.
>
>> Also if the list of data.table/data.frame passed to rbindlist
>> is called L, and rbindlist(L) returns an NA column, does
>> lapply(L, sapply, class) reveal any type differences?
>
>> do.call("rbind", lapply(dtlist, sapply, class))
>       blockfips   land_area water_area long      lat
>  [1,] "character" "integer" "integer"  "numeric" "numeric"
>  [2,] "character" "numeric" "numeric"  "numeric" "numeric"
>  [3,] "character" "integer" "integer"  "numeric" "numeric"
>  [4,] "character" "integer" "integer"  "numeric" "numeric"
>  [5,] "character" "integer" "integer"  "numeric" "numeric"
>  [6,] "character" "integer" "integer"  "numeric" "numeric"
>  [7,] "character" "integer" "integer"  "numeric" "numeric"
>  [8,] "character" "integer" "integer"  "numeric" "numeric"
>  [9,] "character" "integer" "integer"  "numeric" "numeric"
> [10,] "character" "integer" "integer"  "numeric" "numeric"
> [11,] "character" "integer" "integer"  "numeric" "numeric"
> [12,] "character" "integer" "integer"  "numeric" "numeric"
> [13,] "character" "numeric" "integer"  "numeric" "numeric"
> [14,] "character" "integer" "integer"  "numeric" "numeric"
> [15,] "character" "integer" "integer"  "numeric" "numeric"
> [16,] "character" "integer" "integer"  "numeric" "numeric"
> [17,] "character" "integer" "integer"  "numeric" "numeric"
> [18,] "character" "integer" "integer"  "numeric" "numeric"
> [19,] "character" "integer" "integer"  "numeric" "numeric"
> [20,] "character" "integer" "integer"  "numeric" "numeric"
> [21,] "character" "integer" "integer"  "numeric" "numeric"
> [22,] "character" "integer" "integer"  "numeric" "numeric"
> [23,] "character" "integer" "numeric"  "numeric" "numeric"
> [24,] "character" "integer" "integer"  "numeric" "numeric"
> [25,] "character" "integer" "integer"  "numeric" "numeric"
> [26,] "character" "integer" "integer"  "numeric" "numeric"
> [27,] "character" "integer" "integer"  "numeric" "numeric"
> [28,] "character" "integer" "integer"  "numeric" "numeric"
> [29,] "character" "integer" "integer"  "numeric" "numeric"
> [30,] "character" "integer" "integer"  "numeric" "numeric"
> [31,] "character" "integer" "integer"  "numeric" "numeric"
> [32,] "character" "integer" "integer"  "numeric" "numeric"
> [33,] "character" "integer" "integer"  "numeric" "numeric"
> [34,] "character" "integer" "integer"  "numeric" "numeric"
> [35,] "character" "integer" "integer"  "numeric" "numeric"
> [36,] "character" "integer" "integer"  "numeric" "numeric"
> [37,] "character" "integer" "integer"  "numeric" "numeric"
> [38,] "character" "integer" "integer"  "numeric" "numeric"
> [39,] "character" "integer" "integer"  "numeric" "numeric"
> [40,] "character" "integer" "integer"  "numeric" "numeric"
> [41,] "character" "integer" "integer"  "numeric" "numeric"
> [42,] "character" "integer" "integer"  "numeric" "numeric"
> [43,] "character" "integer" "integer"  "numeric" "numeric"
> [44,] "character" "integer" "integer"  "numeric" "numeric"
> [45,] "character" "numeric" "integer"  "numeric" "numeric"
> [46,] "character" "integer" "integer"  "numeric" "numeric"
> [47,] "character" "integer" "integer"  "numeric" "numeric"
> [48,] "character" "integer" "integer"  "numeric" "numeric"
> [49,] "character" "integer" "integer"  "numeric" "numeric"
> [50,] "character" "integer" "numeric"  "numeric" "numeric"
> [51,] "character" "integer" "integer"  "numeric" "numeric"
> And there's the problem: in the problem list data.tables column 2 or 3 is numeric instead of integer.
>
>> It does sound like rbindlist should be issuing a warning or
>> being more helpful at least, anyway.
>>
>> Hm. It seems I put it in but commented it out :
>>
>>     if (TYPEOF(thiscol) != TYPEOF(target)) {
>>         thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>>         coerced = TRUE;
>>         // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>>         // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
>>     }
>>
>> Likely that coerce is creating the NA. Types are taken from the first
>> item of L. If a column there is 'numeric' then in a later item L it's
>> character, that'll give rise to an NA.
>>
>> Thinking about it, it can probably coerce the target to cope with the
>> later item ...
>
>>> dtlist
>>> dt
>>
>>> dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols=c("land_area", "water_area")]
>    land_area water_area
> 1:         0          0
> And it's fixed.
>
> Thanks,
> Patrick
>
> On Fri, Jan 4, 2013 at 4:52 AM, Matthew Dowle [via R] <[hidden email]> wrote:
>
>> Many thanks. I'll take a look. If you can find a way to narrow
>> down the problem then it might be quicker to resolve. Does it
>> happen with the first 2 items passed to rbindlist, the first
>> 10, which one causes the NA? If each item is chopped to the
>> first 2 rows, does it still happen?
>>
>> Also if the list of data.table/data.frame passed to rbindlist
>> is called L, and rbindlist(L) returns an NA column, does
>> lapply(L, sapply, class) reveal any type differences?
>>
>> It does sound like rbindlist should be issuing a warning or
>> being more helpful at least, anyway.
>>
>> Hm. It seems I put it in but commented it out :
>>
>>     if (TYPEOF(thiscol) != TYPEOF(target)) {
>>         thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>>         coerced = TRUE;
>>         // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>>         // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
>>     }
>>
>> Likely that coerce is creating the NA. Types are taken from the first
>> item of L. If a column there is 'numeric' then in a later item L it's
>> character, that'll give rise to an NA.
>>
>> Thinking about it, it can probably coerce the target to cope with the
>> later item ...
>>
>> On 03.01.2013 20:30, patricknic wrote:
>>
>>> Apologies, I forgot to switch the directories in the code. Corrected on
>>> nabble and below.
>>>
>>> # Directories
>>> tempwd
>>> setwd(tempwd)
>>>
>>> # Packages
>>> library(dataframe)
>>> library(data.table)
>>> library(foreign)
>>>
>>> # Get blocks and coordinates
>>> state.fips
>>> 15:42,
>>> 44:51, 53:56))
>>> tmpf
>>> dtlist
>>> cat("State", fips, ":\t")
>>> nm
>>> dbfname
>>> if (!file.exists(file.path(tempwd, dbfname))) {
>>>   cat("Downloading...\t")
>>>   url
>>>   paste0("http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/",
>>>          nm, ".zip")
>>>   download.file(url, destfile=tmp, quiet=FALSE)
>>>   unzip(tmp, exdir=tempwd)
>>> }
>>> del
>>> invisible(lapply(del[grep("dbf", del, invert=TRUE)], file.remove))
>>> cat("Reading...\t")
>>> df as.is=TRUE)
>>> dt
>>> cat("Done\n")
>>> dt[, list(blockfips = GEOID, land_area = ALAND, water_area = AWATER,
>>>           long = as.numeric(INTPTLON), lat = as.numeric(INTPTLAT))]
>>> })
>>> b
>>>
>>> ### No NA problem:
>>> dtlist2
>>> b2
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654577.html [2]
>>> Sent from the datatable-help mailing list archive at Nabble.com.
>>> _______________________________________________
>>> datatable-help mailing list
>>> [hidden email] [3]
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [4]

Links:
------
[1] http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/
[2] http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654577.html
[3] http://user/SendEmail.jtp?type=node&node=4654623&i=0
[4] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[5] http://user/SendEmail.jtp?type=node&node=4654623&i=1
[6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[7] http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654623.html
[8] http://r.789695.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
[9] http://is.na
[10] http://is.na
[11] http://r.789695.n4.nabble.com/datatable-help-f2315188.html
_______________________________________________
datatable-help mailing list
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
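
For readers of the archive, the difference between taking each column's type from the first item and taking the highest type (the rule proposed in FR#2456) can be sketched in base R. This is only an illustration of the idea and not data.table's implementation: `type_order`, `highest_type` and the toy list `L` are hypothetical, the values are made up, and R-level type names stand in for the SEXPTYPEs in the hierarchy. That the census columns held values beyond the integer range is an assumption, not something the thread confirms.

```r
# Hypothetical illustration in base R; not data.table's implementation.
# R-level names standing in for LGLSXP < INTSXP < REALSXP < CPLXSXP < STRSXP.
type_order <- c("logical", "integer", "double", "complex", "character")

# Return the highest type found among a list of atomic vectors.
highest_type <- function(cols) {
  types <- vapply(cols, typeof, character(1))
  type_order[max(match(types, type_order))]
}

# Toy items (made-up values): land_area is integer in the first item and
# double in the second, where one value exceeds .Machine$integer.max.
L <- list(
  data.frame(land_area = c(10L, 20L)),
  data.frame(land_area = c(3e9, 5))
)
cols <- lapply(L, `[[`, "land_area")

# Type taken from the first item: 3e9 does not fit in an integer, so it
# becomes NA (the coercion warning is suppressed here).
first_type <- typeof(cols[[1]])
bound_first <- suppressWarnings(
  unlist(lapply(cols, as.vector, mode = first_type))
)

# Type taken as the highest across all items: no value is lost.
target <- highest_type(cols)
bound_highest <- unlist(lapply(cols, as.vector, mode = target))

print(first_type)
print(anyNA(bound_first))
print(target)
print(anyNA(bound_highest))
```

Until rbindlist promotes types itself, the same idea suggests a workaround: coerce the affected columns of every item to a common type before binding.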
