hm... not sure about `j` (sorry, I haven't taken a close look at your code), but my comment was to point out that these two statements are different:

  DT[TRUE, ]
  DT[.(TRUE), ]

The first one gives you the whole data.table: DT[TRUE, ] is the same as DT, since the single TRUE gets recycled as a logical subscript. The second one gives you only the rows of DT where the first column of the key has the value TRUE.
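A tiny made-up illustration of the difference (this toy table is not from the thread; it just assumes a data.table keyed on a logical column, as in the setkey(db, "has_url") further down in the quoted thread):

  library(data.table)

  DT <- data.table(has_url = c(TRUE, FALSE, TRUE), x = 1:3)
  setkey(DT, has_url)

  DT[TRUE, ]     # logical subscript: the single TRUE is recycled, so all rows come back
  DT[.(TRUE), ]  # join on the key: only the rows whose key value is TRUE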
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: [email protected]


On Fri, Sep 27, 2013 at 12:20 PM, Stian Håklev <[email protected]> wrote:

> system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  19.610   0.475  20.304
>
> system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
> Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text, url_pattern))) :
>   All items in j=list(...) should be atomic vectors or lists. If you are
>   trying something like j=list(.SD,newcol=mean(colA)) then use := by group
>   instead (much quicker), or cbind or merge afterwards.
> Timing stopped at: 6.339 0.043 6.403
>
>
> On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <[email protected]> wrote:
>
>> Hi Stian,
>>
>> Try the following two and look at the difference:
>>
>> db[T, matches := str_match_all(text, url_pattern)]
>> db[.(T), matches := str_match_all(text, url_pattern)]
>>
>> ;)
>>
>>
>> On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <[email protected]> wrote:
>>
>>> I really appreciate all your help - an amazingly supportive community.
>>> I could probably figure out a "brute-force" way of doing things, but
>>> since I'm going to be writing a lot of R in the future too, I always
>>> want to find the "correct" way of doing it, which both looks clear and
>>> is quick. (I come from a background in Ruby and am always interested
>>> in writing very clear and DRY (don't repeat yourself) code, but I find
>>> I still spend a lot of time in R struggling with various data formats -
>>> lists, nested lists, vectors, matrices, different forms of
>>> apply/ddply/for loops, etc.)
>>>
>>> Anyway, a few different points.
>>>
>>> I tried db[has_url, ], but got "object has_url not found".
>>>
>>> I then tried setkey(db, "has_url") and used that, but somehow it was a
>>> lot slower than what I was doing before (I repeated it a few times).
>>> Not sure if I'm doing it wrong. (Not important - even 15 seconds is
>>> totally fine, I'll only run this once. But it's good to understand the
>>> underlying principles.)
>>>
>>> setkey(db, "has_url")
>>>
>>> system.time( db[T, matches := str_match_all(text, url_pattern)] )
>>>    user  system elapsed
>>>  17.514   0.334  17.847
>>>
>>> system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
>>>    user  system elapsed
>>>   5.943   0.040   5.984
>>>
>>> The second point was how to get the matches out. The idea is that you
>>> have a text field which might contain several URLs, which I want to
>>> extract, but I need each URL tagged with the row it came from (so I
>>> can link it back to properties of the post and author, look at whether
>>> certain students are more likely to post certain kinds of URLs, etc.).
>>>
>>> Instead of a function, you'll see above that I rewrote it to use :=,
>>> which creates a new column that holds a list. That worked wonderfully,
>>> but now how do I get these "out" of this data.table and into a new one?
>>>
>>> Made-up example data:
>>>
>>> a <- c(1,2,3)
>>> b <- c(2,3,4)
>>> dt <- data.table(names=c("Stian", "Christian", "John"),
>>>                  numbers=list(a, b, NULL))
>>>
>>> Now my goal is to have a new data.table that looks like this:
>>>
>>> Name       Number
>>> Stian           1
>>> Stian           2
>>> Stian           3
>>> Christian       2
>>> Christian       3
>>> Christian       4
>>>
>>> Again, I'm sure I could do this with a for() or lapply, but I'd love
>>> to see the most elegant solution.
>>>
>>> Note that this:
>>>
>>> getUrls <- function(text, id) {
>>>   matches <- str_match_all(text, url_pattern)
>>>   data.frame(urls=unlist(matches), id=id)
>>> }
>>>
>>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>>>
>>> works perfectly; the result is
>>>
>>>    id urls                                                                                                id
>>> 1  16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166                             16
>>> 2  24 http://www.youtube.com/watch?v=JUiGF4TGI9w                                                          24
>>> 3  44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/          44
>>> 4  61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html                     61
>>> 5  75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html  75
>>> 6  75 https://www.facebook.com/photo.php?fbid=10151324672623754                                           75
>>>
>>> which is exactly what I was looking for. So I've really reached my
>>> goal, but I'm curious about the other method as well.
>>>
>>> Thanks!
>>> Stian
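For the made-up dt just above, one way to get that long format (a sketch added here, not something proposed in the thread) is to unlist() the list column within each group; the NULL entry yields zero rows, so "John" simply drops out:

  library(data.table)

  a  <- c(1, 2, 3)
  b  <- c(2, 3, 4)
  dt <- data.table(names   = c("Stian", "Christian", "John"),
                   numbers = list(a, b, NULL))

  dt[, list(Number = unlist(numbers)), by = names]
  #        names Number
  # 1:     Stian      1
  # 2:     Stian      2
  # 3:     Stian      3
  # 4: Christian      2
  # 5: Christian      3
  # 6: Christian      4

setnames() can rename the grouping column to "Name" afterwards, and if the names were not guaranteed to be unique it would be safer to group by a row id instead.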
>>>
>>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <[email protected]> wrote:
>>>
>>>> That was my thought too. I don't know what str_match_all is, but
>>>> given the unlist() in getUrls(), it seems to return a list. Rather
>>>> than unlist(), leave it as a list, and data.table should happily make
>>>> a `list` column where each cell is itself a vector. In fact each cell
>>>> can be anything at all, even an embedded data.table, function
>>>> definitions, or any type of object. You might need a
>>>> list(list(str_match_all(...))) in j to do that.
>>>>
>>>> Or what Rick has suggested here might work first time. It's hard to
>>>> visualise it without a small reproducible example, so we're having to
>>>> make educated guesses.
>>>>
>>>> Many thanks for the kind words about data.table.
>>>>
>>>> Matthew
>>>>
>>>>
>>>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>>>
>>>> In fact, you should be able to skip the function altogether and just use:
>>>>
>>>> db[(has_url), str_match_all(text, url_pattern), by=id]
>>>>
>>>> (and now, my apologies to all for the email clutter)
>>>> good night
>>>>
>>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <[email protected]> wrote:
>>>>
>>>>> Sorry, I probably should have elaborated (it's late here, in NJ).
>>>>>
>>>>> The error you are seeing most likely comes from your getUrls()
>>>>> function: you are assigning several ids to a data.frame with a
>>>>> different number of rows, and R cannot recycle them correctly.
>>>>>
>>>>> If you instead break things down by id, then each time you are only
>>>>> assigning one id, and R will be able to recycle it appropriately,
>>>>> without issue.
>>>>>
>>>>> Good luck!
>>>>> Rick
>>>>>
>>>>> Ricardo Saporta
>>>>> Graduate Student, Data Analytics
>>>>> Rutgers University, New Jersey
>>>>> e: [email protected]
>>>>>
>>>>>
>>>>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <[email protected]> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> Try inserting a `by=id` in
>>>>>>
>>>>>> a <- db[(has_url), getUrls(text, id), by=id]
>>>>>>
>>>>>> Also, there is no need for "has_url == T"; instead, use
>>>>>>
>>>>>> (has_url)
>>>>>>
>>>>>> if the variable is already logical. (Otherwise you are just slowing
>>>>>> things down ;)
>>>>>>
>>>>>> Ricardo Saporta
>>>>>> Graduate Student, Data Analytics
>>>>>> Rutgers University, New Jersey
>>>>>> e: [email protected]
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <[email protected]> wrote:
>>>>>>
>>>>>>> I'm trying to run, on every row that fulfils a certain criterion, a
>>>>>>> function which returns a data frame - the idea is then to take the
>>>>>>> list of data frames and rbindlist() them together into a totally
>>>>>>> separate data.table. (I'm extracting several URL links from each
>>>>>>> forum post and tagging them with the forum post they came from.)
>>>>>>>
>>>>>>> I tried doing this with a data.table:
>>>>>>>
>>>>>>> a <- db[has_url == T, getUrls(text, id)]
>>>>>>>
>>>>>>> and get the message
>>>>>>>
>>>>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, :
>>>>>>>   replacement has 11007 rows, data has 29787
>>>>>>>
>>>>>>> because some rows have several URLs... However, I don't care that
>>>>>>> these row lengths don't match, I still want these rows :) I thought
>>>>>>> j would just let me execute arbitrary R code in the context of the
>>>>>>> rows as variable names, etc.
>>>>>>>
>>>>>>> Here's the function it's running, but that shouldn't be relevant:
>>>>>>>
>>>>>>> getUrls <- function(text, id) {
>>>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>>>   a$id <- id
>>>>>>>   a
>>>>>>> }
>>>>>>>
>>>>>>> Thanks, and thanks for an amazing package - data.table has made my
>>>>>>> life so much easier. It should be part of base, I think.
>>>>>>>
>>>>>>> Stian Haklev, University of Toronto
>>>>>>>
>>>>>>> --
>>>>>>> http://reganmian.net/blog -- Random Stuff that Matters
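Matthew's point about a small reproducible example is easy to satisfy with made-up data. Everything below is invented for illustration - in particular, url_pattern never appears in the thread, so the pattern here is only a stand-in - but it exercises the grouped call that the replies converge on:

  library(data.table)
  library(stringr)

  url_pattern <- "https?://[^[:space:]]+"   # stand-in pattern, not the thread's own

  db <- data.table(id   = 1:3,
                   text = c("see http://a.example and http://b.example",
                            "no link in this post",
                            "just http://c.example/page"))
  db[, has_url := str_detect(text, url_pattern)]

  getUrls <- function(text, id) {
    matches <- str_match_all(text, url_pattern)
    data.frame(urls = unlist(matches), id = id)
  }

  # Grouping by id means getUrls() sees one post, and one length-1 id, at a
  # time, so the id recycles cleanly over however many URLs that post contains.
  # id shows up twice in the result (once from by=, once from the data.frame),
  # exactly as in the output Stian pasted above.
  db[(has_url), getUrls(text, id), by = id]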
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
