> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 19.610   0.475  20.304
> system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text, url_pattern))) :
  All items in j=list(...) should be atomic vectors or lists. If you are
  trying something like j=list(.SD,newcol=mean(colA)) then use := by group
  instead (much quicker), or cbind or merge afterwards.
Timing stopped at: 6.339 0.043 6.403
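
For anyone following along, here is a minimal sketch of the two points from the quoted thread below, on made-up data. The toy `db`, the counters `n_all` and `n_url`, and the result name `flat` are assumptions for illustration, not the real forum data. A bare `T` in `i` is a length-one logical, so it selects every row, which is probably why that call takes about as long as running the regex over the whole table; `(has_url)` keeps only the flagged rows. The second half shows one possible way (not necessarily the most elegant) to flatten a list column using `rep()` and `unlist()`; it assumes R >= 3.2.0 for `lengths()`.

```r
library(data.table)

# Toy stand-in for the real forum data (hypothetical values).
db <- data.table(id = 1:4,
                 has_url = c(TRUE, FALSE, TRUE, FALSE))

# A bare TRUE in i is a length-one logical, so it selects every row.
n_all <- nrow(db[TRUE])
# Wrapping the logical column in parentheses keeps only the flagged rows.
n_url <- nrow(db[(has_url)])

# Made-up list column from the thread.
a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# Repeat each name once per element of its list cell, then flatten the cells.
# A row whose cell is NULL (here "John") contributes no output rows.
flat <- dt[, .(Name = rep(names, lengths(numbers)),
               Number = unlist(numbers))]
print(flat)
```

Because the names are repeated by the length of each cell, the row-to-value link is preserved without a `by=` pass; the same pattern should carry over to an `id` column paired with a list column of matches.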

On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <[email protected]> wrote:

> Hi Stian,
>
> Try the following two and look at the difference:
>
>     db[T, matches := str_match_all(text, url_pattern)]
>     db[.(T), matches := str_match_all(text, url_pattern)]
>
> ;)
>
>
> On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <[email protected]> wrote:
>
>> I really appreciate all your help - amazingly supportive community. I
>> could probably figure out a "brute-force" way of doing things, but since
>> I'm going to be writing a lot of R in the future too, I always want to find
>> the "correct" way of doing it, which both looks clear, and is quick. (I
>> come from a background in Ruby, and am always interested in writing very
>> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
>> of time in R struggling with various data formats - lists, nested lists,
>> vectors, matrices, different forms of apply/ddply/for loops etc).
>>
>> Anyway, a few different points.
>>
>> I tried db[has_url,], but got "object has_url not found"
>>
>> I then tried setkey(db, "has_url"), and using this, but somehow it was a
>> lot slower than what I used to do (I repeated a few times). Not sure if I'm
>> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
>> this once. But good to understand the underlying principles).
>>
>> setkey(db, "has_url")
>> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>>    user  system elapsed
>>  17.514   0.334  17.847
>> > system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
>>    user  system elapsed
>>   5.943   0.040   5.984
>>
>> The second point was how to get out the matches. The idea was that you
>> have a text field which might contain several urls, which I want to
>> extract, but I need each URL tagged with the row it came from (so I can
>> link it back to properties of the post and author, look at whether certain
>> students are more likely to post certain kinds of URLs etc).
>>
>> Instead of a function, you'll see above that I rewrote it to use :=,
>> which creates a new column that holds a list. That worked wonderfully, but
>> now how do I get these "out" of this data.table, and into a new one.
>>
>> Made-up example data:
>> a <- c(1,2,3)
>> b <- c(2,3,4)
>> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,NULL))
>>
>> Now my goal is to have a new data.table that looks like this
>> Name       Number
>> Stian      1
>> Stian      2
>> Stian      3
>> Christian  2
>> Christian  3
>> Christian  4
>>
>> Again, I'm sure I could do this with a for() or lapply? But I'd love to
>> see the most elegant solution.
>>
>> Note that this:
>>
>> getUrls <- function(text, id) {
>>   matches <- str_match_all(text, url_pattern)
>>   data.frame(urls=unlist(matches), id=id)
>> }
>>
>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>>
>> Works perfectly, the result is
>>
>>    id urls id
>> 1  16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 16
>> 2  24 http://www.youtube.com/watch?v=JUiGF4TGI9w 24
>> 3  44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/ 44
>> 4  61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html 61
>> 5  75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html 75
>> 6  75 https://www.facebook.com/photo.php?fbid=10151324672623754 75
>>
>> which is exactly what I was looking for. So I've really reached my goal,
>> but I'm curious about the other method as well.
>>
>> Thanks!
>> Stian
>>
>>
>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <[email protected]> wrote:
>>
>>> That was my thought too. I don't know what str_match_all is, but given
>>> the unlist() in getUrls(), it seems to return a list. Rather than
>>> unlist(), leave it as list, and data.table should happily make a `list`
>>> column where each cell is itself a vector.
>>> In fact each cell can be anything at all, even embedded data.table,
>>> function definitions, or any type of object.
>>> You might need a list(list(str_match_all(...))) in j to do that.
>>>
>>> Or what Rick has suggested here might work first time. It's hard to
>>> visualise it without a small reproducible example, so we're having to make
>>> educated guesses.
>>>
>>> Many thanks for the kind words about data.table.
>>>
>>> Matthew
>>>
>>>
>>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>>
>>> In fact, you should be able to skip the function altogether and just
>>> use:
>>>
>>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>>
>>> (and now, my apologies to all for the email clutter)
>>> good night
>>>
>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <[email protected]> wrote:
>>>
>>>> sorry, I probably should have elaborated (it's late here, in NJ)
>>>>
>>>> The error you are seeing is most likely coming from your getUrls
>>>> function, in that you are adding several ids to a data.frame with a
>>>> different number of rows, and `R` cannot recycle them correctly.
>>>>
>>>> If you instead break it down by id, then each time you are only assigning
>>>> one id and R will be able to recycle appropriately, without issue.
>>>>
>>>> good luck!
>>>> Rick
>>>>
>>>>
>>>> Ricardo Saporta
>>>> Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>> e: [email protected]
>>>>
>>>>
>>>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <[email protected]> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> Try inserting a `by=id` in
>>>>>
>>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>>
>>>>> Also, no need for "has_url == T"
>>>>> instead, use
>>>>>     (has_url)
>>>>> If the variable is already logical.
>>>>> (Otherwise, you are just slowing things down ;)
>>>>>
>>>>>
>>>>> Ricardo Saporta
>>>>> Graduate Student, Data Analytics
>>>>> Rutgers University, New Jersey
>>>>> e: [email protected]
>>>>>
>>>>>
>>>>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <[email protected]> wrote:
>>>>>
>>>>>> I'm trying to run a function on every row fulfilling a certain
>>>>>> criterion, which returns a data frame - the idea is then to take the list
>>>>>> of data frames and rbindlist them together for a totally separate
>>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>>> tagging them with the forum post they came from).
>>>>>>
>>>>>> I tried doing this with a data.table
>>>>>>
>>>>>>     a <- db[has_url == T, getUrls(text, id)]
>>>>>>
>>>>>> and get the message
>>>>>>
>>>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, :
>>>>>>   replacement has 11007 rows, data has 29787
>>>>>>
>>>>>> Because some rows have several URLs... However, I don't care that
>>>>>> these row lengths don't match, I still want these rows :) I thought j
>>>>>> would just let me execute arbitrary R code in the context of the rows,
>>>>>> with the columns available as variable names, etc.
>>>>>>
>>>>>> Here's the function it's running, but that shouldn't be relevant
>>>>>>
>>>>>> getUrls <- function(text, id) {
>>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>>   a$id <- id
>>>>>>   a
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Thanks, and thanks for an amazing package - data.table has made my
>>>>>> life so much easier. It should be part of base, I think.
>>>>>> Stian Haklev, University of Toronto
>>>>>>
>>>>>> --
>>>>>> http://reganmian.net/blog -- Random Stuff that Matters

--
http://reganmian.net/blog -- Random Stuff that Matters
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
