Hi Stian,

Try the following two and look at the difference:

    db[T, matches := str_match_all(text, url_pattern)]
    db[.(T), matches := str_match_all(text, url_pattern)]

;)
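A minimal sketch of the difference being hinted at here, using a made-up table rather than anything from the thread (note: exactly how joins on a logical key behave can depend on the data.table version):

    library(data.table)

    dt <- data.table(x = 1:6, flag = rep(c(TRUE, FALSE), 3))
    setkey(dt, flag)

    dt[T]       # a single TRUE in i is recycled, so j runs over ALL rows
    dt[.(T)]    # with flag as the key, .(T) joins on flag == TRUE only
    dt[(flag)]  # plain logical subsetting on the column, no key needed

This is why, in the timings below, db[T, ...] ran str_match_all over every row while db[has_url == T, ...] only touched the rows that actually contain URLs.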
On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <[email protected]> wrote:

> I really appreciate all your help - amazingly supportive community. I could
> probably figure out a "brute-force" way of doing things, but since I'm going
> to be writing a lot of R in the future too, I always want to find the
> "correct" way of doing it, which both looks clear and is quick. (I come from
> a background in Ruby, and am always interested in writing very clear and DRY
> (don't repeat yourself) code, but I find I still spend a lot of time in R
> struggling with various data formats - lists, nested lists, vectors,
> matrices, different forms of apply/ddply/for loops, etc.)
>
> Anyway, a few different points.
>
> I tried db[has_url,], but got "object has_url not found".
>
> I then tried setkey(db, "has_url") and used that, but somehow it was a lot
> slower than what I used to do (I repeated a few times). Not sure if I'm
> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
> this once. But good to understand the underlying principles.)
>
> setkey(db, "has_url")
>
> system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  17.514   0.334  17.847
>
> system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>   5.943   0.040   5.984
>
> The second point was how to get the matches out. The idea is that you have a
> text field which might contain several URLs, which I want to extract, but I
> need each URL tagged with the row it came from (so I can link it back to
> properties of the post and author, look at whether certain students are more
> likely to post certain kinds of URLs, etc.).
>
> Instead of a function, you'll see above that I rewrote it to use :=, which
> creates a new column that holds a list. That worked wonderfully, but now how
> do I get these "out" of this data.table and into a new one?
>
> Made-up example data:
>
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a, b, NULL))
>
> Now my goal is to have a new data.table that looks like this:
>
> Name       Number
> Stian      1
> Stian      2
> Stian      3
> Christian  2
> Christian  3
> Christian  4
>
> Again, I'm sure I could do this with a for() or lapply, but I'd love to see
> the most elegant solution.
>
> Note that this:
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   data.frame(urls=unlist(matches), id=id)
> }
>
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
> works perfectly; the result is
>
>   id  urls                                                                                                 id
> 1 16  https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166                               16
> 2 24  http://www.youtube.com/watch?v=JUiGF4TGI9w                                                            24
> 3 44  http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/            44
> 4 61  http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html                       61
> 5 75  http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html    75
> 6 75  https://www.facebook.com/photo.php?fbid=10151324672623754                                              75
>
> which is exactly what I was looking for. So I've really reached my goal, but
> I'm curious about the other method as well.
>
> Thanks!
> Stian
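One way to get from the made-up dt above to the Name/Number table Stian describes, offered as a sketch rather than anything taken from the thread, is to unlist() the list column by group (a row whose entry is NULL, like John's, simply contributes no rows):

    library(data.table)

    a <- c(1, 2, 3)
    b <- c(2, 3, 4)
    dt <- data.table(names   = c("Stian", "Christian", "John"),
                     numbers = list(a, b, NULL))

    # unlist the list column within each group
    dt[, list(Number = unlist(numbers)), by = names]

The same idea applies to a list column built with :=, such as matches, grouped by the row id.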
> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <[email protected]> wrote:
>
>> That was my thought too. I don't know what str_match_all is, but given the
>> unlist() in getUrls(), it seems to return a list. Rather than unlist(),
>> leave it as a list, and data.table should happily make a `list` column
>> where each cell is itself a vector. In fact each cell can be anything at
>> all, even an embedded data.table, function definitions, or any type of
>> object. You might need a list(list(str_match_all(...))) in j to do that.
>>
>> Or what Rick has suggested here might work first time. It's hard to
>> visualise it without a small reproducible example, so we're having to make
>> educated guesses.
>>
>> Many thanks for the kind words about data.table.
>>
>> Matthew
>>
>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>
>> In fact, you should be able to skip the function altogether and just use:
>>
>> db[ (has_url), str_match_all(text, url_pattern), by=id]
>>
>> (and now, my apologies to all for the email clutter)
>> good night
>>
>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <[email protected]> wrote:
>>
>>> Sorry, I probably should have elaborated (it's late here, in NJ).
>>>
>>> The error you are seeing is most likely coming from your getUrls function,
>>> in that you are adding several ids to a data.frame of varying rows, and R
>>> cannot recycle it correctly.
>>>
>>> If you instead break it down by id, then each time you are only assigning
>>> one id and R will be able to recycle appropriately, without issue.
>>>
>>> Good luck!
>>> Rick
>>>
>>> Ricardo Saporta
>>> Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>> e: [email protected]
>>>
>>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <[email protected]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> Try inserting a `by=id` in
>>>>
>>>> a <- db[(has_url), getUrls(text, id), by=id]
>>>>
>>>> Also, no need for "has_url == T"; instead, use
>>>>
>>>> (has_url)
>>>>
>>>> if the variable is already logical. (Otherwise, you are just slowing
>>>> things down ;)
>>>>
>>>> Ricardo Saporta
>>>> Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>> e: [email protected]
>>>>
>>>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <[email protected]> wrote:
>>>>
>>>>> I'm trying to run a function on every row fulfilling a certain
>>>>> criterion, which returns a data frame - the idea is then to take the
>>>>> list of data frames and rbindlist them together into a totally separate
>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>> tagging them with the forum post they came from.)
>>>>>
>>>>> I tried doing this with a data.table
>>>>>
>>>>> a <- db[has_url == T, getUrls(text, id)]
>>>>>
>>>>> and get the message
>>>>>
>>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L,  :
>>>>>   replacement has 11007 rows, data has 29787
>>>>>
>>>>> because some rows have several URLs... However, I don't care that these
>>>>> row lengths don't match, I still want these rows :) I thought j would
>>>>> just let me execute arbitrary R code in the context of the rows, with
>>>>> the columns as variable names, etc.
>>>>>
>>>>> Here's the function it's running, but that shouldn't be relevant:
>>>>>
>>>>> getUrls <- function(text, id) {
>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>   a$id <- id
>>>>>   a
>>>>> }
>>>>>
>>>>> Thanks, and thanks for an amazing package - data.table has made my life
>>>>> so much easier. It should be part of base, I think.
>>>>>
>>>>> Stian Haklev, University of Toronto
>>>>>
>>>>> --
>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
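To close the loop, here is a self-contained sketch putting the two approaches from the thread side by side. The table db, its text values, and url_pattern below are all made up for illustration; the real ones are not shown in the thread.

    library(data.table)
    library(stringr)

    # made-up stand-in for the real forum table
    db <- data.table(
      id      = 1:4,
      text    = c("see http://a.example and http://b.example",
                  "no links here",
                  "just http://c.example",
                  "nothing here either"),
      has_url = c(TRUE, FALSE, TRUE, FALSE)
    )
    url_pattern <- "https?://[^ ]+"   # deliberately simplified pattern

    # list-column route (the := approach Stian describes above):
    # str_match_all() returns one character matrix per row, so matches
    # becomes a list column, left NULL for rows that were not touched
    db[(has_url), matches := str_match_all(text, url_pattern)]

    # long-table route (close to Ricardo's by=id suggestion, with an
    # unlist() to flatten): one row per URL, tagged with the id it came from
    urls <- db[(has_url), list(url = unlist(str_match_all(text, url_pattern))), by = id]
    urls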
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
