That was my thought too. I don't know what str_match_all is, but given
the unlist() in getUrls(), it seems to return a list. Rather than
unlist(), leave it as a list, and data.table should happily make a
`list` column where each cell is itself a vector. In fact each cell can
be anything at all: an embedded data.table, a function definition, or
any other type of object.
You might need a list(list(str_match_all(...))) in j to do that.
Or what Rick has suggested here might work first time. It's hard to
visualise it without a small reproducible example, so we're having to
make educated guesses.
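A minimal sketch of the list-column idea, assuming a toy `db` and a simplified `url_pattern` (both hypothetical, since the original data was not posted). Base R's regmatches()/gregexpr() stand in for str_match_all() so the example needs only data.table. Wrapping the per-row result in list() inside j is what keeps each group's matches in a single cell:

```r
library(data.table)

# Hypothetical stand-in for the poster's table; the real `db` was not shown
db <- data.table(
  id      = 1:3,
  text    = c("see http://a.org and http://b.org", "no links here", "http://c.org"),
  has_url = c(TRUE, FALSE, TRUE)
)
url_pattern <- "http://[^ ]+"  # assumed, deliberately simplistic

# One group per id; list(...) wraps that row's character vector into one cell
res <- db[(has_url),
          .(urls = list(regmatches(text, gregexpr(url_pattern, text))[[1]])),
          by = id]

# Each cell of `urls` is a character vector whose length varies by row
print(res)
print(lengths(res$urls))
```

If a long table is wanted later, the list column can be flattened with something like res[, .(url = unlist(urls)), by = id].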
Many thanks for the kind words about data.table.
Matthew
On 27/09/13 07:44, Ricardo Saporta wrote:
In fact, you should be able to skip the function altogether and just use:
db[ (has_url), str_match_all(text, url_pattern), by=id]
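To illustrate why grouping by id sidesteps the length mismatch, here is a small sketch on made-up data. The table, the pattern, and the use of base R's regmatches() in place of str_match_all() are all assumptions. Within each group, id is a single value, so data.table repeats it for however many URLs that post yields:

```r
library(data.table)

# Hypothetical posts; only rows flagged has_url are scanned
db <- data.table(
  id      = c(10L, 20L, 30L),
  text    = c("http://a.org then http://b.org", "nothing", "http://c.org"),
  has_url = c(TRUE, FALSE, TRUE)
)
url_pattern <- "http://[^ ]+"  # assumed, simplified pattern

# j returns a vector of varying length per group; the group's id is
# repeated alongside it automatically
links <- db[(has_url),
            .(url = regmatches(text, gregexpr(url_pattern, text))[[1]]),
            by = id]
print(links)
```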
(and now, my apologies to all for the email clutter)
good night
On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta wrote:
sorry, I probably should have elaborated (it's late here, in NJ)
The error you are seeing is most likely coming from your getUrls
function: without grouping, it receives several ids at once and
tries to attach them to a data.frame with a different number of
rows, and `R` cannot recycle them correctly.
If you instead break it down by id, then each call assigns only
one id, and R will be able to recycle it appropriately,
without issue.
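The recycling failure can be reproduced in isolation with base R. This is only a sketch with invented values: three URLs but two ids fails in the same way as the reported error, while a single id recycles to every row:

```r
# Three extracted URLs, but two ids supplied at once (as happens without by=id)
a <- data.frame(urls = c("u1", "u2", "u3"), stringsAsFactors = FALSE)
err <- tryCatch({ a$id <- c(1L, 2L); NULL }, error = function(e) e)
print(conditionMessage(err))

# A single id is recycled to all three rows without complaint
b <- data.frame(urls = c("u1", "u2", "u3"), stringsAsFactors = FALSE)
b$id <- 7L
print(b)
```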
good luck!
Rick
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta wrote:
Hi there,
Try inserting a `by=id` in
a <- db[(has_url), getUrls(text, id), by=id]
Also, no need for "has_url == T";
instead, use
(has_url)
if the variable is already logical. (Otherwise, you are just
slowing things down ;)
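A quick sketch on invented data showing that the two filters select the same rows. My understanding is that the parentheses are what make data.table evaluate the logical column as a row filter, since a bare column name on its own in i is handled differently:

```r
library(data.table)

# Hypothetical table with a logical flag column
db <- data.table(id = 1:4, has_url = c(TRUE, FALSE, TRUE, FALSE))

explicit <- db[has_url == TRUE]  # spelled TRUE, since T can be reassigned
wrapped  <- db[(has_url)]        # parenthesised logical column as the filter

print(explicit$id)
print(wrapped$id)
```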
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev wrote:
I'm trying to run a function on every row fulfilling a
certain criterion, which returns a data frame - the idea
is then to take the list of data frames and rbindlist them
together for a totally separate data.table. (I'm
extracting several URL links from each forum post, and
tagging them with the forum post they came from).
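The plan described above (one data frame per post, then rbindlist) can be sketched as follows. The `posts` table and `url_pattern` are hypothetical, and base R's regmatches() is used in place of str_match_all() to keep the example dependency-free apart from data.table:

```r
library(data.table)

# Hypothetical forum posts
posts <- data.table(
  id   = 1:3,
  text = c("http://a.org and http://b.org", "no links", "http://c.org")
)
url_pattern <- "http://[^ ]+"  # assumed, simplified pattern

# Build one small data frame per post, tagging each URL with its post id
per_post <- lapply(seq_len(nrow(posts)), function(i) {
  urls <- regmatches(posts$text[i], gregexpr(url_pattern, posts$text[i]))[[1]]
  data.frame(id = rep(posts$id[i], length(urls)), urls = urls,
             stringsAsFactors = FALSE)
})

# Stack them into a single data.table; posts without URLs add no rows
links <- rbindlist(per_post)
print(links)
```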
I tried doing this with a data.table
a <- db[has_url == T, getUrls(text, id)]
and get the message
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L,
1L, 2L, 4L, :
replacement has 11007 rows, data has 29787
Because some rows have several URLs... However, I don't
care that these row lengths don't match, I still want these
rows :) I thought j would just let me execute arbitrary R
code in the context of the rows, with columns as variable names, etc.
Here's the function it's running, but that shouldn't be
relevant
getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  a <- data.frame(urls=unlist(matches))
  a$id <- id
  a
}
Thanks, and thanks for an amazing package - data.table has
made my life so much easier. It should be part of base, I
think.
Stian Haklev, University of Toronto
--
http://reganmian.net/blog -- Random Stuff that Matters
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help