I'm using the text-mining package "tm" to process a large number of blog and 
message-board postings (about 245,000). Does anyone have advice on how to 
efficiently extract the metadata from a corpus of this size? 

tm does a great job of using MPI for many functions (e.g., tmMap), which 
greatly speeds up processing. However, the meta() function that I need does 
not take advantage of MPI. 

I have two ideas: 
1) Find a way of running the meta() function in parallel. Specifically, the 
code I'm running is: 

    urllist <- lapply(workingcorpus, meta, tag = "FeedUrl") 

Unfortunately, when I try the command parLapply I get the following error: 

    Error in checkCluster(cl) : not a valid cluster 
    Calls: parLapply ... is.vector -> clusterApply -> staticClusterApply -> 
    checkCluster 
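For reference, a minimal sketch of what a working parLapply call could look 
like: the "not a valid cluster" error usually means no cluster object was 
created (or passed) as the first argument. This assumes the snow package and 
a corpus named workingcorpus; the worker count is arbitrary: 

    ## Sketch only -- assumes library(snow) and an existing 'workingcorpus'.
    library(snow)
    library(tm)

    cl <- makeCluster(4)          # parLapply needs a cluster as its 1st argument
    clusterEvalQ(cl, library(tm)) # load tm on each worker so meta() is found
    urllist <- parLapply(cl, workingcorpus, meta, tag = "FeedUrl")
    stopCluster(cl)               # always release the workers when done

(Whether this is faster than plain lapply depends on how expensive meta() is 
relative to the cost of shipping each document to a worker.) 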

2) Alternatively, is there a way of extracting all of the metadata into a 
data.frame, which might be faster to process? 
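One possible approach along those lines (a sketch, not tested on a corpus of 
this size; workingcorpus and the tag names are placeholders for your own 
corpus and metadata fields): 

    ## Pull each tag of interest in one pass, then bind into a data.frame.
    library(tm)

    meta_df <- data.frame(
      FeedUrl = unlist(lapply(workingcorpus, meta, tag = "FeedUrl")),
      stringsAsFactors = FALSE
    )

Once the values are in an ordinary data.frame, subsequent filtering and 
summarizing should be much cheaper than repeatedly touching the corpus. 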

Thanks for any suggestions or ideas! 
Shad 


shad thomas | president | glass box research company | +1 (312) 451-3611 tel | 
shad.tho...@glassboxresearch.com | www.glassboxresearch.com 


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
