Hey everybody!

I have to use R's tm package to do some text analysis; the first step would be 
to create a term frequency matrix.
Digging through tm's source code, it seems to use logic like this to create 
term frequencies:

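library(tm)
data("crude")
# Raw text of the first document
(txt <- Content(crude[[1]]))
# Replace runs of non-alphanumeric characters with a space, then split on spaces
(tokTxt <- unlist(strsplit(gsub("[^[:alnum:]]+", " ", txt), " ", fixed = TRUE)))
# Counts the single token 'two'
table(factor(tokTxt, levels = c('two')))
# Always 0: no token produced above can contain a space
table(factor(tokTxt, levels = c('two days')))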

As this example demonstrates, tokenizing the input text makes it impossible to 
count "a group of words separated by whitespace" as a single term.

So my question is: How would you create such a term frequency matrix in R?

Here's some Ruby code I once wrote to show what I want:
txt = "some text containing two days\n"
freq = ['two', 'two days'].inject({}) { |h, w|
  h[w] = txt.scan(/\s#{Regexp.escape(w)}\s/).length; h }
(Reads as: "Given txt, generate an associative array mapping each word to its 
frequency in txt. To count occurrences, do not split the text at whitespace; 
instead use a regular expression to search for the word or group of words 
surrounded by whitespace in txt.")
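
A slightly more defensive sketch of the same idea, in case it makes the intent clearer. It treats the start and end of the text as boundaries too, and it does not consume the surrounding whitespace, so back-to-back occurrences are each counted. The helper name phrase_frequencies is made up for illustration, and the lookaround-based pattern is just one possible way to express "surrounded by whitespace":

```ruby
# Hypothetical helper (not part of tm or any library): count how often each
# phrase occurs in text as a whole, whitespace-delimited unit.
def phrase_frequencies(text, phrases)
  phrases.each_with_object({}) do |phrase, freq|
    # Regexp.escape makes regex metacharacters in the phrase match literally.
    # (?<!\S) and (?!\S) require that no non-whitespace character sits
    # directly before or after the phrase, so the start and end of the text
    # also count as boundaries and no whitespace is consumed by a match.
    pattern = /(?<!\S)#{Regexp.escape(phrase)}(?!\S)/
    freq[phrase] = text.scan(pattern).length
  end
end

txt = "some text containing two days\n"
p phrase_frequencies(txt, ['two', 'two days'])
```

For the sample text this should report one occurrence each for 'two' and 'two days'. What I am after is the equivalent of this in R, ideally in a form that fits tm's term frequency matrix.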

Thanks in advance for any input!
--

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
