Dear All,

The following code illustrates the problem.

[R code]
require(tm)
exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc), encoding = "UTF-8")
dtm <- DocumentTermMatrix(examplecorpus, control = list(minWordLength = 1))
as.matrix(dtm)
[/R code]

The terms "R" and "is" were not included in the dtm, even though the control parameter minWordLength was set to 1.

    Terms
Docs good really
   1    1      0
   2    1      1

Can you reproduce this problem?

The following is my sessionInfo:

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tm_0.5-7.1

loaded via a namespace (and not attached):
[1] compiler_2.15.0 slam_0.1-23     tools_2.15.0

Regards,

CH

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.