I wrote a text-comparison program in R, but it is too slow for my purposes. I need to make about 145 million comparisons of the word patterns in pieces of text. Essentially, I compare vectors of word counts and find the ones that are similar to each other (145 million x 2 %in% comparisons, plus a couple of million multinomial estimates). The best I could do in R would take about 9 hours of computation time: a matrix solution bogs down because of the size of the matrix (10k by 17k), and a looping solution takes too long.

I've been looking into whether a Java or C routine would be the better alternative. There seems to be quite a debate on the web about which is faster, and I don't know how much of that debate is up to date. Does anyone have an impression of whether Java or C would be faster for this application, and by how much? I'd have to learn quite a bit to implement either, so I'd rather just work on one.
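
To make the per-pair work concrete, here is a rough sketch in C of what one comparison might look like. It is not code from my program. It assumes each text is stored as a dense array of integer counts over a shared vocabulary, and the function names (n_present, n_shared, overlap) are made up for illustration. Calling overlap(a, b) and then overlap(b, a) stands in for the two %in% checks per pair.

```c
#include <stddef.h>

/* Number of vocabulary entries with a nonzero count in x. */
size_t n_present(const int *x, size_t n_words)
{
    size_t k = 0;
    for (size_t i = 0; i < n_words; i++)
        if (x[i] > 0)
            k++;
    return k;
}

/* Number of vocabulary entries with a nonzero count in both a and b. */
size_t n_shared(const int *a, const int *b, size_t n_words)
{
    size_t k = 0;
    for (size_t i = 0; i < n_words; i++)
        if (a[i] > 0 && b[i] > 0)
            k++;
    return k;
}

/* Fraction of the words used in a that are also used in b.
   Returns 0 when a contains no words at all. */
double overlap(const int *a, const int *b, size_t n_words)
{
    size_t na = n_present(a, n_words);
    if (na == 0)
        return 0.0;
    return (double)n_shared(a, b, n_words) / (double)na;
}
```

The full job would run something like this inside a double loop over all pairs of texts, and that inner loop is the part I would want to move out of R.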
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
