Milan, Jeff, Patrick,

Thank you for your comments and suggestions.

Milan,

This is far from a "completely theoretical problem". I am performing text analytics on a corpus of about 2 million documents. There are tens of thousands of distinct words (lemmata). It seems to me that the natural representation of words is as an "enumeration type" -- in R terms, a "factor".

Why do I think factors are the "natural way" of representing such things? Because for most kinds of analysis, only their *identity* matters (not their spelling as words), but the human user would like to see names, not numbers. That is pretty much the definition of an enumeration type.

In terms of R implementation, R is very efficient in dealing with integer identities and indexing (e.g. tabulate) and not very efficient in dealing with character identities -- indeed, 'table' first converts strings into factors.

Of course I could represent the lemmata as integers, and perform the translation between integers and strings myself, but that would just be duplicating the function of an enumeration type.

Jeffrey,

Extending R "via the mechanisms in place" is exactly what I have in mind. Of course, if it's already been done, I'd rather reuse that work than start from scratch, which is why my message explicitly asks if there is a "factors package using this or some similar approach". I did search CRAN and wasn't able to find such a thing, but I may have missed something, which is why I sent my message to the list.

Patrick,

data.table certainly has some useful mechanisms, and I've been experimenting with it as an implementation mechanism, though it's not a drop-in substitute for factors. Also, though it is efficient for set operations between small sets and large sets, it is not very efficient for operations between two large sets -- I am working with its implementors to see if we can put in place a better algorithm based on e.g. Demaine et al. <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.9963> and Barbay et al. <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.9365>.

Thanks everyone, and if you do come across a relevant CRAN package, I'd be very interested in hearing about it.

-s

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
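
P.S. To make the "enumeration type" point concrete, here is a minimal sketch of what I mean by working with integer identities while still showing names to the user. The toy vector and the variable names (lemmata, codes, lev, counts) are made up for illustration; only factor, levels, tabulate and table are base R.

```r
# Hypothetical toy corpus: one lemma per token (illustrative data only).
lemmata <- c("be", "have", "be", "do", "have", "be")

# A factor stores each lemma as an integer code plus a vector of levels.
f     <- factor(lemmata)
codes <- as.integer(f)   # integer identities, used for the analysis
lev   <- levels(f)       # code -> name lookup, used only for display

# Count occurrences by integer code, then attach the names for the user.
counts <- tabulate(codes, nbins = length(lev))
names(counts) <- lev
print(counts)

# The counts agree with table(), which converts the strings to a factor first.
stopifnot(all(counts == as.vector(table(lemmata))))
```

This is only meant to show the division of labour (integers for computation, levels for presentation), not the design of any particular package.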