On 11/29/2011 8:49 PM, Owen Densmore wrote:
Specifically, if the data set has highly correlated features, such as the sq. ft. of a house and its number of floors, a dimensionality reduction algorithm is very likely to find the high correlation between # floors and sq. ft. and merge the two into a single new reduced term.


A difficulty arises: what do you name the new, reduced features?
Reserve a forbidden character (e.g. \001) as a delimiter and, whenever terms are merged, join their original strings with it, forming a lexicon of those unique strings. Then you don't need to remember the index -> string relationships of the original encoding. Alternatively, for a denser encoding, take the integers corresponding to the terms' row or column indices, form a tuple or list of those indices, and hash on that to get the new identifier. You could accumulate these recursively if you want to know the history of the encodings.
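
A rough Python sketch of both naming schemes, for illustration only. The names merge_names, split_name and IndexLexicon are made up here, as are the example feature names; it assumes \001 never appears in an original feature name and that the original features are numbered 0..n-1.

```python
DELIM = "\x01"  # the reserved character \001; assumed absent from every original name


def merge_names(*names):
    """String scheme: join the merged features' names with the delimiter."""
    return DELIM.join(names)


def split_name(name):
    """Recover the original feature names from a compound name."""
    return name.split(DELIM)


class IndexLexicon:
    """Dense scheme: a tuple of indices is hashed (as a dict key) to a new id."""

    def __init__(self, n_original):
        self.next_id = n_original  # ids 0..n_original-1 are the original features
        self.by_parts = {}         # tuple of component ids -> new id
        self.parts_of = {}         # new id -> tuple of component ids

    def merge(self, *ids):
        key = tuple(sorted(ids))
        if key not in self.by_parts:
            self.by_parts[key] = self.next_id
            self.parts_of[self.next_id] = key
            self.next_id += 1
        return self.by_parts[key]

    def history(self, ident):
        """Expand an id recursively into the nested tuple of ids it came from."""
        if ident not in self.parts_of:
            return ident  # an original, unmerged feature
        return tuple(self.history(p) for p in self.parts_of[ident])


# Hypothetical features: 0 = "sq_ft", 1 = "floors", 2 = "bedrooms"
compound = merge_names("sq_ft", "floors")
lex = IndexLexicon(3)
size = lex.merge(0, 1)         # first reduction: sq_ft + floors
size_beds = lex.merge(size, 2) # second reduction reuses the merged id
```

Merging a compound name with another name in the string scheme gives a flat list of originals, so only the index scheme as sketched here keeps the order in which reductions happened.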

Marcus

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org
