On 11/29/2011 8:49 PM, Owen Densmore wrote:
Specifically, if the data set has highly correlated features, such as
the sq. ft. of a house and its number of floors, a dimensionality
reduction algorithm is very likely to find the high correlation between
# floors and sq. ft. of the house, and merge these two into a single new
reduced term.
A difficulty arises: what do you name the new, reduced features?
Reserve a forbidden character (e.g. \001) as a delimiter and append the
original strings upon the term reduction, forming a lexicon of those
unique strings. Then you don't need to remember the index -> string
relationships of the original encoding. Alternatively, to make a more
dense encoding, one could take the integers corresponding to the terms'
row or column indices and form a tuple or list of indices and hash on
that to get the new identifier. You could accumulate those tuples
recursively if you want to know the history of the encodings.
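
A rough sketch of both ideas in Python, in case it helps make them concrete. Everything named here (merge_names, split_name, IndexLexicon, the feature names) is made up for illustration and is not from any library; the dict keyed on a tuple of indices stands in for "hash on that", and sorting the tuple is my own assumption so that merge order does not matter.

```python
DELIM = "\x01"  # the reserved character (\001); must not occur in original names


def check_original_names(names):
    """Reject original feature names that already contain the delimiter."""
    for name in names:
        if DELIM in name:
            raise ValueError(f"feature name contains reserved delimiter: {name!r}")


def merge_names(names):
    """String approach: the merged feature's name is the joined originals."""
    return DELIM.join(names)


def split_name(merged):
    """Recover the original feature names from a merged name."""
    return merged.split(DELIM)


class IndexLexicon:
    """Dense approach: a tuple of member ids is the key for a new integer id."""

    def __init__(self, n_original):
        self._next_id = n_original  # original features use ids 0..n_original-1
        self._ids = {}              # tuple of member ids -> merged id
        self._members = {}          # merged id -> tuple of member ids

    def merge(self, *ids):
        key = tuple(sorted(ids))    # assumption: merge order is irrelevant
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._members[self._next_id] = key
            self._next_id += 1
        return self._ids[key]

    def history(self, ident):
        """Expand a merged id recursively into nested tuples of original ids."""
        if ident not in self._members:
            return ident            # an original feature: nothing to expand
        return tuple(self.history(m) for m in self._members[ident])


if __name__ == "__main__":
    names = ["sq_ft", "floors", "bedrooms"]   # hypothetical feature names
    check_original_names(names)
    print(split_name(merge_names(names[:2])))

    lex = IndexLexicon(len(names))
    size = lex.merge(0, 1)        # sq_ft + floors
    bigger = lex.merge(size, 2)   # merged term + bedrooms
    print(lex.history(bigger))
```

Note that the string version flattens the history (joining an already merged name with another just gives a longer delimited string), while the index version keeps the nesting, at the cost of needing the lexicon around to decode an identifier.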
Marcus
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org