On Tue, Nov 29, 2011 at 8:49 PM, Owen Densmore <[email protected]> wrote:
> Specifically, if the data set has highly correlated features such as
> sq. ft. of a house, and the number of floors, a dimensionality reduction
> algorithm is very likely to find high correlation with # floors and
> sq. ft. of the house, and merge these two into a single new reduced term.
>
> A difficulty arises: what do you name the new, reduced features?

We always used to call them "reduced dimensions 1, 2, 3, ...", because they never stuck around long enough to get familiar.

Opening lines of the abstract for a Hadley Wickham <http://had.co.nz/> talk in Pittsburgh this week:

   It's often said that 80 percent of the effort of analysis is spent just
   getting the data ready to analyze, the process of data cleaning. Data
   cleaning is not only a vital first step, but it is often repeated
   multiple times over the course of an analysis as new problems come to
   light.

If your data set is the only data set for the problem and it's already perfect, if your reduction method is the only one for the problem and it's also perfect, or if all data sets and reduction methods give the exact same reduced dimensions, then you might have time to worry about what to call those dimensions. Otherwise your time is better spent figuring out whether your data set is what you think it is, because with probability 1 it's a horrible caricature of what you think it is. And every time you fix something in the data prep, all your carefully chosen names go down the tubes along with whatever amazing theories you attached to them.

It may be that your class problems are perfect data sets for the perfect reduction methods you're asked to apply to them; that's never happened to me.

-- rec --
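The merging that the original poster describes can be seen in a few lines of numpy. This is a hypothetical sketch, not code from the thread: the feature names (`sqft`, `floors`) and the synthetic data are made up for illustration, and PCA via SVD stands in for "a dimensionality reduction algorithm."

```python
# Hypothetical sketch: two strongly correlated features (sq. ft. and
# number of floors, both invented here) collapse into one principal
# component under PCA. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
floors = rng.integers(1, 4, size=200).astype(float)
sqft = 900.0 * floors + rng.normal(0.0, 50.0, size=200)  # tracks floors closely

X = np.column_stack([sqft, floors])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so scales are comparable

# PCA via SVD of the standardized matrix
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S**2 / np.sum(S**2)  # fraction of variance per component

print(explained)  # the first component carries nearly all the variance
```

With features this correlated, the first component soaks up almost everything, and the two original columns are effectively replaced by one new axis with no natural name: exactly the "reduced dimension 1" situation above.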
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org
