On Thu, Jun 07, 2012 at 10:40:32AM -0700, Vandana Bachani wrote:
> Hi Andreas,
>
> I agree missing data is not specific to MLP.
> We dealt it with pretty simple as u mentioned by taking mean over the
> dataset for continuous-valued attributes.
> Another thing that I feel is not adequately explored in the scikit
> implementations is the discrete attributes.
> Classification problems with discrete input features or a mix of discrete
> and continuous features cannot be handled well. Many UCI datasets have a
> mix of discrete and continuous attributes.
> For discrete attributes we consider the missing values as another kind of
> discrete value namely 'UNKNOWN'.

How are you encoding the discrete features? As one-hot vectors? In that
case, a natural encoding for "unknown" is a zero vector: because those
inputs are all zero, the stochastic gradient step is a no-op with respect
to the weights attached to every possible value of that feature.

Whether it's sensible to do *only* this depends, again, on whether the
data is assumed missing-at-random or not.

David

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
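
A minimal sketch of the zero-vector encoding suggested in the reply, to make the "no-op" claim concrete. It is not code from the thread or from scikit-learn: the helper name one_hot_with_unknown, the "color" feature, the single linear unit, and the squared loss are all assumptions chosen for illustration. It only shows that the weights attached to a missing discrete feature receive a zero gradient.

```python
import numpy as np

def one_hot_with_unknown(value, categories):
    """One-hot encode `value`; a missing value (None) maps to the zero vector."""
    vec = np.zeros(len(categories))
    if value is not None:
        vec[categories.index(value)] = 1.0
    return vec

# Hypothetical discrete feature with three possible values.
colors = ["red", "green", "blue"]

# One training example: a continuous feature plus a missing "color".
x = np.concatenate(([0.7], one_hot_with_unknown(None, colors)))
y = 1.0

# Single linear unit with squared loss: L = 0.5 * (w.x + b - y)^2
rng = np.random.default_rng(0)
w = rng.normal(size=x.shape)
b = 0.0

error = w @ x + b - y
grad_w = error * x   # dL/dw: zero wherever the input is zero
grad_b = error       # dL/db: the bias is still updated

lr = 0.1
w_new = w - lr * grad_w

# The weights for "red", "green" and "blue" are left unchanged by this step.
print(np.array_equal(w_new[1:], w[1:]))
```

The same argument carries over to the first layer of an MLP, since the gradient for each input weight is the input value times the back-propagated error. The bias and the weights of the other features still move, so whether this is enough on its own still depends on the missing-at-random assumption mentioned above.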