On Thu, Jun 07, 2012 at 10:40:32AM -0700, Vandana Bachani wrote:
> Hi Andreas,
> 
> I agree missing data is not specific to MLP.
> We dealt with it quite simply, as you mentioned, by taking the mean over
> the dataset for continuous-valued attributes.
> Another thing that I feel is not adequately explored in the scikit
> implementations is the handling of discrete attributes.
> Classification problems with discrete input features or a mix of discrete
> and continuous features cannot be handled well. Many UCI datasets have a
> mix of discrete and continuous attributes.
> For discrete attributes we treat missing values as one more discrete
> value, namely 'UNKNOWN'.

How are you encoding the discrete features? As one-hot vectors?

In that case, a natural encoding for "unknown" is the zero vector: the
stochastic gradient step is then a no-op for the weights attached to
every possible value of that feature. Whether it's sensible to do *only*
this depends, again, on whether the data is assumed missing-at-random
or not.
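
To make the no-op concrete, here is a minimal NumPy sketch. It assumes a
single linear layer with a squared-error loss (not an actual MLP), and
the attribute values, the helper names one_hot_or_zero and
input_weight_gradient, and the numbers are all made up for illustration:

```python
import numpy as np

def one_hot_or_zero(value, categories):
    """One-hot vector for `value`; all zeros if it is missing or unseen."""
    vec = np.zeros(len(categories))
    if value in categories:
        vec[categories.index(value)] = 1.0
    return vec

def input_weight_gradient(x, W, target):
    """Gradient of 0.5 * ||x @ W - target||^2 with respect to W."""
    error = x @ W - target       # shape (n_outputs,)
    return np.outer(x, error)    # shape (n_inputs, n_outputs)

colors = ["red", "green", "blue"]          # hypothetical discrete attribute
rng = np.random.default_rng(0)
W = rng.normal(size=(len(colors) + 1, 2))  # 3 one-hot inputs + 1 continuous
target = np.array([1.0, 0.0])

# One sample: the discrete attribute is missing, the continuous one is not.
x_missing = np.concatenate([one_hot_or_zero(None, colors), [0.7]])
grad = input_weight_gradient(x_missing, W, target)

# The rows of `grad` for the three one-hot inputs are exactly zero, so an
# SGD step leaves those weights alone; only the continuous row is updated.
print(grad)
```

An explicit 'UNKNOWN' column, as in your scheme, would instead get its
own weights, so the network could learn from the fact that a value is
missing.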

David
