Hi David,
Yes, I use one-hot encoding. My understanding is that each discrete
attribute is represented as a bit pattern, so the input node for that
attribute is actually a set of nodes, one per bit. An unknown simply means
that the bit for the unknown value is set to 1 and the rest are set to 0.
For any instance, exactly one of the nodes corresponding to an input
attribute has a value of 1. The downside of one-hot encoding is that it
increases the number of input units and therefore the size of the weight
space, but I guess that's OK, as this is one of the best ways of handling
discrete attributes for classification if we are to use MLPs.
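
To make the encoding concrete, here is a minimal sketch in plain Python.
The function name, the example categories, and the use of None to mark a
missing value are assumptions made for illustration; this is not code from
scikit-learn or from any existing implementation.

```python
def one_hot_with_unknown(values, categories):
    """Encode one discrete attribute as len(categories) + 1 bits.

    The last bit is the hypothetical UNKNOWN slot. Any value that is not
    in `categories` (including None) is mapped to it.
    """
    index = {c: i for i, c in enumerate(categories)}
    unknown_bit = len(categories)
    rows = []
    for v in values:
        bits = [0] * (len(categories) + 1)
        # Known values set their own bit; everything else sets UNKNOWN.
        bits[index.get(v, unknown_bit)] = 1
        rows.append(bits)
    return rows


colors = ["red", "green", "blue"]            # assumed example categories
X = one_hot_with_unknown(["green", None, "red"], colors)
for row in X:
    print(row)
```

Every row has exactly one bit set, so a missing value still activates one
input unit. Dropping the last column would instead give the zero-vector
encoding suggested in the quoted message below.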
Thanks,
Vandana
On Thu, Jun 7, 2012 at 11:12 AM, David Warde-Farley <
warde...@iro.umontreal.ca> wrote:
> On Thu, Jun 07, 2012 at 10:40:32AM -0700, Vandana Bachani wrote:
> > Hi Andreas,
> >
> > I agree missing data is not specific to MLP.
> > We dealt with it pretty simply, as you mentioned, by taking the mean
> > over the dataset for continuous-valued attributes.
> > Another thing that I feel is not adequately explored in the scikit
> > implementations is the discrete attributes.
> > Classification problems with discrete input features or a mix of discrete
> > and continuous features cannot be handled well. Many UCI datasets have a
> > mix of discrete and continuous attributes.
> > For discrete attributes, we treat missing values as another kind of
> > discrete value, namely 'UNKNOWN'.
>
> How are you encoding the discrete features? As one-hot vectors?
>
> In that case, a natural encoding for "unknown" is a zero-vector, as the
> stochastic gradient step will represent a no-op with respect to all of the
> weights for every possible value of that feature. Whether it's sensible
> to do *only* this depends, again, on whether the data is assumed
> missing-at-random or not.
>
> David
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
Vandana Bachani
Graduate Student, MSCE
Computer Science & Engineering Department
Texas A&M University, College Station