Hi Christian,

Just a quick answer, as this is a vast question. I have indeed been working on similar problems for the last few years: we do not want to use classifiers as black boxes, but to use them to draw conclusions about the data-generating mechanism.
My point of view on this is that the problem should be understood as a regularized estimation problem.

On the one hand, your classifier acts as a 'forward model' that describes the link between a set of parameters, the observed data, and the target variables. That is the first component in understanding the classification. For linear models (including linear SVMs), it is simply a linear link between coefficients, a design matrix, and the observed variable.

On the other hand, your classifier most probably comes with some sort of regularization to simplify the learning. Either the forward model is parametrized with very few parameters, and that implicit restriction of the model space is a regularization, or there is an additional explicit regularization, for instance in the form of a penalization. This regularization imposes a form of prior knowledge that moves your solution away from one fully matching your data, toward one matching what the estimator thinks is a 'simple and elegant' model.

In terms of understanding and controlling the estimated parameters, knowing the effect of the regularization, and where the trade-off between model fit and regularization sits, is critical. For instance, I should not be surprised to find a sparse solution if I have used an estimator based on sparsity; drawing conclusions about that sparsity is thus difficult and probably meaningless. Knowing what can and cannot be concluded from an estimator is the scope of 'learning theory', and pretty much requires understanding the estimator's properties on a case-by-case basis. It's hard, and it's easy to come up with meaningless conclusions if you don't understand the estimator.

That said, to conclude with a positive message: something that gives you some control on the estimated parameters is to run bootstrap and permutation tests on them. Chances are that you will discover the main caveats of an estimator if you permute and bootstrap it.
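To make the sparsity point concrete, here is a minimal sketch on made-up data (all names and numbers here are illustrative, not from any real study): the same signal fitted with an L1-penalized estimator (Lasso) and an L2-penalized one (Ridge). The zeros in the Lasso solution are a consequence of the penalty, not evidence about the data by themselves.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
# Only the first 3 features actually drive the target
y = X[:, :3].sum(axis=1) + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# The L1 penalty sets many coefficients exactly to zero;
# the L2 penalty shrinks them but leaves them all non-zero.
print("Lasso non-zero coefs:", int(np.sum(lasso.coef_ != 0)))
print("Ridge non-zero coefs:", int(np.sum(ridge.coef_ != 0)))
```

So if you then report "only a handful of features matter" from the Lasso fit, you are partly reporting a property of your prior, which is exactly why the estimator's regularization must be kept in mind when interpreting coefficients.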
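The bootstrap/permutation suggestion above can be sketched as follows, again on synthetic data (the data, classifier choice, and number of resamples are all illustrative assumptions): refit the model on bootstrap resamples to see how stable each coefficient is, and refit it on permuted labels to build a null distribution to compare against.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# Only feature 0 carries signal about the labels
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

n_resample = 50

# Bootstrap: variability of the coefficients under resampling
boot_coefs = []
for _ in range(n_resample):
    idx = rng.randint(0, len(y), len(y))
    clf = LinearSVC(C=1.0, dual=False).fit(X[idx], y[idx])
    boot_coefs.append(clf.coef_.ravel())
boot_coefs = np.array(boot_coefs)

# Permutation: null distribution of the coefficients
# when the labels carry no information
perm_coefs = []
for _ in range(n_resample):
    clf = LinearSVC(C=1.0, dual=False).fit(X, rng.permutation(y))
    perm_coefs.append(clf.coef_.ravel())
perm_coefs = np.array(perm_coefs)

print("bootstrap mean coefs:", boot_coefs.mean(axis=0).round(2))
print("permutation null sd: ", perm_coefs.std(axis=0).round(2))
```

A coefficient you can trust should be stable across bootstrap resamples and clearly outside the spread of the permutation null; coefficients that bounce around or sit inside the null are the caveats I mentioned.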
Hope this helps,

Gaël
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
