Hi,
Part of my GSoC work is to discuss an interface for data imputation
with you.
This topic is strongly related to the issue #1963
<https://github.com/scikit-learn/scikit-learn/issues/1963> and this mail
<http://sourceforge.net/mailarchive/forum.php?thread_name=CAAkaFLUWinbY9gk4s6ePaypiHBZmsjDzNLjD3kqtpXrQcQa4Og%40mail.gmail.com&forum_name=scikit-learn-general>
on the mailing list.
I would like to separate what should be done from how it will be done,
to make it easier to check afterwards that everything was done correctly.
For the concepts, I've thought about a few things. I think it should at
least be possible:
* To easily change how the missing values are encoded.
* To use the imputators many times without retraining them.
* To use the imputators in pipelines
* To impute only some of the missing values (rows, columns or a
combination)
* To impute in-place or in a new array
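
To make the first and last points concrete, here is a rough sketch in
plain NumPy. The names missing_mask, fill_missing, missing_values and
copy are only placeholders I made up for the illustration, not an
existing scikit-learn API, and the sketch only handles dense arrays.

```python
import numpy as np


def missing_mask(X, missing_values=np.nan):
    """Return a boolean mask of the entries encoded as missing."""
    # NaN never compares equal to itself, so it needs a dedicated test.
    if isinstance(missing_values, float) and np.isnan(missing_values):
        return np.isnan(X)
    return X == missing_values


def fill_missing(X, fill_value, missing_values=np.nan, copy=True):
    """Replace missing entries by fill_value, in a new array or in-place."""
    if copy:
        X = X.copy()
    X[missing_mask(X, missing_values)] = fill_value
    return X


# Here -1.0 is (arbitrarily) chosen as the encoding of a missing value.
X = np.array([[1.0, -1.0], [3.0, 4.0]])
X_new = fill_missing(X, 0.0, missing_values=-1.0)  # X is left untouched
X_same = fill_missing(X, 0.0, missing_values=-1.0, copy=False)  # X is modified
```

With copy=False the function returns the very same array object it was
given, which is what I mean by imputing in-place.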
For me, data imputation is simply a particular transformation of the
data. Particular in the sense that it doesn't change the shape of
the data and could be done in-place. So, I suggest using the existing
transform(), inverse_transform() and fit_transform() methods:
* The transform() method would impute the data
* The inverse_transform() would remove the data from the selected
rows/columns
* The fit_transform() would be used when the rows/columns to predict
are included in the data used to fit (like in the matrix completion
problem)
* The rows/columns could be selected:
  o Using the constructor
  o Or using the imputation method
* For the representation of the data, I think that by default the
entries that are not explicitly stored in sparse matrices should be
considered missing. A parameter in the constructor of the imputators
could allow the user to select another value.
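
As a very rough sketch of how these points could fit together, here is
a toy imputer that fills missing entries with column means. Everything
in it is an assumption made for the sake of discussion: the class name
ColumnMeanImputer and the parameters missing_values, columns and copy
are hypothetical, it only handles dense arrays, the columns are
selected through the constructor only, and inverse_transform() is left
out.

```python
import numpy as np


class ColumnMeanImputer:
    """Hypothetical imputer: fills missing entries with column means."""

    def __init__(self, missing_values=np.nan, columns=None, copy=True):
        self.missing_values = missing_values
        self.columns = columns  # None means "impute all columns"
        self.copy = copy

    def _mask(self, X):
        # NaN never compares equal to itself, so it needs a dedicated test.
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(X)
        return X == self.missing_values

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Column means computed over the observed entries only.
        observed = np.ma.masked_array(X, mask=self._mask(X))
        self.means_ = observed.mean(axis=0).filled(np.nan)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.copy:
            X = X.copy()
        mask = self._mask(X)
        if self.columns is not None:
            # Restrict the imputation to the selected columns.
            selected = np.zeros(X.shape[1], dtype=bool)
            selected[self.columns] = True
            mask &= selected
        X[mask] = np.broadcast_to(self.means_, X.shape)[mask]
        return X

    def fit_transform(self, X):
        # The data to impute is the data used to fit.
        return self.fit(X).transform(X)


X_train = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[np.nan, np.nan]])

# Fit once, then reuse on new data; only column 1 is imputed.
imputer = ColumnMeanImputer(columns=[1]).fit(X_train)
X_test_imputed = imputer.transform(X_test)

# Fit and impute the same data in one call, all columns.
X_train_imputed = ColumnMeanImputer().fit_transform(X_train)
```

Since the object only exposes fit(), transform() and fit_transform(),
it should be usable as a step in a pipeline like any other transformer,
and a fitted imputer can be applied to new data without retraining.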
Here <http://pastebin.com/GLPcfhix> is the list of all the declarations
of transform() methods in scikit-learn.
Have a nice day!
Nicolas
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general