Hi,

Part of my GSoC project is to discuss an interface for data imputation with you.

This topic is closely related to issue #1963 <https://github.com/scikit-learn/scikit-learn/issues/1963> and to this mail <http://sourceforge.net/mailarchive/forum.php?thread_name=CAAkaFLUWinbY9gk4s6ePaypiHBZmsjDzNLjD3kqtpXrQcQa4Og%40mail.gmail.com&forum_name=scikit-learn-general> on the mailing list.

I would like to separate what should be done from how it will be done, to make it easier to check afterwards that everything was done correctly.

For the concepts, I've thought about a few things. I think it should at least be possible:

 * To easily change how the missing values are encoded.
 * To use the imputators many times without retraining them.
 * To use the imputators in pipelines
 * To impute only some of the missing values (rows, columns or a
   combination)
 * To impute in-place or in a new array
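
To make two of these points concrete (configurable encoding of missing values, and in-place versus new-array imputation), here is a minimal sketch. The names missing_mask, impute_constant, missing_values and copy are hypothetical and only serve the illustration; they are not an existing scikit-learn API.

```python
import numpy as np


def missing_mask(X, missing_values=np.nan):
    """Boolean mask of the entries encoded as missing (hypothetical helper)."""
    X = np.asarray(X)
    # NaN never compares equal to itself, so it needs a dedicated test.
    if isinstance(missing_values, float) and np.isnan(missing_values):
        return np.isnan(X)
    return X == missing_values


def impute_constant(X, fill_value, missing_values=np.nan, copy=True):
    """Replace missing entries by fill_value, in a new array or in place.

    With copy=False, X must already be a NumPy array; it is modified directly.
    """
    X = np.array(X, dtype=float) if copy else X
    X[missing_mask(X, missing_values)] = fill_value
    return X


# Here -1 encodes a missing value instead of NaN.
X = np.array([[1.0, -1.0], [3.0, 4.0]])
X_new = impute_constant(X, 0.0, missing_values=-1)      # X is left untouched
impute_constant(X, 0.0, missing_values=-1, copy=False)  # X itself is modified
```

The constant fill is only a placeholder for whatever imputation strategy is chosen; the point is that the encoding and the copy behaviour are parameters rather than hard-coded choices.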

For me, data imputation is simply a particular transformation of the data. Particular in the sense that it doesn't change the shape of the data and could be done in-place. So, I suggest using the existing transform(), inverse_transform() and fit_transform() methods:

 * The transform() method would impute the data
 * The inverse_transform() would remove the data from the selected
   rows/columns
 * The fit_transform() would be used when the rows/columns to predict
   are included in the data used to fit (like in the matrix completion
   problem)
 * The rows/columns could be selected:
     o Using the constructor
     o Or using the imputation method
 * For the representation of the data, I think that, by default, the
   non-encoded entries of sparse matrices (those that are not
   explicitly stored) should be considered missing. A parameter in the
   constructor of the imputators could allow the user to select
   another value.
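
As a rough illustration of the proposal above, here is a sketch of what such an imputator could look like for dense arrays. Everything in it is an assumption made for the example: the class name MeanImputer, the constructor parameters missing_values, columns and copy, the statistics_ attribute, and the choice of the column mean as the imputation strategy. The inverse_transform() shown is one possible reading of "remove the data from the selected rows/columns": it marks every entry of the selected columns as missing. Row selection and sparse matrices are left out.

```python
import numpy as np


class MeanImputer:
    """Hypothetical imputator: missing entries are replaced by column means."""

    def __init__(self, missing_values=np.nan, columns=None, copy=True):
        self.missing_values = missing_values  # how missing values are encoded
        self.columns = columns                # columns to impute; None = all
        self.copy = copy                      # False = impute in place

    def _missing(self, X):
        # NaN never compares equal to itself, so it needs a dedicated test.
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(X)
        return X == self.missing_values

    def _selected_missing(self, X):
        mask = self._missing(X)
        if self.columns is not None:
            selected = np.zeros(X.shape[1], dtype=bool)
            selected[self.columns] = True
            mask &= selected  # keep only the selected columns
        return mask

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Hide the missing entries, then average the observed ones per column.
        observed = np.where(self._missing(X), np.nan, X)
        self.statistics_ = np.nanmean(observed, axis=0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float) if self.copy else X
        mask = self._selected_missing(X)
        X[mask] = np.broadcast_to(self.statistics_, X.shape)[mask]
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        X = np.array(X, dtype=float) if self.copy else X
        cols = slice(None) if self.columns is None else self.columns
        X[:, cols] = self.missing_values  # mark the selected columns as missing
        return X


X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
imputer = MeanImputer(columns=[0]).fit(X_train)

# The fitted imputator can be reused on new data without retraining.
X_test = np.array([[np.nan, np.nan]])
X_imputed = imputer.transform(X_test)  # only column 0 is filled in

# Matrix-completion style: impute the same data that was used to fit.
X_completed = MeanImputer().fit_transform(X_train)
```

Here the columns are selected in the constructor; selecting them through an extra argument of transform() would be the other option listed above, at the cost of departing from the usual transform(X) signature.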

Here <http://pastebin.com/GLPcfhix> is the list of all the declarations of transform() methods in scikit-learn.

Have a nice day!

Nicolas
