Re: [Scikit-learn-general] Interface for data imputation

Joel Nothman Thu, 20 Jun 2013 17:38:07 -0700

I'm not certain I've understood all of your suggestions. I assume we can
consider a simple example like:

* X has some unknown values for some features
* they should be filled with the mean of the known values for those features

As long as the representation of unknown values is known (be it a
particular value, or use of a masked array), writing a Transformer should
be pretty straightforward, but I don't understand why you need extra
arguments to transform (which you imply by linking to #1963), or how
inverse_transform could possibly work. Could you give us an example?
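
To make the simple case concrete, here is a minimal sketch of such a Transformer, assuming NumPy arrays and NaN as the default missing-value marker. The class name MeanImputer, the missing_values parameter and the means_ attribute are made up for illustration; this is not an existing scikit-learn API.

```python
import numpy as np


class MeanImputer:
    """Hypothetical transformer: fill missing entries with per-feature means."""

    def __init__(self, missing_values=np.nan):
        # How unknown values are encoded is a constructor parameter.
        self.missing_values = missing_values

    def _missing_mask(self, X):
        # NaN never compares equal to itself, so it needs a special case.
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(X)
        return X == self.missing_values

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        known = np.ma.masked_array(X, mask=self._missing_mask(X))
        # Mean over known values only; a feature with no known values gets 0.0.
        self.means_ = known.mean(axis=0).filled(0.0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy, so the input is left untouched
        mask = self._missing_mask(X)
        # Column index of every missing cell, in the same order as X[mask].
        X[mask] = np.take(self.means_, np.nonzero(mask)[1])
        return X


X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
imputer = MeanImputer().fit(X)
filled = imputer.transform(X)
```

Note that transform needs nothing beyond X here, since the fitted means and the missing-value marker both live on the object.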

Now I realise why you might want extra arguments to transform: if
imputation is conditioned on some additional array of per-sample data. Yes,
this would not currently work nicely with Pipeline (or FeatureUnion), and
perhaps it is a case for making those things more powerful, or perhaps it
is a case for suggesting the user write their own extension...
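
As a rough illustration of the per-sample case, the sketch below fills each missing value with the mean of the sample's group. GroupMeanImputer and the groups argument are hypothetical names, and NaN is assumed to mark missing values. The second positional argument to transform is exactly the part that does not fit the usual transform(X) call.

```python
import numpy as np


class GroupMeanImputer:
    """Hypothetical: fill NaNs with the per-feature mean of the sample's group."""

    def fit(self, X, groups):
        X = np.asarray(X, dtype=float)
        groups = np.asarray(groups)
        # One vector of per-feature means for each distinct group label.
        self.group_means_ = {
            g: np.nanmean(X[groups == g], axis=0) for g in np.unique(groups)
        }
        return self

    def transform(self, X, groups):
        X = np.array(X, dtype=float)  # work on a copy
        for i, g in enumerate(np.asarray(groups)):
            missing = np.isnan(X[i])
            X[i, missing] = self.group_means_[g][missing]
        return X


X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 10.0],
              [5.0, 20.0]])
groups = [0, 0, 1, 1]  # per-sample side information
filled = GroupMeanImputer().fit(X, groups).transform(X, groups)
```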

If imputation is conditioned on per-feature data, I think that is better
provided as an object parameter, not a method parameter. See #2027, #2034
about handling heterogeneous data in transformers.

In the current scipy.sparse implementation, the value of non-encoded data
in sparse matrices is necessarily zero, and setting cells to zero makes
them disappear in sparse matrix transformations. So you can't use unfilled
cells as missing data, except where 0 isn't an option for actual values. In
general, you can allow the missing value indicator to be set as a
transformer parameter.
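
A small sketch of the sparse behaviour described above, assuming NumPy and scipy.sparse are installed. The variable names are only for illustration.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0.0, 1.0],
                  [2.0, 0.0]])
X = sp.csr_matrix(dense)

# Only the two non-zero cells are stored; the zeros are implicit.
stored = X.nnz
# Converting back gives 0.0 for every unstored cell, so "unstored" and
# "really zero" cannot be told apart.
round_trip = X.toarray()

# A zero can be stored explicitly, but it is dropped as soon as
# something calls eliminate_zeros().
data = np.array([0.0, 1.0])
rows = np.array([0, 0])
cols = np.array([0, 1])
Y = sp.csr_matrix((data, (rows, cols)), shape=(2, 2))
stored_before = Y.nnz
Y.eliminate_zeros()
stored_after = Y.nnz
```

This is why a separate, user-chosen marker value is the safer way to flag missing entries in sparse input.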

Currently, -1 is used for missing target values for semi-supervised
learning, not that there's a lot of it in scikit-learn. See #547, #430.
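
For illustration only, this is how the -1 convention looks on a target vector, using plain NumPy and invented labels:

```python
import numpy as np

y = np.array([0, 1, -1, 1, -1])  # -1 marks a sample with no known label

unlabeled = y == -1
labeled_idx = np.flatnonzero(~unlabeled)    # samples with a known label
unlabeled_idx = np.flatnonzero(unlabeled)   # samples still to be labelled
```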

Finally, consider using `git grep -p 'def transform'` to see which class
each transform method belongs to.

- Joel

On Fri, Jun 21, 2013 at 7:56 AM, Nicolas Trésegnie <
nicolas.treseg...@gmail.com> wrote:

>  Hi,
>
> A part of my job for the GSoC is to discuss with you an interface for data
> imputation.
>
> This topic is strongly related to the issue #1963
> <https://github.com/scikit-learn/scikit-learn/issues/1963> and this mail
> <http://sourceforge.net/mailarchive/forum.php?thread_name=CAAkaFLUWinbY9gk4s6ePaypiHBZmsjDzNLjD3kqtpXrQcQa4Og%40mail.gmail.com&forum_name=scikit-learn-general>
> on the mailing list.
>
> I would like to separate what should be done from how it will be done, to
> make it easier to check that everything was done correctly afterwards.
>
> For the concepts, I've thought about a few things. I think it should at
> least be possible:
>
>    - To easily change how the missing values are encoded.
>    - To use the imputators many times without retraining them.
>    - To use the imputators in pipelines
>    - To impute only some of the missing values (rows, columns or a
>    combination)
>    - To impute in-place or in a new array
>
> For me, data imputation is simply a particular transformation of the data.
> Particular in the sense that it doesn't change the shape of the data
> and could be done in-place. So, I suggest using the existing transform(),
> inverse_transform() and fit_transform() methods:
>
>    - The transform() method would impute the data
>    - The inverse_transform() would remove the data from the selected
>    rows/columns
>    - The fit_transform() would be used when the rows/columns to predict
>    are included in the data used to fit (as in the matrix completion
>    problem)
>    - The rows/columns could be selected:
>       - Using the constructor
>       - Or using the imputation method
>    - For the representation of the data, I think by default the
>    non-encoded data in sparse matrices should be considered missing. A
>    parameter in the constructor of the imputators could allow the user to
>    select another value.
>
> Here <http://pastebin.com/GLPcfhix> is the list of all the declarations
> of transform() methods in scikit-learn.
>
> Have a nice day!
>
> Nicolas
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by Windows:
>
> Build for Windows Store.
>
> http://p.sf.net/sfu/windows-dev2dev
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

