On Fri, Jun 21, 2013 at 2:28 PM, Lars Buitinck <l.j.buiti...@uva.nl> wrote:
>
> Besides, scipy.sparse is hard to update in-place, is a very wasteful
> representation for dense data and is harder to work with than np.array
> (for us, but more importantly for users).
>
And it can't be masked..?
> > Currently, -1 is used for missing target values for semi-supervised
> > learning, not that there's a lot of it in scikit-learn. See #547, #430.
>
> -1 is a very valid feature value, though. It's only treated as a
> special label value in a few restricted cases (semi-supervised
> learning, outlier detection).
>
Of course. I didn't mean quite what I said. I meant that in at least some of
those cases, the user was able to specify another value.
One advantage of masked arrays is that they work with any underlying dtype,
though I'm not sure that's useful here (can you impute categorical values
before binarizing them?). Another is that they provide implementations of
mean, std, sum, median, max, etc. that ignore the masked cells. This allows
vectorized operations across the whole matrix rather than going column by
column to skip the nans. So masked arrays will probably be a useful
intermediate format anyway. Their main drawbacks are user-friendliness and
the additional testing needed to make sure everything handles them.
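To illustrate the point above, here is a minimal sketch (with made-up data) of a masked-array statistic that skips the masked cells in one vectorized call:

```python
import numpy as np

# Hypothetical feature matrix with two "missing" cells marked in the mask.
X = np.ma.masked_array(
    data=[[1.0, 2.0],
          [3.0, 4.0],
          [5.0, 6.0]],
    mask=[[False, True],
          [False, False],
          [True, False]],
)

# Column means in a single vectorized call; masked cells are ignored,
# so column 0 averages only 1.0 and 3.0, column 1 only 4.0 and 6.0.
col_means = X.mean(axis=0)
print(col_means)  # [2.0 5.0]
```

The same works for any underlying dtype, which is what makes masked arrays attractive as an intermediate representation.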
And I agree that sparse input should probably not be handled (for now?)...
If for no other reason than that for sparse X you can't currently do things
neatly like X[np.isnan(X)] = 1 (note that X == np.nan would match nothing,
since nan != nan, so you need isnan -- but you probably will be able to do
this on sparse matrices after this GSOC).
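For dense arrays the idiom is already one line (sketch with made-up data; note that the comparison form X == np.nan silently matches nothing):

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])

# This matches nothing, because nan != nan by IEEE 754 rules:
X[X == np.nan] = 1.0          # no-op

# The working dense idiom uses np.isnan as a boolean mask:
X[np.isnan(X)] = 1.0
print(X)  # [[1. 1.] [1. 4.]]
```

It's exactly this kind of boolean-mask assignment that scipy.sparse matrices don't currently support neatly.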
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general