Re: [Scikit-learn-general] API for multi-sample "documents"

Gael Varoquaux Sat, 10 Nov 2012 04:45:03 -0800

Hi Vlad,

This is a problem that I run into often. In my setting, the 'document'
would be a subject, and I might have multiple observations (time points)
per subject.

In practice, I have found that there are 2 efficient ways of solving it,
and that both approaches have pros and cons:

1) Concatenate everything in a big 2D array 'X', but have a vector of
   'labels' that tracks which 'document' a sample belongs to. If you want
   to apply a multitask learner, such as a group-lasso, to such a
   problem, this is often a good representation.

2) Have a list of 2D arrays, and at some point a learner (or a transform)
   that knows how to do something clever with it. In practice, you
   probably want to avoid having a list, and it is better to have an
   array of dtype 'object', that contains arrays, because it then
   supports fancy indexing.
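
To make the two representations concrete, here is a minimal sketch using
only NumPy. The document sizes, the feature count and the variable names
('docs', 'X_obj', ...) are made up for illustration; this is not code from
scikit-learn or from our research code.

```python
import numpy as np

# Three hypothetical 'documents' with 2, 3 and 1 samples, 4 features each.
rng = np.random.RandomState(0)
docs = [rng.rand(n_samples, 4) for n_samples in (2, 3, 1)]

# Representation 1: one big 2D array plus a vector of labels that
# records which document each row (sample) belongs to.
X = np.vstack(docs)
labels = np.repeat(np.arange(len(docs)), [len(d) for d in docs])

# Representation 2: a 1D array of dtype object whose entries are the
# per-document 2D arrays. It is filled entry by entry so that NumPy does
# not try to stack the ragged arrays into a single block.
X_obj = np.empty(len(docs), dtype=object)
for i, d in enumerate(docs):
    X_obj[i] = d

# Unlike a plain list, the object array supports fancy indexing, so a
# subset of documents can be selected with a list of indices.
subset = X_obj[[0, 2]]
```

Here 'labels' has one entry per row of 'X', and 'subset' is again an
object array holding the first and last documents.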

The pro of approach 1) is that it works out of the box with estimators
that have no notion of 'multi-task' or grouping, while still allowing the
'leave-one-label-out' approach for cross-validation. The con is that it
creates big arrays in memory that get copied during the cross-validation.
The pro of approach 2) is specifically that it avoids these copies. The
con is that it requires a special estimator or transform.
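
As an illustration of 'leave-one-label-out' on representation 1), here is
a small sketch written with NumPy only. The helper 'leave_one_label_out'
and the toy data are hypothetical, written for this example; scikit-learn
ships its own cross-validation iterator for this purpose.

```python
import numpy as np

def leave_one_label_out(labels):
    """Yield (train, test) index arrays, holding out one label at a time."""
    labels = np.asarray(labels)
    for label in np.unique(labels):
        test_mask = labels == label
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy data: 6 samples belonging to 3 'documents'.
labels = np.array([0, 0, 1, 1, 1, 2])
X = np.arange(12).reshape(6, 2)

splits = list(leave_one_label_out(labels))
for train, test in splits:
    # Fancy indexing with integer arrays copies the selected rows; this
    # is the memory cost mentioned above when X is large.
    X_train, X_test = X[train], X[test]
```

Each split holds out all the samples of one document, so samples from the
same document never appear on both sides of a split.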

I've dealt reasonably well with these problems in research code over the
last couple of years. We are going to want to release this code somewhat
soon, so we are going to have to clean up our APIs. It will be
interesting to see what comes out of this clean-up.

Gaël

On Wed, Oct 31, 2012 at 01:13:53PM +0000, Vlad Niculae wrote:
> It seems I have reached again the need for something that became
> apparent when working with image patches last summer. Sometimes we
> don't have a 1 to 1 correspondence between samples (rows in X) and
> actual documents we are interested in scoring over. Instead, each
> document consists of (a different) number of samples.

> This can be implemented either as an extra masking array that says
> for each sample, what document it belongs to, by grouping `y` into
> a list of lists (cumbersome and fails for the unsupervised case), or
> by more clever / space efficient methods.

> The question is: did you need this? If so, how did you implement it?
> Are you aware of other general purpose libraries that provide such
> an API? Because I'm not. Next question is, what can we do about it?
