Hi Vlad,

This is a problem that I run into often. In my setting, the 'document' would be a subject, and I might have multiple observations (time points) per subject.
In practice, I have found that there are two efficient ways of solving it, and that both approaches have pros and cons:

1) Concatenate everything in a big 2D array 'X', and keep a vector of 'labels' that tracks which 'document' each sample belongs to. If you want to apply a multitask learner, such as a group-lasso, to such a problem, this is often a good representation.

2) Have a list of 2D arrays, and at some point a learner (or a transform) that knows how to do something clever with it. In practice, you probably want to avoid a plain list: it is better to have an array of dtype 'object' that contains the arrays, because it then supports fancy indexing.

The pro of approach 1) is that it works out of the box with estimators that don't support the notion of 'multi-task' or grouping, while still letting you use the 'leave-one-label-out' approach for cross-validation. The con is that it creates big arrays in memory that get copied during the cross-validation.

The pro of approach 2) is specifically that it avoids that last problem. The con is that it requires a special estimator or transform.

I've dealt reasonably well with this problem in research code over the last couple of years. We are going to want to release this code somewhat soon, so we are going to have to clean up our APIs. It will be interesting to see what comes out of this clean-up.

Gaël

On Wed, Oct 31, 2012 at 01:13:53PM +0000, Vlad Niculae wrote:
> It seems I have reached again the need for something that became
> apparent when working with image patches last summer. Sometimes we
> don't have a 1 to 1 correspondence between samples (rows in X) and
> actual documents we are interested in scoring over. Instead, each
> document consists of (a different) number of samples.
> This can be implemented either as an extra masking array that says
> for each sample, what document it belongs to, by grouping `y` into
> a list of lists (cumbersome and fails for the unsupervised case), or
> by more clever / space efficient methods.
> The question is: did you need this? If so, how did you implement it?
> Are you aware of other general purpose libraries that provide such
> an API? Because I'm not. Next question is, what can we do about it?

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
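
For illustration only, here is a minimal NumPy sketch of the two representations discussed in the reply: one concatenated 2D array with a label vector, and a 1D array of dtype 'object' holding one 2D array per subject. The toy data, the variable names, and the hand-rolled `leave_one_label_out` helper are all assumptions made for this example. The helper is not a scikit-learn API and not the research code mentioned in the thread.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy data: 3 subjects ("documents") with 2, 4 and 3
# observations each, 5 features per observation.
per_subject = [rng.randn(n, 5) for n in (2, 4, 3)]

# Approach 1: one big 2D array plus a vector of labels that records
# which subject each row belongs to.
X = np.vstack(per_subject)
labels = np.concatenate(
    [np.full(len(a), i) for i, a in enumerate(per_subject)]
)


def leave_one_label_out(labels):
    """Yield (train_idx, test_idx) pairs, holding out one label at a time.

    Hand-rolled helper for this sketch only.
    """
    for held_out in np.unique(labels):
        test = labels == held_out
        yield np.flatnonzero(~test), np.flatnonzero(test)


splits = list(leave_one_label_out(labels))
train_idx, test_idx = splits[0]
# Indexing with an integer array returns a copy of the selected rows;
# this is the memory cost of approach 1 during cross-validation.
X_train = X[train_idx]

# Approach 2: a 1D object array whose elements are the per-subject arrays.
# Preallocate and fill, so NumPy never tries to stack the inner arrays.
docs = np.empty(len(per_subject), dtype=object)
for i, a in enumerate(per_subject):
    docs[i] = a

# Fancy indexing selects whole subjects and copies only references,
# not the underlying data.
subset = docs[[0, 2]]
```

With approach 2, a cross-validation split only has to index `docs` by subject, so the per-subject arrays are shared rather than duplicated. The price, as noted in the reply, is that whatever consumes `docs` has to know how to handle an array of arrays.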