Hi Vlad. This is definitely a good question. I run into it often when representing an image as a bag of keypoints / features. Why would it not be a good solution for X to be a list of arrays / lists? Which algorithms do you want to use such samples in? The text feature extraction sort of deals with this by using a list, right?
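To make the comparison concrete, here is a minimal sketch in plain numpy of the two representations being discussed: a list of per-document arrays, versus one stacked X plus a membership array. The names are made up for illustration; nothing like this exists as scikit-learn API.

```python
import numpy as np

# Three "documents" with 2, 3 and 1 samples of 4 features each,
# represented as a list of arrays (what I suggest above).
rng = np.random.RandomState(0)
X_list = [rng.rand(2, 4), rng.rand(3, 4), rng.rand(1, 4)]

# Flat alternative: stack everything into one 2-d X and keep a
# parallel "groups" array saying which document each row came from.
X = np.vstack(X_list)
groups = np.repeat(np.arange(len(X_list)), [len(d) for d in X_list])

print(X.shape)         # (6, 4)
print(groups)          # [0 0 1 1 1 2]
```

The flat version keeps every estimator's `fit(X, y)` signature untouched; only the objects that care about document membership would need to accept the extra array.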
Cheers,
Andy

On 10/31/2012 01:13 PM, Vlad Niculae wrote:
> Hello,
>
> It seems I have reached again the need for something that became
> apparent when working with image patches last summer. Sometimes we
> don't have a one-to-one correspondence between samples (rows in X)
> and the actual documents we are interested in scoring. Instead, each
> document consists of a (different) number of samples.
>
> This can be implemented as an extra membership array that says, for
> each sample, which document it belongs to; by grouping `y` into a
> list of lists (cumbersome, and it fails in the unsupervised case);
> or by more clever / space-efficient methods.
>
> The question is: did you ever need this? If so, how did you
> implement it? Are you aware of other general-purpose libraries that
> provide such an API? Because I'm not. The next question is: what can
> we do about it?
>
> Example applications:
>
> - Image classification:
>   First, from each image we extract k-by-k image patches, then we
>   transform them by sparse coding, and finally we feed them into a
>   classifier. This classifies each patch individually, but in the
>   end we want to group the results within each image and compute
>   "local" scores, or just take the max, for example.
>
>   If using something like CIFAR, where all images have the same
>   size, the problem is simplified because each image is split into
>   exactly the same number of patches. If images have different
>   shapes, or in the following examples, this assumption cannot be
>   made.
>
> - Coreference resolution:
>   A successful model for this problem is based on the mention-pair
>   structure. The goal is to identify clusters of noun phrases that
>   refer to the same real-world entity. For each document (e.g. a
>   news article), the possible mentions (NPs, pronouns) are
>   identified. The feature extraction then builds "samples" in the
>   form of all possible pairs of these (sometimes we filter out
>   pairs that are obviously not coreferent, e.g. he / she, but this
>   is disputable).
>
>   Evaluating such systems requires averaging over document-level
>   scores, because the scores typically used do not distribute over
>   averaging. [1]
>
> - Hyphenation:
>   This is just something I'm currently working on, but the same
>   situation might occur more often. Documents are words, and
>   samples are positions between letters within each word. Labels
>   say whether it is correct to insert a hyphen at that position or
>   not. In the end, sklearn can easily report how many hyphens were
>   correctly identified over the whole available dictionary.
>   However, a more realistic score would be: how many words were
>   hyphenated fully correctly? This matters because a sequence model
>   can be smart enough to know that it is rare to insert three
>   hyphens in a row (a pattern like ...xx-x-x-xx...), thanks to its
>   global, document-level awareness. It would be interesting to see
>   how much this brings over a local SVM classifier that only sees
>   one position at a time.
>
> Objects that should be aware of this:
>
> - score functions / metrics,
> - some transformers,
> - resamplers / shufflers: we either want to keep documents
>   together, or make sure that when reshuffling, document
>   membership is not lost.
>
>
> Best,
> Vlad
> ------------------
> Vlad N.
> http://vene.ro
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_sfd2d_oct
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
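P.S. For what it's worth, once you have such a membership array, both the per-image pooling and the "word fully hyphenated correctly" score reduce to a group-by. A rough sketch with plain numpy; the helper names are hypothetical, nothing like this exists in scikit-learn:

```python
import numpy as np

def group_max(scores, groups):
    """Pool per-sample scores into one score per document,
    e.g. taking the max over an image's patches (hypothetical helper)."""
    return np.array([scores[groups == g].max() for g in np.unique(groups)])

def whole_group_accuracy(y_true, y_pred, groups):
    """Fraction of documents whose samples are *all* predicted
    correctly -- e.g. words hyphenated entirely right."""
    ok = [np.all(y_true[groups == g] == y_pred[groups == g])
          for g in np.unique(groups)]
    return np.mean(ok)

# Two words of three positions each: the first predicted perfectly,
# the second with one wrong position.
y_true = np.array([0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1])

print(whole_group_accuracy(y_true, y_pred, groups))  # 0.5
# Plain position-level accuracy would report 5/6 instead.

print(group_max(np.array([0.2, 0.9, 0.1, 0.3, 0.4, 0.2]), groups))
# [0.9 0.4]
```

The same array could drive the resampling case: shuffle `np.unique(groups)` rather than the rows of X, so documents stay intact.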