Hello,

It seems I have once again run into the need for something that first became apparent when working with image patches last summer. Sometimes we don't have a one-to-one correspondence between samples (rows in X) and the actual documents we are interested in scoring over. Instead, each document consists of a (different) number of samples.
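For concreteness, here is a minimal sketch of the data layout I have in mind; the `doc_id` array is purely illustrative, not an existing API:

    import numpy as np

    # One row of X per sample; documents contribute different numbers
    # of samples.  A hypothetical side array records, for each sample,
    # which document it belongs to.
    X = np.arange(14.0).reshape(7, 2)          # 7 samples, 2 features
    doc_id = np.array([0, 0, 0, 1, 1, 1, 1])   # document 0 has 3 samples,
                                               # document 1 has 4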
This can be implemented either as an extra masking array that records, for each sample, which document it belongs to (as in the sketch above), by grouping `y` into a list of lists (cumbersome, and it fails for the unsupervised case), or by cleverer / more space-efficient methods.

The question is: did you need this? If so, how did you implement it? Are you aware of other general-purpose libraries that provide such an API? Because I'm not. The next question is: what can we do about it?

Example applications:

- Image classification: first, from each image we extract k-by-k image patches, then we transform them by sparse coding, and finally we feed them into a classifier. This classifies each patch individually, but in the end we want to group the results within each image and compute "local" scores, or just take the max, for example (a rough sketch is in the P.S. below). If using something like CIFAR, where all images have the same size, the problem is simplified because each image is split into exactly the same number of patches. If images have different shapes, or in the following examples, this assumption cannot be made.

- Coreference resolution: a successful model for this problem is based on the mention-pair structure. The goal is to identify clusters of noun phrases that refer to the same real-world entity. For each document (e.g. a news article), the possible mentions (NPs, pronouns) are identified. Feature extraction then builds "samples" in the form of all possible pairs of these (sometimes pairs that are obviously not coreferent, e.g. he / she, are filtered out, but this is disputable). Evaluating such systems requires averaging over document-level scores, because the document-level scores typically used do not distribute over averaging. [1]

- Hyphenation: this is just something I'm currently working on, but the same situation might occur more often. Documents are words, and samples are the positions between letters within each word; the label says whether it is correct to insert a hyphen at that position or not. As things stand, sklearn can easily report how many hyphen positions were correctly identified over the whole available dictionary. However, a more realistic score would be: how many words were hyphenated fully correctly? (A sketch of such a word-level score is in the P.S. below.) The distinction matters because a sequence model can be smart enough to know that inserting three hyphens in a row, e.g. a pattern like ...xx-x-x-xx..., is infrequent, thanks to its global, document-level awareness. It would be interesting to see how much this gains over a local SVM classifier that only sees one position at a time.

Objects that should be aware of this:

- score functions / metrics,
- some transformers,
- resamplers / shufflers: we either want to keep documents together, or make sure that document membership is not lost when reshuffling (also sketched in the P.S.).

Best,
Vlad

------------------
Vlad N.
http://vene.ro
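P.S. Rough sketches of what some of this could look like, assuming the hypothetical `doc_id` membership array from the beginning of this message (all names are illustrative, nothing here is an existing sklearn API). First, grouping per-patch classifier scores within each image by taking the max:

    import numpy as np

    def max_pool_scores(scores, doc_id):
        # For each document (image), reduce the per-sample (per-patch)
        # scores to a single value by taking the max.
        docs = np.unique(doc_id)
        return np.array([scores[doc_id == d].max() for d in docs])

    scores = np.array([0.1, 0.9, 0.3, 0.2, 0.4, 0.8, 0.5])
    doc_id = np.array([0, 0, 0, 1, 1, 1, 1])
    print(max_pool_scores(scores, doc_id))    # [ 0.9  0.8]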
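Second, a word-level score for the hyphenation example: the fraction of documents (words) whose samples (positions) are all predicted correctly:

    import numpy as np

    def document_accuracy(y_true, y_pred, doc_id):
        # A document counts as correct only if *every* one of its
        # samples is predicted correctly, i.e. the word is fully
        # correctly hyphenated.
        correct = (y_true == y_pred)
        docs = np.unique(doc_id)
        return np.mean([correct[doc_id == d].all() for d in docs])

    y_true = np.array([0, 1, 0, 0, 0, 1, 0])
    y_pred = np.array([0, 1, 0, 0, 1, 1, 0])
    doc_id = np.array([0, 0, 0, 1, 1, 1, 1])
    print(document_accuracy(y_true, y_pred, doc_id))    # 0.5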
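And finally, a reshuffling helper that permutes whole documents rather than individual samples, so that document membership survives the shuffle:

    import numpy as np

    def shuffle_documents(X, doc_id, random_state=0):
        # Permute the documents, then emit their samples contiguously,
        # keeping each document's samples together.
        rng = np.random.RandomState(random_state)
        docs = rng.permutation(np.unique(doc_id))
        order = np.concatenate([np.where(doc_id == d)[0] for d in docs])
        return X[order], doc_id[order]

    X = np.arange(14.0).reshape(7, 2)
    doc_id = np.array([0, 0, 0, 1, 1, 1, 1])
    X_shuf, doc_shuf = shuffle_documents(X, doc_id)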