Here's a quick mockup that I used for my syllables. This e-mail contains a write-up of my observations, followed by my replies to Andy's questions.
https://gist.github.com/4005112

I marked the groups using an indicator array. This way, if you want to shuffle the dataset, you can just apply the same permutation to the groups vector and the score function will still work. Unfortunately, that implementation is O(n_groups * n_samples), which in my case approaches n_samples ** 2, making it infeasible. I quickly hacked a function that computes the score in one pass, relying on the contiguity assumption, just so I could get a result. It's less elegant, but I will come back to it. With additional memory, it can be implemented in one pass for the shuffled case as well.

Note that this generalizes easily: replace np.all with something like at_least_k, or with some weighted average (as in the metrics used in coreference resolution [1], which I eventually want to implement). The specific aggregator function can be passed as a parameter. A general form for such a score would actually take two aggregator functions, one at group level and one at global level, but I can't think of any use case where the global one would be anything other than the mean.

[1] http://journals.cambridge.org/action/displayAbstract?aid=8376543

On Oct 31, 2012, at 13:19, Andreas Mueller <amuel...@ais.uni-bonn.de> wrote:

> Hi Vlad.
> This is definitely a good question. I have that often when representing
> an image as bags of keypoints / features.

> Why is it not a good solution to have X as being a list of arrays / lists?

Because if you feed such a structure into an estimator that uses mini-batches, you would want the data shuffled first, but the list of lists forces contiguity of classes. If your groups (documents) are small compared to the batch size, you could maybe split at group level, but it depends on what independence assumptions you want.

> Which algorithms do you want to use such samples in?

Good question.
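Roughly, what I have in mind looks like the sketch below (this is only an illustration, not the gist's actual code; `grouped_score` and `at_least_k` are placeholder names). A stable argsort on the indicator array gathers each group's samples, so no contiguity assumption is needed:

```python
import numpy as np

def grouped_score(y_true, y_pred, groups, group_agg=np.all, global_agg=np.mean):
    """Aggregate per-sample correctness over groups (documents).

    groups is an indicator array: groups[i] is the document that sample i
    belongs to, so applying the same permutation to all three arrays
    leaves the score unchanged.
    """
    correct = np.asarray(y_true) == np.asarray(y_pred)
    groups = np.asarray(groups)
    # A stable sort gathers each group's samples without assuming the
    # input is contiguous; the cost is the sort, not n_groups * n_samples.
    order = np.argsort(groups, kind="stable")
    boundaries = np.flatnonzero(np.diff(groups[order])) + 1
    per_group = [group_agg(chunk) for chunk in np.split(correct[order], boundaries)]
    return global_agg(per_group)

def at_least_k(k):
    # Alternative group-level aggregator: a group counts as correct
    # if at least k of its samples are predicted correctly.
    return lambda chunk: chunk.sum() >= k

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
groups = np.array([0, 0, 1, 1, 2])  # three documents

grouped_score(y_true, y_pred, groups)                 # fraction of fully correct docs
grouped_score(y_true, y_pred, groups, at_least_k(1))  # docs with >= 1 correct sample
```

With np.all at group level this is exactly the "how many words were fully hyphenated correctly" score; swapping in a weighted average there would cover the coreference-style metrics.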
For now I'm only interested in scores, but probably all objects should be aware of such structure and let it pass through when they do their job, so that the grouping can be fed in with the dataset and used at the end, during scoring.

> The text feature extraction sort of deals with this by using a list, right?

I'm not sure what you mean by this.

> Cheers,
> Andy
>
> On 10/31/2012 01:13 PM, Vlad Niculae wrote:
>> Hello,
>>
>> It seems I have reached again the need for something that became
>> apparent when working with image patches last summer. Sometimes we
>> don't have a 1-to-1 correspondence between samples (rows in X) and
>> the actual documents we are interested in scoring over. Instead, each
>> document consists of a (different) number of samples.
>>
>> This can be implemented as an extra masking array that says,
>> for each sample, which document it belongs to; by grouping `y` into
>> a list of lists (cumbersome, and it fails for the unsupervised case); or
>> by more clever / space-efficient methods.
>>
>> The question is: did you need this? If so, how did you implement it?
>> Are you aware of other general-purpose libraries that provide such
>> an API? Because I'm not. The next question is: what can we do about it?
>>
>> Example applications:
>>
>> - Image classification:
>> First, from each image we extract k-by-k image patches, then we
>> transform them by sparse coding, and finally we feed them into a
>> classifier. This classifies each patch individually, but in the end
>> we would want to group the results within each image and compute
>> "local" scores, or just take the max, for example.
>>
>> If using something like CIFAR, where images have the same size, the
>> problem is simplified because each image will be split into the exact
>> same number of patches. If images have different shapes, or in the
>> next examples, this assumption cannot be made.
>>
>> - Coreference resolution:
>> A successful model for this problem is based on the mention-pair
>> structure.
>> The goal is to identify clusters of noun phrases that
>> refer to the same real-world entity. For each document (e.g. a news
>> article), the possible mentions (NPs, pronouns) are identified.
>> The feature extraction then builds "samples" in the form of all
>> possible pairs of these (sometimes we filter out pairs that are
>> obviously not coreferent, e.g. he / she, but this is debatable).
>>
>> Evaluating such systems requires averaging over document-level
>> scores, because the document-level scores typically used do not
>> distribute over averaging. [1]
>>
>> - Hyphenation:
>> This is just something I'm currently working on, but the same
>> situation might occur more often. Documents are words, and
>> samples are positions between letters within each word.
>> Labels indicate whether it's correct to add a hyphen there or not.
>> In the end, sklearn can easily report how many hyphens were
>> correctly identified over the whole dictionary available.
>> However, a more realistic score would be: how many words were
>> fully hyphenated correctly? This matters because a sequence model
>> can be smart enough to know that it's infrequent to
>> insert three hyphens one after the other (e.g. a pattern like
>> ...xx-x-x-xx...), thanks to its global, document-level awareness.
>> It would be interesting to see how much this gains over a
>> local SVM classifier that only sees one position at a time.
>>
>> Objects that should be aware of this:
>>
>> - score functions / metrics,
>> - some transformers,
>> - resamplers / shufflers: we either want to keep documents together,
>> or make sure that when reshuffling, document membership is not lost.
>>
>> Best,
>> Vlad
>> ------------------
>> Vlad N.
>> http://vene.ro
------------------
Vlad N.
http://vene.ro

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general