Here's a quick mockup that I used for my syllables. This e-mail contains a write-up of my observations, followed by my replies to Andy's questions.
https://gist.github.com/4005112

I marked the groups using an indicator array. This way, if you want to shuffle the dataset, you can just apply the same permutation to the groups vector and the score function will still work. Unfortunately, that implementation is O(n_groups * n_samples), which in my case approaches n_samples ** 2, making it infeasible. I quickly hacked a function that computes the score in one pass, relying on the contiguity assumption, just so I could get a result. It's less elegant, but I will come back to it. With additional memory, it can be implemented in one pass for the shuffled case as well.

Note that this generalizes easily: replace np.all with something like at_least_k, or with some weighted average (as in the metrics used in coreference resolution [1], which I eventually want to implement). The specific aggregator function can be passed as a parameter. A general form for such a score would actually take two aggregator functions, one at group level and one at global level, but I can't think of any use case where the global one would be anything other than the mean.

[1] http://journals.cambridge.org/action/displayAbstract?aid=8376543

On Oct 31, 2012, at 13:19, Andreas Mueller <amuel...@ais.uni-bonn.de> wrote:

> Hi Vlad.
> This is definitely a good question. I have that often when representing
> an image as bags of keypoints / features.

> Why is it not a good solution to have X as being a list of arrays / lists?

Because if you feed such a structure into an estimator that uses mini-batches, you would want the data shuffled first, but the list of lists forces contiguity of classes. If your groups (documents) are small compared to the batch size, you could maybe split at group level, but it depends on what independence assumptions you want.

> Which algorithms do you want to use such samples in?

Good question.
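Roughly, what I have in mind looks like the sketch below (this is only an illustration, not the gist's actual code; `grouped_score` and `at_least_k` are placeholder names). A stable argsort on the indicator array gathers each group's samples, so no contiguity assumption is needed:

```python
import numpy as np

def grouped_score(y_true, y_pred, groups, group_agg=np.all, global_agg=np.mean):
    """Aggregate per-sample correctness over groups (documents).

    groups is an indicator array: groups[i] is the document that sample i
    belongs to, so applying the same permutation to all three arrays
    leaves the score unchanged.
    """
    correct = np.asarray(y_true) == np.asarray(y_pred)
    groups = np.asarray(groups)
    # A stable sort gathers each group's samples without assuming the
    # input is contiguous; the cost is the sort, not n_groups * n_samples.
    order = np.argsort(groups, kind="stable")
    boundaries = np.flatnonzero(np.diff(groups[order])) + 1
    per_group = [group_agg(chunk) for chunk in np.split(correct[order], boundaries)]
    return global_agg(per_group)

def at_least_k(k):
    # Alternative group-level aggregator: a group counts as correct
    # if at least k of its samples are predicted correctly.
    return lambda chunk: chunk.sum() >= k

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
groups = np.array([0, 0, 1, 1, 2])  # three documents

grouped_score(y_true, y_pred, groups)                 # fraction of fully correct docs
grouped_score(y_true, y_pred, groups, at_least_k(1))  # docs with >= 1 correct sample
```

With np.all at group level this is exactly the "how many words were fully hyphenated correctly" score; swapping in a weighted average there would cover the coreference-style metrics.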
For now I'm only interested in scores, but probably all objects should be aware of such structure and let it pass through when they do their job, so that the grouping can be fed in with the dataset and used at the end, during scoring.

> The text feature extraction sort of deals with this by using a list, right?

I'm not sure what you mean by this.

> Cheers,
> Andy
>
> On 10/31/2012 01:13 PM, Vlad Niculae wrote:
>> Hello,
>>
>> It seems I have reached again the need for something that became
>> apparent when working with image patches last summer. Sometimes we
>> don't have a 1-to-1 correspondence between samples (rows in X) and
>> the actual documents we are interested in scoring over. Instead, each
>> document consists of a (different) number of samples.
>>
>> This can be implemented as an extra masking array that says,
>> for each sample, which document it belongs to; by grouping `y` into
>> a list of lists (cumbersome, and it fails for the unsupervised case); or
>> by more clever / space-efficient methods.
>>
>> The question is: did you need this? If so, how did you implement it?
>> Are you aware of other general-purpose libraries that provide such
>> an API? Because I'm not. The next question is: what can we do about it?
>>
>> Example applications:
>>
>> - Image classification:
>> First, from each image we extract k-by-k image patches, then we
>> transform them by sparse coding, and finally we feed them into a
>> classifier. This classifies each patch individually, but in the end
>> we would want to group the results within each image and compute
>> "local" scores, or just take the max, for example.
>>
>> If using something like CIFAR, where images have the same size, the
>> problem is simplified because each image will be split into the exact
>> same number of patches. If images have different shapes, or in the
>> next examples, this assumption cannot be made.
>>
>> - Coreference resolution:
>> A successful model for this problem is based on the mention-pair
>> structure.
>> The goal is to identify clusters of noun phrases that
>> refer to the same real-world entity. For each document (e.g. a news
>> article), the possible mentions (NPs, pronouns) are identified.
>> The feature extraction then builds "samples" in the form of all
>> possible pairs of these (sometimes we filter out pairs that are
>> obviously not coreferent, e.g. he / she, but this is debatable).
>>
>> Evaluating such systems requires averaging over document-level
>> scores, because the document-level scores typically used do not
>> distribute over averaging. [1]
>>
>> - Hyphenation:
>> This is just something I'm currently working on, but the same
>> situation might occur more often. Documents are words, and
>> samples are positions between letters within each word.
>> Labels indicate whether it's correct to add a hyphen there or not.
>> In the end, sklearn can easily report how many hyphens were
>> correctly identified over the whole dictionary available.
>> However, a more realistic score would be: how many words were
>> fully hyphenated correctly? This matters because a sequence model
>> can be smart enough to know that it's infrequent to
>> insert three hyphens one after the other (e.g. a pattern like
>> ...xx-x-x-xx...), thanks to its global, document-level awareness.
>> It would be interesting to see how much this gains over a
>> local SVM classifier that only sees one position at a time.
>>
>> Objects that should be aware of this:
>>
>> - score functions / metrics,
>> - some transformers,
>> - resamplers / shufflers: we either want to keep documents together,
>> or make sure that when reshuffling, document membership is not lost.
>>
>> Best,
>> Vlad
>> ------------------
>> Vlad N.
>> http://vene.ro
------------------
Vlad N.
http://vene.ro

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general