Hi Vlad.
This is definitely a good question. I run into this often when representing
an image as a bag of keypoints / features.
Why is it not a good solution to have X be a list of arrays / lists?
Which algorithms do you want to use such samples in?
The text feature extraction sort of deals with this by using a list, right?
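Something along these lines, just an untested sketch (`X_docs` is an
illustrative name):

    import numpy as np

    # hypothetical: one (n_samples_i, n_features) array per document
    X_docs = [np.random.rand(3, 5), np.random.rand(7, 5)]

    # flatten to the usual 2d array when an estimator needs one,
    # remembering which rows came from which document
    X = np.vstack(X_docs)
    groups = np.repeat(np.arange(len(X_docs)), [len(d) for d in X_docs])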

Cheers,
Andy

On 10/31/2012 01:13 PM, Vlad Niculae wrote:
> Hello,
>
> It seems I have once again run into the need for something that
> became apparent when working with image patches last summer.
> Sometimes we don't have a 1-to-1 correspondence between samples
> (rows in X) and the actual documents we are interested in scoring.
> Instead, each document consists of a different number of samples.
>
> This can be implemented as an extra masking array that says, for
> each sample, which document it belongs to; by grouping `y` into
> a list of lists (cumbersome, and it fails in the unsupervised
> case); or by more clever / space-efficient methods.
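>
> A rough, untested sketch of the masking-array idea (`groups` and
> `y_pred` are illustrative names; `y_pred` stands in for the output
> of some classifier's predict):
>
>     import numpy as np
>
>     # one integer per sample saying which document it belongs to
>     groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
>     # per-sample predictions, e.g. from clf.predict(X)
>     y_pred = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0])
>
>     # regroup the flat per-sample predictions by document
>     doc_pred = [y_pred[groups == d] for d in np.unique(groups)]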
>
> The question is: did you need this? If so, how did you implement it?
> Are you aware of other general-purpose libraries that provide such
> an API? I am not. The next question is: what can we do about it?
>
> Example applications:
>
>    - Image classification:
>      first, from each image we extract k-by-k image patches, then we
>      transform them by sparse coding, and finally we feed them into a
>      classifier. This classifies each patch individually, but in the
>      end we want to group the results within each image and compute
>      "local" (per-image) scores, or just take the max, for example.
>
>      When using something like CIFAR, where images have the same
>      size, the problem is simplified because each image is split into
>      the exact same number of patches. If images have different
>      shapes, or in the examples below, this assumption cannot be made.
>
>    - Coreference resolution:
>      A successful model for this problem is based on the mention-pair
>      structure. The goal is to identify clusters of noun phrases that
>      refer to the same real-world entity. For each document (e.g. a
>      news article), the possible mentions (NPs, pronouns) are
>      identified. The feature extraction then builds "samples" in the
>      form of all possible pairs of these mentions (see the first
>      sketch after this list); sometimes we filter out pairs that are
>      obviously not coreferent, e.g. he / she, but this is disputable.
>
>      Evaluating such systems requires averaging document-level
>      scores, because the scores typically used do not distribute
>      over averaging. [1]
>
>    - Hyphenation:
>      This is just something I'm currently working on, but the same
>      situation might occur more often. Documents are words, and
>      samples are positions between letters within each word. Labels
>      indicate whether it is correct to insert a hyphen at that
>      position. In the end, sklearn can easily report how many hyphens
>      were correctly identified over the whole available dictionary.
>      However, a more realistic score would be: how many words were
>      fully hyphenated correctly (see the second sketch after this
>      list)? This is because a sequence model can be smart enough to
>      know that it is rare to insert three hyphens in a row, as in the
>      pattern ...xx-x-x-xx..., thanks to its global, document-level
>      awareness. It would be interesting to see how much this brings
>      over a local SVM classifier that only sees one position at a
>      time.
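>
> A tiny sketch of the mention-pair construction from the coreference
> example (the mentions themselves are made up):
>
>     from itertools import combinations
>
>     # hypothetical mentions extracted from one document
>     mentions = ["Obama", "he", "the president", "Washington"]
>
>     # one sample per candidate mention pair
>     pairs = list(combinations(mentions, 2))
>     # -> [('Obama', 'he'), ('Obama', 'the president'), ...]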
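>
> And a sketch of a document-level (here, word-level) score for the
> hyphenation example, assuming the `groups` convention from above:
>
>     import numpy as np
>
>     groups = np.array([0, 0, 0, 1, 1])   # word id per position
>     y_true = np.array([1, 0, 0, 0, 1])
>     y_pred = np.array([1, 0, 1, 0, 1])
>
>     # a word counts as correct only if every position is correct
>     word_ok = [np.array_equal(y_true[groups == g],
>                               y_pred[groups == g])
>                for g in np.unique(groups)]
>     word_accuracy = np.mean(word_ok)      # 0.5 here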
>
> Objects that should be aware of this:
>
>    - score functions / metrics,
>    - some transformers,
>    - resamplers / shufflers: we either want to keep documents together,
>      or make sure that document membership is not lost when
>      reshuffling (a sketch follows below).
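>
> A minimal sketch of group-aware shuffling, again with an
> illustrative `groups` array:
>
>     import numpy as np
>
>     rng = np.random.RandomState(0)
>     groups = np.array([0, 0, 1, 1, 1, 2])
>
>     # permute documents, not individual samples
>     doc_order = rng.permutation(np.unique(groups))
>     idx = np.concatenate([np.where(groups == d)[0]
>                           for d in doc_order])
>     # X[idx], y[idx] and groups[idx] keep each document contiguous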
>
>
> Best,
> Vlad
> ------------------
> Vlad N.
> http://vene.ro
>

