Hello,

It seems I have again reached the need for something that first became
apparent when working with image patches last summer. Sometimes we
don't have a one-to-one correspondence between samples (rows in X) and
the actual documents we are interested in scoring over. Instead, each
document consists of a different number of samples.

This can be implemented as an extra grouping array that says, for each
sample, which document it belongs to; by grouping `y` into a list of
lists (cumbersome, and it breaks in the unsupervised case); or by more
clever / space-efficient methods.
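
For concreteness, here is a minimal sketch of the first option,
assuming a plain integer `groups` array aligned with the rows of X
(the names are hypothetical, nothing like this exists in sklearn yet):

    import numpy as np

    def group_scores(y_true, y_pred, groups, metric):
        """Apply `metric` separately to the samples of each document.

        groups[i] says which document sample i belongs to; a
        document's samples need not be contiguous.
        """
        y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
        return np.array([metric(y_true[groups == g], y_pred[groups == g])
                         for g in np.unique(groups)])

    # toy example: 5 samples spread over 2 documents
    y_true = np.array([1, 0, 1, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0])
    groups = np.array([0, 0, 1, 1, 1])
    accuracy = lambda t, p: np.mean(t == p)
    print(group_scores(y_true, y_pred, groups, accuracy))         # [1.0, 0.667]
    print(group_scores(y_true, y_pred, groups, accuracy).mean())  # macro-average

The flat array costs O(n_samples) memory, places no constraints on
`y`, and therefore works in the unsupervised case as well.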

The question is: have you needed this? If so, how did you implement
it? Are you aware of other general-purpose libraries that provide such
an API? Because I'm not. The next question is: what can we do about it?

Example applications:

  - Image classification:
    First, from each image we extract k-by-k patches, then we
    transform them by sparse coding, and finally we feed them into a
    classifier. This classifies each patch individually, but in the
    end we want to group the results within each image and compute
    "local" (per-image) scores, or just take the max, for example.

    With something like CIFAR, where all images have the same size,
    the problem is simpler because every image is split into exactly
    the same number of patches. If images have different shapes, or
    in the next two examples, this assumption cannot be made.

  - Coreference resolution:
    A successful model for this problem is based on the mention-pair
    structure. The goal is to identify clusters of noun phrases that
    refer to the same real-world entity. For each document (e.g. a
    news article), the possible mentions (NPs, pronouns) are
    identified. Feature extraction then builds "samples" in the form
    of all possible pairs of these (sometimes pairs that are obviously
    not coreferent, e.g. he / she, are filtered out, but this is
    debatable).

    Evaluating such systems requires averaging document-level scores,
    because the metrics typically used do not distribute over
    averaging. [1]

  - Hyphenation:
    This is just something I'm currently working on, but the same
    situation probably comes up elsewhere. Documents are words, and
    samples are the positions between letters within each word;
    labels say whether it is correct to insert a hyphen there.
    sklearn can easily report how many hyphen positions were
    correctly identified over the whole available dictionary.
    A more realistic score, however, is: how many words were
    hyphenated entirely correctly? The distinction matters because a
    sequence model can be smart enough to know that three hyphens in
    a row, as in the pattern ...xx-x-x-xx..., are infrequent, thanks
    to its global, document-level awareness. It would be interesting
    to see how much this buys over a local SVM classifier that only
    sees one position at a time. (A sketch of such a word-level score
    follows right after this list.)
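
To make the hyphenation scoring concrete: with the hypothetical
`groups` array from the sketch above, a word counts as correct only
if every one of its positions is predicted correctly:

    import numpy as np

    def word_level_accuracy(y_true, y_pred, groups):
        """Fraction of words whose positions are all predicted correctly."""
        y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
        return np.mean([np.all(y_true[groups == g] == y_pred[groups == g])
                        for g in np.unique(groups)])

    # two words, "co-ver" and "ham-mer", as per-position labels
    groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
    y_true = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0])
    y_pred = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0])
    # position-level accuracy is 7/9, but only "co-ver" is fully
    # correct, so the word-level score is 1/2:
    print(word_level_accuracy(y_true, y_pred, groups))  # 0.5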

Objects that should be aware of this:

  - score functions / metrics,
  - some transformers,
  - resamplers / shufflers: we either want to keep documents together,
    or make sure that, when reshuffling, document membership is not
    lost (a sketch follows below).
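
For the resampler case, again with the hypothetical `groups` array:
shuffle at the document level and carry each document's samples along
with it, so membership is never lost.

    import numpy as np

    def group_shuffle(X, y, groups, random_state=0):
        """Shuffle whole documents; a document's samples stay together
        (and come out contiguous)."""
        rng = np.random.RandomState(random_state)
        order = rng.permutation(np.unique(groups))
        idx = np.concatenate([np.where(groups == g)[0] for g in order])
        return X[idx], y[idx], groups[idx]

    X = np.arange(10).reshape(5, 2)
    y = np.array([1, 0, 1, 1, 0])
    groups = np.array([0, 0, 1, 1, 1])
    Xs, ys, gs = group_shuffle(X, y, groups)
    print(gs)  # e.g. [1 1 1 0 0]: membership travels with the samples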


Best,
Vlad
------------------
Vlad N.
http://vene.ro



