Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

Mathieu Blondel Sat, 01 Jun 2013 20:36:07 -0700

Hi Joel,

Sorry for the late answer. It's hard for me to keep track of all the
design-related discussions lately.


For me, the advantages of the sequences of sequences format are:
- they are quite natural from a user point of view (although, as you said,
an array of sets would be technically better)
- they are easy to construct

Their main disadvantage is that they are slow to process (since they often
require Python for loops).

Label indicator matrices represented as a numpy array are easy to process,
but accessing the labels of a given training instance (all non-zero
elements of a row) takes time linear in the number of classes. This could
be fixed by using the CSR matrix format. The question is how to build the
matrix. CSR matrices are non-trivial to build by hand and I would not
recommend doing so. The lil_matrix and dok_matrix formats are usually used
for incremental matrix building, but AFAIK they require you to specify an
initial shape, which is not very convenient if you're parsing a file
incrementally.
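
For illustration only, here is a rough sketch of what building the CSR
arrays by hand could look like when samples arrive one at a time and the
number of classes is not known up front. `sequences_to_csr` is a
hypothetical helper, not a scikit-learn function, and it assumes that labels
are hashable.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sequences_to_csr(label_sequences):
    """Build a CSR label indicator matrix from an iterable of label sequences.

    Hypothetical helper for illustration; not part of scikit-learn.
    Returns the matrix and the label -> column mapping that was built.
    """
    class_index = {}  # label -> column, assigned the first time a label is seen
    indices = []      # column index of every non-zero entry, row after row
    indptr = [0]      # indices[indptr[i]:indptr[i + 1]] holds the columns of row i
    for labels in label_sequences:
        # set() drops duplicate labels within a single sample
        for label in set(labels):
            indices.append(class_index.setdefault(label, len(class_index)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int8)
    # The shape is only fixed here, after all samples have been read
    shape = (len(indptr) - 1, len(class_index))
    Y = csr_matrix(
        (data, np.asarray(indices, dtype=np.int32), np.asarray(indptr, dtype=np.int32)),
        shape=shape,
    )
    Y.sort_indices()
    return Y, class_index


Y, class_index = sequences_to_csr([[1, 2], [1, 4], [2, 3]])
# Columns of the labels of sample 1, read directly from its slice of `indices`
cols = Y.indices[Y.indptr[1]:Y.indptr[2]]
```

The bookkeeping of `indices` and `indptr` is exactly the part that is easy
to get wrong by hand, which is why it is better hidden behind a single
conversion routine.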

It's hard to reason about what format is the best since we don't have any
support for multi-label estimators, except in OneVsRestClassifier. And even
so, most of the multi-label support in OneVsRestClassifier relies on
LabelBinarizer.

One concern with the current implementation of the multi-label metrics is
that it is quite complex, since it supports both label indicator matrices
and sequences of sequences. I believe we could simplify the implementation
if we delegated the conversion from sequences of sequences (or arrays of
sets) to CSR to LabelBinarizer. Then, internally, the metrics would make
their computations on the CSR format.
This way we would have the following advantages:
- easy incremental building of the labels, thanks to the support for
sequences of sequences or arrays of sets
- simplified implementation of metrics (and future estimators), since the
handling of sequences of sequences / arrays of sets would be delegated to
LabelBinarizer
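
As a minimal sketch of that split, assuming the full list of classes is
known in advance: one function does the conversion and the metric only ever
sees CSR matrices. Both `binarize_to_csr` and `hamming_loss_csr` are
hypothetical names (the first is a stand-in for the conversion step, not the
LabelBinarizer API), and Hamming loss is just a convenient example metric.

```python
import numpy as np
from scipy.sparse import csr_matrix


def binarize_to_csr(label_sequences, classes):
    """Convert sequences (or sets) of labels to a CSR indicator matrix.

    Hypothetical stand-in for the conversion step; not the LabelBinarizer API.
    """
    column = {label: j for j, label in enumerate(classes)}
    indices, indptr = [], [0]
    for labels in label_sequences:
        # set() removes duplicate labels; sorted() keeps column indices ordered
        indices.extend(sorted(column[label] for label in set(labels)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int64)
    return csr_matrix(
        (data, np.asarray(indices, dtype=np.int32), np.asarray(indptr, dtype=np.int32)),
        shape=(len(indptr) - 1, len(classes)),
    )


def hamming_loss_csr(Y_true, Y_pred):
    """Fraction of (sample, label) cells where two CSR indicator matrices differ."""
    n_samples, n_classes = Y_true.shape
    # Entries are 0/1, so the absolute difference is 1 exactly where they disagree
    n_mismatches = abs(Y_true - Y_pred).sum()
    return n_mismatches / (n_samples * n_classes)


classes = [1, 2, 3, 4]
# Lists, tuples and repeated labels are all normalised by the conversion step
Y_true = binarize_to_csr([[1, 2], [1, 4], [2, 3]], classes)
Y_pred = binarize_to_csr([(1, 2), (4,), (2, 3, 3)], classes)
loss = hamming_loss_csr(Y_true, Y_pred)
```

The metric itself contains no format-specific branches; anything to do with
sequences of sequences lives in the conversion function.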

Mathieu

On Wed, May 29, 2013 at 5:17 PM, Joel Nothman
<jnoth...@student.usyd.edu.au> wrote:

> TL;DR: do we need to support two forms of multilabel targets? sequences of
> sequences may have unexpected behaviour.
>
> With Arnaud Joly's recent implementation of multilabel support in a number
> of metrics, there has been some extensive discussion of multilabel targets
> and their format at
> #1985 <https://github.com/scikit-learn/scikit-learn/issues/1985>
> and #1987 <https://github.com/scikit-learn/scikit-learn/pull/1987>.
>
> Currently, scikit-learn seems to work with targets (y passed to fit; the
> output of predict) in a number of different formats:
> * single regression / binary / multiclass target: a 1d array-like or
> column vector
> * multiple regression / multiclass targets: a 2d array
> * multilabel sequence of sequences: often a list of lists or list of tuples
> * multilabel label indicator matrix: a 2d binary array
>
> Have I missed any?
>
> Some metric functions now handle multiple forms of input. However,
> handling both multilabel types can be a bit messy. Effectively, the
> sequence of sequences representation is a sparse format of the label
> indicator matrix (equivalent to scipy.sparse.lil_matrix). Unless there are
> many classes, it is more efficient to work with a label indicator matrix.
>
> So:
> * What are the use-cases for supporting both forms?
> * Is there a reason that the sequence of sequences form uses ordered
> labels, rather than being an array of sets?
>
> It would seem natural for the sequence of sequences format to be an array
> like other scikit-learn targets. The output
> of sklearn.datasets.make_multilabel_classification is instead a tuple of
> lists. If these lists all happen to be of the same length, calling
> np.asarray() (or np.take, etc.) on the data turns it into a 2d array,
> which makes it ambiguous with another type of target. This is very bad for
> cross-validation, for instance, where a slice is taken. A rectangular
> sequence of lists may also be confused with multiple multiclass targets.
>
> It is possible to make an array of lists, but difficult to expect many
> users to do so reliably without a helper function:
>
> >>> y_tup = ([1,2], [1,4], [2, 3])
> >>> np.array(y_tup)
> array([[1, 2],
>        [1, 4],
>        [2, 3]])
> >>> y = np.empty(len(y_tup), dtype=object)
> >>> y[:] = y_tup
> >>> y
> array([[1, 2], [1, 4], [2, 3]], dtype=object)
>
> Note that because the following happens to work:
> >>> y_tup = ([1, 2], [1])
> >>> np.array(y_tup)
> array([[1, 2], [1]], dtype=object)
> someone writing code to handle these might not realise that it won't work
> in all cases.
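
A minimal sketch of such a helper, assuming the goal is always a 1d object
array with one label container per sample. `as_object_array` is a made-up
name, not an existing numpy or scikit-learn function. It assigns element by
element, so it never relies on numpy inferring the shape from the nested
lists. (Newer NumPy releases raise an error for the ragged case unless
dtype=object is passed explicitly, which is one more reason not to depend on
np.array here.)

```python
import numpy as np


def as_object_array(label_sequences):
    """Return a 1d object array holding one label container per sample.

    Hypothetical helper for illustration. Element-wise assignment keeps each
    inner list intact whether or not the input happens to be rectangular.
    """
    y = np.empty(len(label_sequences), dtype=object)
    for i, labels in enumerate(label_sequences):
        y[i] = labels
    return y


# Rectangular input: np.array() would have produced a (3, 2) integer array
rectangular = as_object_array(([1, 2], [1, 4], [2, 3]))
# Ragged input goes through the same code path
ragged = as_object_array(([1, 2], [1]))
# Fancy indexing, as done in cross-validation, keeps the samples as lists
subset = rectangular[[0, 2]]
```
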
>
> It would be much safer if sequence of sequences was disallowed
> (deprecated) and one of:
> * label indicator matrices used alone
> * array of [frozen]sets supported instead (note np.array([set([1,2]),
> set([1,4])]) -> array([set([1, 2]), set([1, 4])], dtype=object))
> * scipy.sparse label indicator matrices supported instead
>
> Any of the above also obviates making sure all samples have unique labels
> before calculating metrics.
>
> This is argued at greater length (with more discussion of using sparse
> matrices and arrays of sets) at
> https://github.com/scikit-learn/scikit-learn/pull/1987#issuecomment-18584806
>
> Cheers,
>
> - Joel
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
