Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

Joel Nothman Sat, 01 Jun 2013 21:45:38 -0700

On Sun, Jun 2, 2013 at 1:35 PM, Mathieu Blondel <[email protected]> wrote:


> Sorry for the late answer. It's hard for me to keep track of all the
> design-related discussions lately.
>

No worries. Thanks for the reply!


> For me, the advantages of the sequences of sequences format are:
> - they are quite natural from a user point of view (although, as you said,
> an array of sets would be technically better)
> - they are easy to construct
>

Yes, easy to construct is important. Do you think arrays of sets are
substantially harder? Would it be appropriate to deprecate sequences of
sequences and replace them with arrays of sets, on the basis that sequences
of sequences can't be easily manipulated with numpy?
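
For illustration, building an array of sets need not be much harder once there is a small helper. This is only a minimal sketch: `to_array_of_sets` is a hypothetical name, not an existing scikit-learn function, and the labels are made up.

```python
import numpy as np


def to_array_of_sets(label_seqs):
    """Pack each sample's labels into a frozenset inside a 1-d object array."""
    y = np.empty(len(label_seqs), dtype=object)
    # Assign element by element so numpy never tries to interpret the
    # input as a 2-d array, even when every sample has the same length.
    for i, labels in enumerate(label_seqs):
        y[i] = frozenset(labels)
    return y


y = to_array_of_sets([[1, 2], [1, 4], [2, 3]])
subset = y[[0, 2]]  # fancy indexing keeps a 1-d array of sets
```

Because the result is a genuine 1-d array, it can be indexed like any other target, which a plain list of lists cannot.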

> Their main disadvantage is that they are slow to process (since they often
> require for loops).
>
> Label indicator matrices represented as a numpy array are easy to process
> but they require linear complexity for access to the labels of a given
> training instance (all non-zero elements of a row).


Linear in what's usually a relatively small constant, though? What datasets
do we have for scikit-learn-appropriate multilabel classification with
hundreds/thousands of labels?
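
To make the cost concrete, here is a sketch (with a made-up indicator matrix) of recovering one sample's labels from a dense array. The scan is over the number of labels, not the number of samples:

```python
import numpy as np

# Hypothetical indicator matrix: 3 samples, 5 possible labels.
Y = np.array([[0, 1, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]])

# Scans all n_labels columns of row 1, regardless of how many are set.
labels_of_sample_1 = np.flatnonzero(Y[1])
```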

> This could be fixed by using the CSR matrix format. The question is how to
> build the matrix? CSR matrices are non-trivial to build by hand and I would
> not recommend them. lil_matrix and dok_matrix formats are usually used for
> incremental matrix building but AFAIK they require you to specify an
> initial shape, which is not very convenient if you're parsing a file
> incrementally.
>
> It's hard to reason about what format is the best since we don't have any
> support for multi-label estimators, except in OneVsRestClassifier.


So this, together with the above, clarifies the use case for sequences of
sequences: they are easy to construct. (They are not really intended to give
more efficient handling through a sparse representation.)
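
That said, if a sparse indicator matrix were wanted, it could be built in one pass over the samples without knowing the shape up front. The following is a rough sketch only: `seqs_to_csr` is a hypothetical helper, and it assumes labels are non-negative integers that can serve directly as column indices.

```python
import numpy as np
from scipy.sparse import csr_matrix


def seqs_to_csr(label_seqs):
    """Build a CSR label indicator matrix in one pass over the samples.

    Assumes labels are non-negative integers used directly as column indices.
    """
    indices = []   # column (label) index of every non-zero entry
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i within indices
    n_samples = 0
    for labels in label_seqs:
        indices.extend(sorted(set(labels)))  # drop duplicate labels
        indptr.append(len(indices))
        n_samples += 1
    # The number of columns is only known once all samples have been seen.
    n_labels = max(indices) + 1 if indices else 0
    data = np.ones(len(indices), dtype=np.int8)
    return csr_matrix(
        (data,
         np.asarray(indices, dtype=np.int32),
         np.asarray(indptr, dtype=np.int32)),
        shape=(n_samples, n_labels),
    )


Y = seqs_to_csr([[1, 2], [1, 4], [2, 3]])
# Labels of sample 1, read without scanning the whole row.
labels_of_sample_1 = Y.indices[Y.indptr[1]:Y.indptr[2]]
```

The two growing lists play the role of the sequence of sequences during parsing, so the user-facing format and the internal format need not be the same thing.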


> And even so, most of the multi-label support in OneVsRestClassifier relies
> on LabelBinarizer.
>

But in a validation situation, y_true is not passed through the
LabelBinarizer (and the slicing for cv is done before binarization).
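
This is where the format can bite. A sketch of what a slice can do to a plain sequence of sequences, using made-up labels and a hypothetical fold:

```python
import numpy as np

y = [[1, 2], [1], [2, 3]]   # ragged: clearly a multilabel target
fold = [0, 2]               # hypothetical indices selected for one fold

# The selected samples happen to have equal length, so numpy builds a
# 2-d integer array that no longer looks like a sequence of sequences.
y_fold = np.asarray([y[i] for i in fold])

# A 1-d object array filled element by element survives the same slice.
y_obj = np.empty(len(y), dtype=object)
for i, labels in enumerate(y):
    y_obj[i] = labels
y_obj_fold = y_obj[fold]
```

Here `y_fold` has shape (2, 2), while `y_obj_fold` is still a 1-d array holding two lists.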


> One concern with the current implementation of the multi-label metrics is
> that it is quite complex, since it supports both label-indicator and
> sequences of sequences. I believe we could simplify the implementation if
> we delegated the conversion from sequences of sequences (or arrays of sets)
> to CSR to LabelBinarizer. Then, internally, the metrics would make their
> computations on the CSR format.
>

From the sounds of things, it would be easier and probably more efficient
to just always convert to dense binarized matrices, unless we have a good
case for requiring sparse handling of labels. In particular, scipy.sparse
does not currently support important operations for metrics: ==, !=, &, |,
^.
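
As a sketch of why those operators matter, here are a few multilabel quantities computed on dense boolean indicator matrices. The data is made up and this is not the scikit-learn implementation, just an illustration of the elementwise operations involved:

```python
import numpy as np

# Hypothetical dense indicator matrices: 3 samples, 5 labels.
y_true = np.array([[0, 1, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 1, 1, 0]], dtype=bool)
y_pred = np.array([[0, 1, 1, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1]], dtype=bool)

# Fraction of (sample, label) cells that disagree.
hamming = (y_true != y_pred).mean()

# Fraction of samples whose whole label set is predicted exactly.
subset_accuracy = (y_true == y_pred).all(axis=1).mean()

# Per-sample Jaccard similarity: |intersection| / |union|.
# Assumes every sample has at least one true or predicted label.
intersection = (y_true & y_pred).sum(axis=1)
union = (y_true | y_pred).sum(axis=1)
jaccard = intersection / union
```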


> This way we would have the following advantages:
> - easy incremental building of the labels, thanks to the support for
> sequences of sequences or arrays of sets
> - simplified implementation of metrics (and future estimators), since the
> handling of sequences of sequences / arrays of sets would be delegated to
> LabelBinarizer
>

Sounds good to me. Only I would like some confirmation on whether
deprecating support for sequences of sequences is sensible.

- Joel


On Wed, May 29, 2013 at 5:17 PM, Joel Nothman
<[email protected]> wrote:

> TL;DR: do we need to support two forms of multilabel targets? sequences of
> sequences may have unexpected behaviour.
>
> With Arnaud Joly's recent implementation of multilabel support in a number
> of metrics, there has been some extensive discussion of multilabel targets
> and their format at
> #1985 <https://github.com/scikit-learn/scikit-learn/issues/1985>
> and #1987 <https://github.com/scikit-learn/scikit-learn/pull/1987>.
>
> Currently, scikit-learn seems to work with targets (y passed to fit; the
> output of predict) of a number of different formats:
> * single regression / binary / multiclass target: a 1d array-like or
> column vector
> * multiple regression / multiclass targets: a 2d array
> * multilabel sequence of sequences: often a list of lists or list of tuples
> * multilabel label indicator matrix: a 2d binary array
>
> Have I missed any?
>
> Some metric functions now handle multiple forms of input. However,
> handling both multilabel types can be a bit messy. Effectively, the
> sequence of sequences representation is a sparse format of the label
> indicator matrix (equivalent to scipy.sparse.lil_matrix). Unless there are
> many classes, it is more efficient to work with a label indicator matrix.
>
> So:
> * What are the use-cases for supporting both forms?
> * Is there a reason that the sequence of sequences form uses ordered
> labels, rather than being an array of sets?
>
> It would seem natural for the sequence of sequences format to be an array
> like other scikit-learn targets. The output
> of sklearn.datasets.make_multilabel_classification is instead a tuple of
> lists. If these lists happen to all be of the same length, calling
> np.asarray() (or np.take, etc.) on the data transforms it into a 2d array,
> now making it ambiguously another type of targets. This is very bad for
> cross-validation, for instance, where a slice is taken. A rectangular
> sequence of lists may also be confused for multiple multiclass targets.
>
> It is possible to make an array of lists, but difficult to expect many
> users to do so reliably without a helper function:
>
> >>> y_tup = ([1,2], [1,4], [2, 3])
> >>> np.array(y_tup)
> array([[1, 2],
>        [1, 4],
>        [2, 3]])
> >>> y = np.empty(len(y_tup), dtype=object)
> >>> y[:] = y_tup
> >>> y
> array([[1, 2], [1, 4], [2, 3]], dtype=object)
>
> Note that because the following happens to work:
> >>> y_tup = ([1, 2], [1])
> >>> np.array(y_tup)
> array([[1, 2], [1]], dtype=object)
> someone writing code to handle these might not realise that it won't work
> in all cases.
>
> It would be much safer if sequence of sequences was disallowed
> (deprecated) and one of:
> * label indicator matrices used alone
> * array of [frozen]sets supported instead (note np.array([set([1,2]),
> set([1,4])]) -> array([set([1, 2]), set([1, 4])], dtype=object))
> * scipy.sparse label indicator matrices supported instead
>
> Any of the above also obviates making sure all samples have unique labels
> before calculating metrics.
>
> This is argued at greater length (with more discussion of using sparse
> matrices and arrays of sets) at
> https://github.com/scikit-learn/scikit-learn/pull/1987#issuecomment-18584806
>
> Cheers,
>
> - Joel

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
