TL;DR: Do we need to support two forms of multilabel targets? Sequences of
sequences may behave unexpectedly.

With Arnaud Joly's recent implementation of multilabel support in a number
of metrics, there has been some extensive discussion of multilabel targets
and their format at
#1985 <https://github.com/scikit-learn/scikit-learn/issues/1985>
and #1987 <https://github.com/scikit-learn/scikit-learn/pull/1987>.

Currently, scikit-learn seems to work with targets (the y passed to fit; the
output of predict) in a number of different formats:
* single regression / binary / multiclass target: a 1d array-like or column
vector
* multiple regression / multiclass targets: a 2d array
* multilabel sequence of sequences: often a list of lists or list of tuples
* multilabel label indicator matrix: a 2d binary array

Have I missed any?
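
To make the four formats concrete, here is a small illustration. The values
and the class set (1 to 4) are made up for the example and are not taken from
any scikit-learn dataset.

```python
import numpy as np

# Single target (regression, binary or multiclass): a 1d array-like.
y_single = np.array([0, 2, 1])

# Multiple regression / multiclass targets: a 2d array, one column per output.
y_multi = np.array([[0, 1], [2, 0], [1, 1]])

# Multilabel, sequence of sequences: one variable-length label list per sample.
y_seqs = [[1, 2], [4], []]

# Multilabel, label indicator matrix: a 2d binary array, one column per class.
# The columns stand for the assumed classes 1, 2, 3, 4.
y_indicator = np.array([[1, 1, 0, 0],
                        [0, 0, 0, 1],
                        [0, 0, 0, 0]])
```

Note that y_multi and y_indicator are both plain 2d integer arrays, so only
their values hint at which format is meant.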

Some metric functions now handle multiple forms of input. However, handling
both multilabel types can be a bit messy. Effectively, the sequence of
sequences representation is a sparse format of the label indicator matrix
(equivalent to scipy.sparse.lil_matrix). Unless there are many classes, it
is more efficient to work with a label indicator matrix.
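
As a rough sketch of that equivalence, the two multilabel forms can be
converted into one another as below. The helper names to_indicator and
to_sequences are hypothetical (they are not scikit-learn functions), and the
set of classes is assumed to be known in advance.

```python
import numpy as np

def to_indicator(label_seqs, classes):
    """Dense 0/1 matrix with one row per sample and one column per class."""
    column = {label: j for j, label in enumerate(classes)}
    Y = np.zeros((len(label_seqs), len(classes)), dtype=int)
    for i, labels in enumerate(label_seqs):
        for label in labels:
            Y[i, column[label]] = 1  # a repeated label just sets the same cell
    return Y

def to_sequences(Y, classes):
    """Inverse mapping: list the classes whose column is non-zero in each row."""
    return [[classes[j] for j in np.flatnonzero(row)] for row in Y]

classes = [1, 2, 3, 4]
Y = to_indicator([[1, 2], [1, 4], [2, 3]], classes)
round_trip = to_sequences(Y, classes)
```

The round trip only returns the original lists when each list is sorted and
free of duplicates, which is one way of seeing that the sequence form carries
ordering information the indicator matrix does not.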

So:
* What are the use-cases for supporting both forms?
* Is there a reason that the sequence of sequences form uses ordered
labels, rather than being an array of sets?

It would seem natural for the sequence of sequences format to be an array,
like other scikit-learn targets. The output
of sklearn.datasets.make_multilabel_classification is instead a tuple of
lists. If these lists all happen to be the same length, calling
np.asarray() (or np.take, etc.) on the data turns it into a 2d array,
which is then indistinguishable from another type of target. This is very
bad for cross-validation, for instance, where a slice is taken. A rectangular
sequence of lists may also be mistaken for multiple multiclass targets.

It is possible to make an array of lists, but it is hard to expect many
users to do so reliably without a helper function:

>>> import numpy as np
>>> y_tup = ([1, 2], [1, 4], [2, 3])
>>> np.array(y_tup)
array([[1, 2],
       [1, 4],
       [2, 3]])
>>> y = np.empty(len(y_tup), dtype=object)
>>> y[:] = y_tup
>>> y
array([[1, 2], [1, 4], [2, 3]], dtype=object)
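
A helper of the kind mentioned above might look like the following sketch.
The name as_object_array is hypothetical and nothing like it is claimed to
exist in scikit-learn; it fills the array one element at a time so that NumPy
never gets the chance to build a 2d array.

```python
import numpy as np

def as_object_array(label_seqs):
    """Return a 1d object array holding one label list per sample."""
    y = np.empty(len(label_seqs), dtype=object)
    for i, labels in enumerate(label_seqs):
        y[i] = list(labels)  # element-wise assignment keeps the array 1d
    return y

y = as_object_array(([1, 2], [1, 4], [2, 3]))
subset = y[[0, 2]]  # the kind of indexing a cross-validation split performs
```

Because y is a genuine 1d array, the indexed subset is still a 1d array of
lists, even though all the lists have the same length.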

Note that because the following happens to work:

>>> y_tup = ([1, 2], [1])
>>> np.array(y_tup)
array([[1, 2], [1]], dtype=object)

someone writing code to handle these might not realise that it won't work
in all cases.

It would be much safer if the sequence of sequences format were disallowed
(deprecated) and one of:
* label indicator matrices used alone
* array of [frozen]sets supported instead (note np.array([set([1,2]),
set([1,4])]) -> array([set([1, 2]), set([1, 4])], dtype=object))
* scipy.sparse label indicator matrices supported instead

Any of the above also removes the need to check that each sample's labels
are unique before calculating metrics.
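
For the array-of-sets option, a minimal sketch (as_set_array is again a
hypothetical helper, not an existing scikit-learn function) shows how
duplicate labels and label order stop mattering:

```python
import numpy as np

def as_set_array(label_seqs):
    """Return a 1d object array holding one frozenset of labels per sample."""
    y = np.empty(len(label_seqs), dtype=object)
    for i, labels in enumerate(label_seqs):
        y[i] = frozenset(labels)  # duplicates collapse, order is discarded
    return y

y_true = as_set_array([[1, 2, 2], [1, 4], [3, 2]])
y_pred = as_set_array([[2, 1], [4], [2, 3]])

# Per-sample exact match, with no de-duplication or sorting step needed.
exact = [t == p for t, p in zip(y_true, y_pred)]
```

Set operations such as intersection and union are then available directly
for each pair of samples.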

This is argued at greater length (with more discussion of using sparse
matrices and arrays of sets) at
https://github.com/scikit-learn/scikit-learn/pull/1987#issuecomment-18584806

Cheers,

- Joel