I have a multi-class, multi-label decision tree trained with scikit-learn's DecisionTreeClassifier. The input looks as follows:

    X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
    Y = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]

I used MultiLabelBinarizer to convert Y into:

    [[1 1 1]
     [1 1 1]
     [1 1 1]
     [1 1 0]
     [1 1 0]
     [1 0 0]
     [1 0 0]]

After training, tree_.value looks as follows:

    array([[[7., 0.], [2., 5.], [4., 3.]],
           [[3., 0.], [0., 3.], [0., 3.]],
           [[4., 0.], [2., 2.], [4., 0.]],
           [[2., 0.], [0., 2.], [2., 0.]],
           [[2., 0.], [2., 0.], [2., 0.]]])

I had the impression that, for each node, the value array contains a list of lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]], where n_i is the number of samples disagreeing with class i and y_i is the number of samples agreeing with class i. But after seeing this output, that does not make sense.

For example, the root node has the value [[7, 0], [2, 5], [4, 3]]. According to my interpretation, this would mean that 7 samples disagree with class 1 and none agree; 2 disagree with class 2 and 5 agree with it; and 4 disagree with class 3 and 3 agree with it. According to the input dataset this is wrong, at least for class 1, which all 7 samples have.

Could someone please help me understand the semantics of tree_.value for multi-label decision trees?

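
For reference, here is a minimal sketch that should reproduce the setup above. The variable names (`Y_bin`, `clf`, `root`) and `random_state=0` are my own choices for illustration and are not essential. It also prints `clf.classes_`, the per-output class labels, since that is the order the columns of each row in `tree_.value` follow. One assumption to flag: depending on the scikit-learn version, `tree_.value` may hold per-node fractions instead of sample counts, so the sketch rescales the root node in that case.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
Y = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]

# One binary indicator column per label (1, 2, 3).
Y_bin = MultiLabelBinarizer().fit_transform(Y)
print(Y_bin)

# random_state is fixed only to make the run repeatable (my assumption).
clf = DecisionTreeClassifier(random_state=0).fit(X, Y_bin)

# For multi-output targets, classes_ is a list with one array per output.
print(clf.classes_)

# Shape is (n_nodes, n_outputs, max_n_classes).
value = clf.tree_.value
print(value.shape)

# Root node. Assumption: if each row sums to 1, this version stores
# fractions, so rescale by the node's weighted sample count.
root = value[0]
if np.allclose(root.sum(axis=1), 1.0):
    root = root * clf.tree_.weighted_n_node_samples[0]
print(root)
```

Comparing the printed `clf.classes_` with each row of `root` may help in checking how the columns line up with the labels of each output.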
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn