Hi Adrin,

Thanks for the clarification. Is there a right way of letting DecisionTreeClassifier know that the first column can take either 0 or 1, but in the current dataset we are only using 0?
For example, we can let MultiLabelBinarizer know that we have three classes by instantiating it like this: MultiLabelBinarizer(classes=[1, 2, 3]). I tried class_weight=[{0: 1, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 1}], but that doesn't work.

Thanks,
Pranav

On Mon, Oct 8, 2018 at 2:32 PM Adrin <adrin.jal...@gmail.com> wrote:
> Hi Pranav,
>
> The reason you're getting that output is that your first column has a
> single value (1), and that becomes your "first" class, hence your first
> value in the rows you're interpreting.
>
> To understand it better, you can try to check this code:
>
> >>> from sklearn.preprocessing import MultiLabelBinarizer
> >>> from sklearn.tree import DecisionTreeClassifier
> >>>
> >>> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
> >>> Y = [[2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]
> >>>
> >>> y = MultiLabelBinarizer().fit_transform(Y) + 40
> >>> y[0, 1] = 0
> >>>
> >>> clf = DecisionTreeClassifier().fit(X, y)
> >>> print(clf.tree_.value)
> [[[1. 6. 0.]
>   [1. 2. 4.]
>   [4. 3. 0.]]
>
>  [[1. 2. 0.]
>   [1. 0. 2.]
>   [0. 3. 0.]]
>
>  [[0. 2. 0.]
>   [0. 0. 2.]
>   [0. 2. 0.]]
>
>  [[1. 0. 0.]
>   [1. 0. 0.]
>   [0. 1. 0.]]
>
>  [[0. 4. 0.]
>   [0. 2. 2.]
>   [4. 0. 0.]]
>
>  [[0. 2. 0.]
>   [0. 0. 2.]
>   [2. 0. 0.]]
>
>  [[0. 2. 0.]
>   [0. 2. 0.]
>   [2. 0. 0.]]]
>
> On Mon, 8 Oct 2018 at 20:53 Pranav Ashok <pranavas...@gmail.com> wrote:
>> I have a multi-class multi-label decision tree learnt using the
>> DecisionTreeClassifier class.
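[A possible workaround, sketched here as an assumption rather than an official API: DecisionTreeClassifier infers each output's classes from the values actually present in y, so one trick is to append a dummy row that carries the missing value and give it sample_weight 0, so it registers the class without influencing the fit.]

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
Y = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]

y = MultiLabelBinarizer(classes=[1, 2, 3]).fit_transform(Y)

# Append one dummy sample in which every label is absent (all-zero row),
# and give it zero weight so it does not influence the learned splits.
X_aug = np.vstack([X, [[0, 0]]])
y_aug = np.vstack([y, [[0, 0, 0]]])
w = np.r_[np.ones(len(X)), 0.0]

clf = DecisionTreeClassifier().fit(X_aug, y_aug, sample_weight=w)
print(clf.classes_)  # each output now knows about both 0 and 1
```

This is a sketch, not a built-in mechanism; whether a zero-weight dummy row is acceptable depends on your use case.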
>> The input looks as follows:
>>
>> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
>> Y = [[1,2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]]
>>
>> I have used MultiLabelBinarizer to convert Y into
>>
>> [[1 1 1]
>>  [1 1 1]
>>  [1 1 1]
>>  [1 1 0]
>>  [1 1 0]
>>  [1 0 0]
>>  [1 0 0]]
>>
>> After training, tree_.value looks as follows:
>>
>> array([[[7., 0.],
>>         [2., 5.],
>>         [4., 3.]],
>>
>>        [[3., 0.],
>>         [0., 3.],
>>         [0., 3.]],
>>
>>        [[4., 0.],
>>         [2., 2.],
>>         [4., 0.]],
>>
>>        [[2., 0.],
>>         [0., 2.],
>>         [2., 0.]],
>>
>>        [[2., 0.],
>>         [2., 0.],
>>         [2., 0.]]])
>>
>> I had the impression that the value array contains, for each node, a list of
>> lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]] such that n_i is the number of
>> samples disagreeing with class i and y_i is the number of samples agreeing
>> with class i. But after seeing this output, it does not make sense.
>>
>> For example, the root node has the value [[7, 0], [2, 5], [4, 3]]. According
>> to my interpretation, this would mean 7 samples disagree with class 1;
>> 2 disagree with class 2 and 5 agree with class 2; 4 disagree with class 3
>> and 3 agree with class 3 -- which, according to the input dataset, is wrong.
>>
>> Could someone please help me understand the semantics of tree_.value for
>> multi-label decision trees?
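[For reference, the behaviour Adrin describes can be checked directly on the original data: for a multi-output tree, clf.classes_ is a list with one array per output, each row of tree_.value is ordered by that output's classes_, and outputs with fewer classes are zero-padded to the widest output. A minimal sketch (note: in recent scikit-learn versions tree_.value may store per-node fractions rather than raw counts, but the class ordering and padding are the same):]

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
Y = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]

y = MultiLabelBinarizer().fit_transform(Y)
clf = DecisionTreeClassifier().fit(X, y)

# The first output only ever saw the value 1, so it has a single class;
# its row in tree_.value therefore holds [count_of_1, 0-padding], which
# explains the "[7, 0]" Pranav interpreted as 7 disagreeing samples.
print(clf.classes_)            # [array([1]), array([0, 1]), array([0, 1])]
print(clf.n_classes_)          # [1, 2, 2]
print(clf.tree_.value.shape)   # (n_nodes, n_outputs=3, max_n_classes=2)
```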
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn