Dear Yegle,

1) What does your data represent? Are your features numbers, or do they stand for categories?

In the first case, you should try to build your estimator without
encoding anything. In the second case, one-hot encoding your
categorical features might still not be necessary. Try with and
without encoding and compare your results (see the sketch below).
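
For example, here is a minimal sketch with made-up records (the field
names are hypothetical). Note that DictVectorizer passes numeric
values through unchanged and only one-hot encodes the string-valued
features:

    from sklearn.feature_extraction import DictVectorizer

    # Made-up records: 'color' is categorical, 'size' is numerical.
    records = [{'color': 'red', 'size': 2},
               {'color': 'blue', 'size': 3},
               {'color': 'red', 'size': 1}]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(records)
    print(vec.get_feature_names())
    # ['color=blue', 'color=red', 'size'] -- 'size' passed through as-is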

2) You should evaluate the accuracy of your model on an independent
test set before looking at the tree. If it doesn't perform well, then
it might not be worth looking at the tree at all.
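
A minimal sketch, assuming `X` and `y` hold your (encoded) features
and labels:

    from sklearn.cross_validation import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Hold out an independent test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on unseen data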

3) When you one-hot encode a feature with n values, you actually add
n new binary features to your dataset. The decision tree, however, is
agnostic of that: it doesn't know you one-hot encoded your features.
So basically, you have to un-vectorize the representation yourself.

> What I expected was each node marked with `FEATURE_1 == VALUE_1`, instead of
> `X[1] <= 0.5`

If you know that, in your vectorized representation, X[1] is the
binary encoding for VALUE_1 of FEATURE_1, then `X[1] <= 0.5` is
actually equivalent to `FEATURE_1 != VALUE_1`. But I agree this is not
very handy. You have to do the mapping yourself; see the sketch below.
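
One way to do that mapping, assuming `vec` is the fitted
`DictVectorizer` and `clf` the fitted tree from your script:

    from sklearn.tree import export_graphviz

    # DictVectorizer remembers the column order: feature_names[i] is
    # what X[i] refers to in the plot, e.g. 'FEATURE_1=VALUE_1'.
    feature_names = vec.get_feature_names()

    # export_graphviz can label the nodes for you, so a split reads
    # 'FEATURE_1=VALUE_1 <= 0.5' (left branch: FEATURE_1 != VALUE_1).
    export_graphviz(clf, out_file='tree.dot', feature_names=feature_names)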

4) Consider building a random forest if you want better performance.
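
A minimal sketch, reusing the train/test split from above (if your
feature matrix is sparse, you may need `X_train.toarray()` first):

    from sklearn.ensemble import RandomForestClassifier

    # An ensemble of randomized trees usually generalizes better
    # than a single tree, at the cost of interpretability.
    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))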

Hope this helps,

Gilles



On 12 September 2013 03:38, yegle <cnye...@gmail.com> wrote:
> Hi list,
>
> I'm a beginner in Machine Learning and trying to write a classifier using
> training set containing categorical values.
>
> From the document [1] I learned that I need to encode (vectorize) my
> categorical features so that the classifier can learn from them. So I
> used `DictVectorizer` to do this.
>
> The code I'm using: http://pastie.org/8318625
>
> But the resulting graph of the decision tree doesn't make much sense to me.
> What I expected was each node marked with `FEATURE_1 == VALUE_1`, instead of
> `X[1] <= 0.5`
>
> So here are my questions:
>
> 1. Am I handling features with categorical values correctly?
> 2. If the answer is yes, is it possible to `un-vectorize` the
> final tree graph so that I don't need to know that `X[1]` and `X[2]`
> together represent a feature?
>
> IMHO WEKA handles categorical values much better than scikit-learn:
> I don't need to vectorize the training set myself, and the graph makes
> more sense to a beginner.
>
>
>
> [1]:
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
>
> --
> yegle
> http://about.me/yegle
