Hi,

I have been trying to handle categorical data in decision trees for a
while now, and I am wondering whether I'm missing something or whether
what I want to do is simply not implemented.

My dataset consists of highway sections, and one of the columns is the
type of the highway (e.g. 'motorway', 'primary', 'secondary',
'residential', 'service'). That column is categorical and unordered. One
could argue that it is partially ordered, but since the order is not
total, I believe using it would only complicate my problem, so I prefer
to treat the column as unordered.

I therefore one-hot encode it before fitting the tree (i.e. one binary
column per type of highway). My problem is that this representation
doesn't allow the algorithm to choose non-trivial splits on that column.
For example, the algorithm can choose to put 'motorway' on one side and
all the other types on the other side. But it cannot, in a single split,
separate ('motorway', 'primary', 'secondary') from ('residential',
'service'), which would intuitively make sense (rural vs urban).
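
To make the limitation concrete, here is a minimal sketch in plain Python.
It is not scikit-learn code: the helper names (one_hot,
single_column_splits, is_reachable) are hypothetical, and it assumes that a
split on an indicator column simply separates the rows where that column is
1 from the rows where it is 0.

```python
# Hypothetical illustration; none of these helpers exist in scikit-learn.
CATEGORIES = ["motorway", "primary", "secondary", "residential", "service"]


def one_hot(value, categories=CATEGORIES):
    """Return a list of 0/1 indicators, one per category."""
    return [1 if value == c else 0 for c in categories]


def single_column_splits(categories=CATEGORIES):
    """List the splits reachable by thresholding one indicator column.

    Each split is a (left, right) pair of frozensets, where `right`
    holds the single category whose indicator equals 1.
    """
    splits = []
    for c in categories:
        right = frozenset([c])
        left = frozenset(categories) - right
        splits.append((left, right))
    return splits


def is_reachable(group_a, group_b, categories=CATEGORIES):
    """Return True if {group_a, group_b} matches a single-column split."""
    target = {frozenset(group_a), frozenset(group_b)}
    return any(
        {left, right} == target
        for left, right in single_column_splits(categories)
    )


print(one_hot("primary"))
# One category against the rest: reachable.
print(is_reachable(["motorway"],
                   ["primary", "secondary", "residential", "service"]))
# Three categories against two: not reachable in one split.
print(is_reachable(["motorway", "primary", "secondary"],
                   ["residential", "service"]))
```

With n one-hot columns there are only n candidate groupings per split,
each isolating exactly one category.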

I understand that this behaviour is expected, since only splits on a
single column are considered, but I would like confirmation that what I'm
saying is correct. Moreover, the R library rpart handles categorical data
with no encoding and does allow these splits. It would be very useful to
have this functionality in scikit-learn too. Of course, one workaround is
to add columns representing combinations of multiple categories (e.g. one
column for ('motorway' or 'primary' or 'secondary') and one for
('residential' or 'service')). But to represent all possible two-way
splits exhaustively, 2^(n-1) - 1 columns are needed when there are n
categories, which quickly becomes infeasible.
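
The count above can be checked with a short, self-contained Python sketch.
The function name binary_partitions is made up for this illustration; it
enumerates every two-way grouping of the categories once by always keeping
the first category on the left side.

```python
from itertools import combinations


def binary_partitions(categories):
    """Yield each non-trivial two-way partition of `categories` once.

    Assumes the category labels are unique.
    """
    first, rest = categories[0], categories[1:]
    # Pin the first category to the left side so that a partition and
    # its mirror image are not both produced.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            if right:  # skip the trivial case with an empty right side
                yield left, right


highway_types = ["motorway", "primary", "secondary", "residential", "service"]
partitions = list(binary_partitions(highway_types))
print(len(partitions))  # 15, i.e. 2**(5 - 1) - 1

# Number of indicator columns needed to cover every split.
for n in (5, 10, 20, 30):
    print(n, 2 ** (n - 1) - 1)
```

For the five highway types this gives 15 extra columns, which is still
manageable, but the number roughly doubles with every additional category.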

Thanks in advance for your help,
Amaury
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
