For large enough models (e.g. random forests or gradient-boosted tree ensembles), I would definitely recommend arbitrary integer coding for the categorical variables.
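As a rough sketch of what such a comparison could look like: the snippet below cross-validates a random forest with integer-coded categories (OrdinalEncoder) against the same forest with one-hot encoded categories (OneHotEncoder). One-hot encoding as the alternative, the synthetic data, the column values and the hyperparameters are all assumptions made up for illustration, not taken from the original discussion.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic data (assumption, for illustration only): two categorical
# columns, with the label determined by the category of the first one.
rng = np.random.RandomState(0)
colors = rng.choice(["red", "green", "blue", "black"], size=500)
shapes = rng.choice(["circle", "square", "triangle"], size=500)
X = np.column_stack([colors, shapes])
y = np.isin(colors, ["red", "blue"]).astype(int)

# The two encodings to compare.
encoders = {
    "integer coding": OrdinalEncoder(),
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
}

scores = {}
for name, encoder in encoders.items():
    # Putting the encoder inside the pipeline means it is re-fit on the
    # training part of each cross-validation split.
    model = make_pipeline(
        encoder,
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

On real data the two scores (and the fit times) may differ, which is the point of running the comparison rather than assuming the outcome.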
Try both, use cross-validation and see for yourself.

-- Olivier