Hi fellows,

I found that some of the generated numbers were wrong, and it took me half
a day of debugging to track down. The cause turned out to be that the loaded
CSV file contains special characters in a few categorical features. These
categorical values are all Unicode text in different languages (Hindi,
Chinese, English), and some contain the special characters ' (single quote),
, (comma), and " (double quote).

They cause the problem in the DictVectorizer step. My understanding is that
this is because a dict is made of key-value pairs, where the comma and the
single quote have special meanings.

Is there a Python function that encodes these characters using only "A-Z0-9"?
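
To make the question concrete, here is a minimal sketch of the kind of
reversible encoding I have in mind. It uses base16 from the standard library,
which only emits 0-9 and A-F, so the result stays inside "A-Z0-9". The
function names encode_label and decode_label are made up for illustration,
and I have not checked how this interacts with DictVectorizer itself.

```python
import base64


def encode_label(value):
    """Encode any Unicode string using only the characters 0-9 and A-F."""
    return base64.b16encode(value.encode("utf-8")).decode("ascii")


def decode_label(encoded):
    """Reverse encode_label and return the original Unicode string."""
    return base64.b16decode(encoded.encode("ascii")).decode("utf-8")


if __name__ == "__main__":
    # Sample values mixing Hindi, Chinese, English, and ' , " characters.
    samples = ["हिन्दी", "中文", "it's, \"quoted\""]
    for original in samples:
        encoded = encode_label(original)
        print(encoded, decode_label(encoded) == original)
```

The downside is that the encoded values are twice as long as the UTF-8 bytes
and are not human-readable until they are decoded again.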

export_graphviz generates a dot file, which is then converted into an SVG
file and viewed in a browser. So we also need a function that decodes these
characters back in a way that is compatible with the major browsers. Does
any such standard function exist?
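
One candidate I am considering for the browser side, as a sketch only: turn
each value into pure ASCII markup, using numeric character references for
non-ASCII text and entities for quotes, since browsers decode those when
rendering. The helper names below are hypothetical, and whether Graphviz
passes such references through to the SVG unchanged is an assumption I have
not verified.

```python
import html


def to_ascii_markup(value):
    """Return an ASCII-only string with quotes and non-ASCII text escaped."""
    # html.escape handles & < > and, with quote=True, both quote characters.
    escaped = html.escape(value, quote=True)
    # xmlcharrefreplace turns every non-ASCII character into &#NNNN;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_ascii_markup(markup):
    """Reverse to_ascii_markup, as a browser would when rendering."""
    return html.unescape(markup)


if __name__ == "__main__":
    value = "中\"a"
    markup = to_ascii_markup(value)
    print(markup)
    print(from_ascii_markup(markup) == value)
```

Note that the comma is left as-is here, so this would only help with the
display step, not with whatever goes wrong earlier in the pipeline.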
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
