Hi all, I found that some of the generated numbers were wrong, and it took me half a day of debugging to track down why. The cause turned out to be that the loaded CSV file contains special characters in a few categorical features. These categorical values are all Unicode strings in different languages (Hindi, Chinese, English), and some contain special characters such as the comma, single quote, and double quote.
They cause the problem in DictVectorizer, because a dict is made of key-value pairs in which the comma and single quote have special meanings. Is there a Python function that encodes these characters using only "A-Z0-9"? export_graphviz generates a dot file, which is finally converted into an SVG file and viewed in a browser, so we also need a function that decodes these characters back in a way that is compatible with major browsers. Does such a standard function exist?
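To make the question concrete, here is a minimal sketch of the kind of round trip I mean, using the standard library's base64.b16encode and base64.b16decode on the UTF-8 bytes. The output uses only 0-9 and A-F, which is a subset of "A-Z0-9". The helper names encode_label and decode_label and the sample values are made up for illustration, and the decoding here happens in Python, not in the browser.

```python
import base64


def encode_label(text):
    """Encode a Unicode string using only the characters 0-9 and A-F."""
    # Hypothetical helper: UTF-8 bytes -> uppercase hex (Base16) -> ASCII str.
    return base64.b16encode(text.encode("utf-8")).decode("ascii")


def decode_label(encoded):
    """Recover the original Unicode string from encode_label's output."""
    return base64.b16decode(encoded).decode("utf-8")


# Made-up sample values mixing scripts, commas, and quotes.
labels = ["हिन्दी", "中文", "it's, \"quoted\""]
encoded = [encode_label(s) for s in labels]
decoded = [decode_label(e) for e in encoded]
```

The encoded strings could be used as the categorical values before vectorization, and decoded again when writing the labels out. Whether this is the right place to do the decoding for the dot/SVG output is exactly what I am unsure about.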
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general