2012/9/4 Marcos Wolff <[email protected]>:
> Hi,
>
> I was wondering if there are plans to implement CHAID techniques for tree
> growing:
> http://en.wikipedia.org/wiki/CHAID
> Gordon Kass's 1980 paper:
> http://ebookbrowse.com/gdoc.php?id=60655988&url=705c072c97190f9f1c59ac51aa72a258
>
> SPSS uses it, and:
> - it is very effective for multi-class classification; it outperforms CART
>   in every situation (this may be an SPSS implementation issue, of course)
> - it is not sensitive to unbalanced datasets (no need for prior
>   probabilities if you have very few positive and very many negative
>   instances in your data)
> - it makes multiple partitions of the data (CART makes only binary
>   partitions)
> - it performs very well on large datasets because of the simplicity of the
>   algorithm (I used it for classification on a dataset of 350,000 rows and
>   200 columns of numeric, ordinal and categorical data)
>
> I searched the scikit-learn GitHub issues for requested features and I
> didn't see any mention of it
> (https://github.com/scikit-learn/scikit-learn/issues/search?q=chaid).
>
> I'm not a very experienced Python developer; I just use Python for data
> cleaning, scraping and running data mining algorithms. I don't know if I am
> experienced enough to develop this feature, but I'll be happy to try, or to
> help the community do it, if you are interested.
>
> What would you recommend I read or do, apart from reading this guide
> (http://scikit-learn.org/stable/developers/index.html#contributing-code),
> if I wanted to contribute this feature?
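[For readers unfamiliar with CHAID: its core split-selection step picks, at each node, the categorical feature whose categories are most strongly associated with the class label according to a chi-squared test, and then splits multiway on that feature. The sketch below illustrates only that step in pure Python; the function names `chi2_statistic` and `best_chaid_split` are made up for this example, and a real CHAID implementation also merges non-significantly-different categories and compares Bonferroni-adjusted p-values rather than raw statistics.]

```python
from collections import Counter

def chi2_statistic(feature, labels):
    """Chi-squared statistic of the feature-by-label contingency table."""
    n = len(labels)
    row = Counter(feature)             # counts per feature category
    col = Counter(labels)              # counts per class label
    cell = Counter(zip(feature, labels))  # joint (category, class) counts
    stat = 0.0
    for cat in row:
        for cls in col:
            expected = row[cat] * col[cls] / n
            observed = cell[(cat, cls)]
            stat += (observed - expected) ** 2 / expected
    return stat

def best_chaid_split(columns, labels):
    """Index of the column most associated with the labels (simplified:
    compares raw statistics, not Bonferroni-adjusted p-values)."""
    return max(range(len(columns)),
               key=lambda j: chi2_statistic(columns[j], labels))

# Column 0 perfectly predicts the label, column 1 is independent of it:
features = [["a", "a", "b", "b"], ["x", "y", "x", "y"]]
labels = [0, 0, 1, 1]
print(best_chaid_split(features, labels))  # -> 0
```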
Sounds interesting. I especially appreciate algorithms that scale to at
least medium-sized, real-world datasets :)

If you feel like contributing an implementation of this, please read the
following guide carefully:
http://scikit-learn.org/dev/developers/index.html

Also have a look at the existing pull requests (even if completely
unrelated):
https://github.com/scikit-learn/scikit-learn/pulls
It's a good way to understand how the contribution / reviewing process
works in practice.

Beware that each contribution will have to be maintained in the future, so
it adds a burden on the developers of the project. This burden can only be
alleviated by extensive documentation, tests, usage examples, and API and
variable names consistent with the rest of the project. Hence don't expect
a fast "code, submit and forget" contribution process.

Also, more specific to this particular algorithm: in scikit-learn,
categorical features are traditionally encoded as one-hot binary features
stored in a scipy.sparse matrix. This data structure is a bit peculiar, so
you might want to have a look at existing implementations of estimators
that are able to deal with it before engaging in the design process.
Dict-like representations (typically used in data mining) can be converted
into sparse data using the DictVectorizer class:
http://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
