2012/9/4 Marcos Wolff <[email protected]>:
> Hi,
>
> I was wondering if there are plans on implementing CHAID techniques for tree
> growing
>
> http://en.wikipedia.org/wiki/CHAID
> Gordon Kass 1980's paper:
> http://ebookbrowse.com/gdoc.php?id=60655988&url=705c072c97190f9f1c59ac51aa72a258.
>
> SPSS uses it and:
> -it's very effective for multi-class classification; it outperforms CART in
> every situation (this may be an SPSS implementation issue, of course)
> -it is not sensitive to unbalanced datasets (no need for prior probabilities
> if you have very few positive and very many negative instances in your
> data)
> -it does multiple partitions of the data (CART does only binary partitions)
> -and it performs very well on large datasets because of the simplicity of
> the algorithm (I used it for classification on a dataset of 350,000 rows
> and 200 columns of numeric, ordinal and categorical data)
>
> I searched the scikit-learn GitHub issues for requested features and I
> didn't see any mention of it (
> https://github.com/scikit-learn/scikit-learn/issues/search?q=chaid )
>
> I'm not really an experienced Python developer; I just use Python for data
> cleaning, scraping and running data mining algorithms.
> I don't know if I am experienced enough to develop this feature, but I'll
> be happy to try, or to help the community do it, if you are interested.
>
> What would you recommend I read or do, apart from this guide
> http://scikit-learn.org/stable/developers/index.html#contributing-code,
> if I wanted to contribute this feature?

Sounds interesting. I especially appreciate algorithms that are
scalable to at least medium-sized, real-world datasets :)
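
For reference, the core idea of CHAID as described above (scoring
candidate multiway splits on a categorical feature with a chi-squared
test of independence against the target) could be sketched roughly like
this. This is only a hypothetical illustration using scipy, not
scikit-learn code and not a full implementation (no category merging,
no Bonferroni adjustment):

```python
# Hedged sketch of CHAID's split criterion: build a contingency table of
# feature category vs. class label and score it with a chi-squared test
# of independence. A lower p-value suggests a more predictive split.
import numpy as np
from scipy.stats import chi2_contingency


def chi2_split_score(feature, target):
    """Return the chi-squared p-value for splitting on one categorical feature."""
    categories = sorted(set(feature))
    classes = sorted(set(target))
    # Contingency table: rows = feature categories, columns = target classes.
    table = np.array([[sum(1 for f, t in zip(feature, target)
                           if f == cat and t == cls)
                       for cls in classes]
                      for cat in categories])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


color = ["red", "red", "blue", "blue", "green", "green"]
label = [1, 1, 0, 0, 0, 0]
print(chi2_split_score(color, label))  # small p-value: color predicts label
```

A real implementation would evaluate this score for every candidate
feature, merge statistically similar categories, and split on the
feature with the most significant (adjusted) p-value.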

If you feel like contributing an implementation of this, please read
the following guide carefully:

  http://scikit-learn.org/dev/developers/index.html

Also have a look at the existing pull requests (even if completely unrelated):

  https://github.com/scikit-learn/scikit-learn/pulls

It's a good way to understand how the contribution / reviewing process
works in practice.

Beware that each contribution will have to be maintained in the future,
and so adds a burden for the developers of the project. This burden can
only be alleviated by extensive documentation, tests, usage examples,
and API and variable names consistent with the rest of the project.
Hence don't expect a quick code-submit-and-forget contribution
process.

Also, more specific to this particular algorithm: in scikit-learn,
categorical features are traditionally encoded as one-hot binary
features stored in a scipy.sparse matrix. This data structure is a bit
peculiar, so you might want to have a look at existing estimators that
are able to deal with it before engaging in the design process.
Dict-like representations (commonly used in data mining) can be
converted into sparse data with the DictVectorizer class:
http://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
