Hi Olivier,
OK. I forked scikit-learn on GitHub, so I can follow pull requests and other
activity.
If I could contribute this feature, I would be glad to maintain it with
documentation, tests, usage examples, etc.
I think it's a very interesting algorithm and it's worth devoting time to.
Oh yes, I'm aware of DictVectorizer. I stumbled upon it while trying to
classify that dataset I told you about, so I had to figure out how to make it
work.
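For anyone following the thread, here is a rough plain-Python sketch of the
encoding idea as I understand it: string-valued features become one binary
column per "name=value" pair, numeric features keep a single column, and each
row stores only its non-zero entries. The helper name one_hot_from_dicts and
the toy records are made up for illustration; in scikit-learn itself,
DictVectorizer().fit_transform(records) does the real work and returns a
scipy.sparse matrix.

```python
def one_hot_from_dicts(records):
    """Encode feature dicts as sparse rows of {column_index: value}.

    Hypothetical helper for illustration only; it is not part of scikit-learn.
    """
    # Collect column names: "name=value" for strings, plain "name" for numbers.
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    # Build one sparse row per record, storing only the non-zero entries.
    rows = []
    for rec in records:
        row = {}
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return feature_names, rows


# Toy records (made up): one categorical and one numeric feature.
records = [
    {"region": "north", "income": 3.0},
    {"region": "south", "income": 1.5},
    {"region": "north", "income": 2.0},
]
feature_names, rows = one_hot_from_dicts(records)
```

A CHAID implementation would probably need to map these binary columns back to
the original categorical variable before merging categories, which I suppose is
part of the design question you mention.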
Thanks for the info! Any suggestions will be helpful, since I'm just starting
out with contributing to a library, and even with using GitHub.
Congratulations on the mailing list; it's super responsive, and everyone is
very well informed and has lots of useful suggestions!
Marcos.
On Tue, Sep 4, 2012 at 10:33 AM, Olivier Grisel <[email protected]> wrote:
> 2012/9/4 Marcos Wolff <[email protected]>:
> > Hi,
> >
> > I was wondering if there are plans on implementing CHAID techniques for
> > tree growing
> >
> > http://en.wikipedia.org/wiki/CHAID
> > Gordon Kass 1980's paper:
> > http://ebookbrowse.com/gdoc.php?id=60655988&url=705c072c97190f9f1c59ac51aa72a258
> >
> > SPSS uses it and:
> > -it's very effective for multi-class classification, it out performs CART
> > in every situation (this may be an SPSS implementation issue, of course)
> > -it is not sensitive to unbalanced dataset (no need for prior
> > probabilities if you have very little positives and very much negative
> > instances in your data)
> > -it does multiple partitions on the data (CART does only binary partition)
> > -and performs very well on large datasets because of the simplicity of the
> > algorithm (I used it for classification in dataset of 350.000 rows and 200
> > columns of numbers, ordinal and categorical data)
> >
> > I searched in github scikit issues for requested features and I didn't see
> > mentions to it (
> > https://github.com/scikit-learn/scikit-learn/issues/search?q=chaid )
> >
> > I'm not really an experienced python developer, I just use python for data
> > cleaning, scrapping and for running data mining algorithms.
> > I don't know if I am experienced enough for developing this feature. But,
> > I'll be happy to try or help the community to do it if you are interested.
> >
> > What would you recommend me to read or do, apart from reading this guide
> > http://scikit-learn.org/stable/developers/index.html#contributing-code,
> > if I wanted to contribute developing this feature?
>
> Sounds interesting. I especially appreciate algorithms that are
> scalable to at least medium-sized, real world datasets :)
>
> If you feel like contributing an implementation of this, please read
> carefully the following guide:
>
> http://scikit-learn.org/dev/developers/index.html
>
> Also have a look at the existing pull requests (even if completely
> unrelated):
>
> https://github.com/scikit-learn/scikit-learn/pulls
>
> It's a good way to understand how the contribution / reviewing process
> is working in practice.
>
> Beware that each contribution will have to be maintained in the future
> so will add a burden to the developers of the project. This burden can
> only be alleviated by extensive documentations, tests, usage examples
> and API and variable names consistent with the rest of the project.
> Hence don't expect a fast code, submit and forget contribution
> process.
>
> Also more specific to this particular algorithm: in scikit-learn,
> categorical features are traditionally encoded using 1 hot binary
> features stored in a scipy.sparse matrix. This datastructure is a bit
> peculiar so you might want to have a look at existing implementations
> of estimators that are able to deal with it before engaging in the
> design process. Typically dict-like representations (typically used in
> datamining) can be converted into sparse data using the DictVectorizer
> class:
> http://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>