Re: [Scikit-learn-general] GSoc Idea

Andreas Sat, 10 Mar 2012 10:27:06 -0800

Hi Vikram.
Thanks for sending in your idea.

I am not so familiar with the differences between C4.5 and C5.0, and they are
not very clear to me from the webpage.
From what I understand, C5.0 features include:

- sample weights (called case weights), for which a pull request already exists <https://github.com/scikit-learn/scikit-learn/pull/522>.

- class weights
- support for (more) different data types

From that, I cannot see why the algorithm should be faster or more accurate.
Can you elaborate a bit more on the differences? Also, it is not really clear
to me whether the algorithm uses any pruning.

Class weights would be a neat feature but that should not be _too_ much work.
Supporting different data types seems a bit out of the scope of scikit-learn
to me at the current time. I'm not sure how these could be implemented using
numpy arrays.
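
A minimal sketch of why class weights look like a small step once sample
weights are supported: a per-class weight can be expanded into one weight per
sample. The helper name class_to_sample_weight and the dict-based interface
are assumptions made for illustration, not an existing scikit-learn API.

```python
import numpy as np


def class_to_sample_weight(y, class_weight):
    """Expand a {label: weight} dict into one weight per sample.

    Hypothetical helper: labels missing from the dict keep weight 1.0.
    """
    y = np.asarray(y)
    weights = np.ones(y.shape[0], dtype=float)
    for label, w in class_weight.items():
        # Every sample of this class receives the class weight.
        weights[y == label] = w
    return weights


y = np.array([0, 0, 1, 2, 1])
sample_weight = class_to_sample_weight(y, {0: 1.0, 1: 2.5})
```

The resulting array could then be handed to any estimator that accepts
per-sample weights, which is why the two features overlap so much.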


If you are interested in working on the tree module, there is something that
I think would be interesting to explore, though I am not familiar enough with
the code to be certain that this is a good idea.

I have the hunch that using a different data structure in the trees could
significantly speed things up. Currently, arrays with boolean masks are used,
where I think lists/iterators might be more appropriate.
That would be quite a project, though, and require some good knowledge of
Cython.
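
To make the hunch concrete, here is a rough numpy sketch of the two ways a
node's samples could be tracked while splitting. It is not the actual tree
code; the function names are hypothetical, and an index array stands in for
the "lists/iterators" idea. The point is only that a full-length mask touches
all samples at every node, while an index array touches just the samples that
reached the node.

```python
import numpy as np


def split_with_mask(X, node_mask, feature, threshold):
    """Split a node tracked by a boolean mask of length n_samples."""
    # The comparison runs over every sample, even those not in this node.
    go_left = X[:, feature] <= threshold
    return node_mask & go_left, node_mask & ~go_left


def split_with_indices(X, node_idx, feature, threshold):
    """Split a node tracked by an array holding only its sample indices."""
    # The comparison runs only over the samples that reached this node.
    go_left = X[node_idx, feature] <= threshold
    return node_idx[go_left], node_idx[~go_left]


X = np.array([[0.1], [0.7], [0.4], [0.9], [0.2]])
root_mask = np.ones(len(X), dtype=bool)
root_idx = np.arange(len(X))

left_mask, right_mask = split_with_mask(X, root_mask, 0, 0.5)
left_idx, right_idx = split_with_indices(X, root_idx, 0, 0.5)
```

Both variants produce the same partition; whether the index-based one is
actually faster in practice would have to be measured.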

I'd like to hear the opinion from the "tree-people" on that, though :)

So to wrap up, could you please explain what the differences between C5.0 and
the current method are and how using C5.0 would improve things?


Cheers,
Andy



On 03/09/2012 02:21 PM, Vikram Kamath wrote:

Hi,
My name is Vikram and I'm a prospective GSoC applicant. I just wanted the community's thoughts on my proposal and whether it would be a useful feature to implement. I spoke to Nelle on IRC and she suggested I propose it on the mailing list, so here goes:

Currently, Scikit-Learn uses an optimised version of the CART Decision Tree algorithm <http://scikit-learn.sourceforge.net/dev/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart.>. C4.5 is another decision tree algorithm which creates m-ary trees, the advantage of which is dependent on the domain of usage. C5.0 is a logical successor to C4.5 and is said to be much better than C4.5 <http://www.rulequest.com/see5-comparison.html>.

Ross Quinlan (who developed C4.5 and C5.0) has released a C version of the code under the GPL. My proposal would be to implement C5.0 and hence make it a part of scikit-learn. My proposal would also include creating some documentation/examples for the same. From my correspondence with Ross Quinlan, I understand that the documentation of C5.0 hasn't been released under the GPL and hence only a link can be provided to it.

Any thoughts/advice on the above would be greatly appreciated.


Thanks
Vikram Kamath

--
Vikram Kamath
#401, Vinyas Renaissance,
Jnanabharati Main Road,
Bangalore 560072
Phone: 9036823513/08023390047
Email: [email protected] <mailto:[email protected]>


------------------------------------------------------------------------------
Virtualization&  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

