Hi Vikram.
Thanks for sending in your idea.
I am not so familiar with the differences between C4.5 and C5.0, and they are
not very clear to me from the webpage.
From what I understand, C5.0 features include:
- sample weights (called case weights), for which already a pull request exists <https://github.com/scikit-learn/scikit-learn/pull/522>.
- class weights
- support for (more) different data types

From that, I cannot see why the algorithm should be faster or more accurate. Can you elaborate a bit more on the differences? Also, it is not really clear
to me whether the algorithm uses any pruning.

Class weights would be a neat feature but that should not be _too_ much work. Supporting different data types seems a bit out of the scope of scikit-learn to me at the current time. I'm not sure how these could be implemented using numpy arrays.
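
To illustrate why class weights should be little work once sample weights are supported, here is a minimal sketch. It is not the scikit-learn implementation; the helper names `class_to_sample_weights` and `weighted_gini` are hypothetical and exist only for this example. The idea is that a class weight is just a sample weight shared by all samples of one class, so a weighted impurity criterion covers both cases.

```python
import numpy as np


def class_to_sample_weights(y, class_weight):
    """Expand a {label: weight} dict into one weight per sample.

    Hypothetical helper for illustration only.
    """
    y = np.asarray(y)
    return np.array([class_weight[label] for label in y], dtype=float)


def weighted_gini(y, sample_weight):
    """Gini impurity where each sample counts with its weight."""
    y = np.asarray(y)
    sample_weight = np.asarray(sample_weight, dtype=float)
    total = sample_weight.sum()
    if total == 0.0:
        return 0.0
    # Weighted class proportions instead of plain counts.
    p = np.array([sample_weight[y == label].sum()
                  for label in np.unique(y)]) / total
    return 1.0 - np.sum(p ** 2)


y = np.array([0, 0, 0, 1])
uniform = np.ones(len(y))
balanced = class_to_sample_weights(y, {0: 1.0, 1: 3.0})

gini_uniform = weighted_gini(y, uniform)    # classes weigh 3 vs. 1
gini_balanced = weighted_gini(y, balanced)  # classes weigh 3 vs. 3
```

With the weights above, the minority class counts as much as the majority class, so the weighted impurity is higher than the unweighted one.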

If you are interested in working on the tree module, there is something that
I think would be interesting to explore, though I am not familiar enough with
the code to be certain that this is a good idea.

I have the hunch that using a different data structure in the trees could
significantly speed things up. Currently, arrays with boolean masks are used,
where I think lists/iterators might be more appropriate.
That would be quite a project, though, and require some good knowledge of
Cython.
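
A rough sketch of what I mean, in plain numpy rather than Cython. This is not the current tree code; both functions are hypothetical and only contrast the two ways of representing which samples belong to a node. With masks, every node carries an array of length n_samples; with index arrays, a node only touches its own samples.

```python
import numpy as np


def split_with_mask(X, mask, feature, threshold):
    """Children as full-length boolean masks.

    The comparison runs over all n_samples at every node,
    no matter how few samples the node actually holds.
    """
    go_left = X[:, feature] <= threshold
    return mask & go_left, mask & ~go_left


def split_with_indices(X, indices, feature, threshold):
    """Children as index arrays.

    Only the samples that reached this node are compared.
    """
    go_left = X[indices, feature] <= threshold
    return indices[go_left], indices[~go_left]


X = np.array([[0.1], [0.7], [0.4], [0.9]])

root_mask = np.ones(len(X), dtype=bool)
left_mask, right_mask = split_with_mask(X, root_mask, 0, 0.5)

root_idx = np.arange(len(X))
left_idx, right_idx = split_with_indices(X, root_idx, 0, 0.5)
```

Both variants select the same samples; the difference is only in how much work is done per node deep in the tree. Whether this pays off in practice would need benchmarking.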

I'd like to hear the opinion from the "tree-people" on that, though :)

So to wrap up, could you please explain what the differences between C5.0 and
the current method are and how using C5.0 would improve things?


Cheers,
Andy



On 03/09/2012 02:21 PM, Vikram Kamath wrote:
Hi,
My name is Vikram and I'm a prospective GSoC applicant. I just wanted the community's thoughts on my proposal and whether it would be a useful feature to implement. I spoke to Nelle on IRC and she suggested I propose it on the mailing list, so here goes:

Currently, scikit-learn uses an optimised version of the CART decision tree algorithm <http://scikit-learn.sourceforge.net/dev/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart.> . C4.5 is another decision tree algorithm, which creates m-ary trees; how much of an advantage that is depends on the application domain. C5.0 is the successor to C4.5 and is said to be much better than C4.5 <http://www.rulequest.com/see5-comparison.html>. Ross Quinlan (who developed C4.5 and C5.0) has released a C version of the code under the GPL.

My proposal would be to implement C5.0 and make it a part of scikit-learn. It would also include creating documentation and examples for it. From my correspondence with Ross Quinlan, I understand that the documentation of C5.0 hasn't been released under the GPL and hence only a link can be provided to it.

Any thoughts/advice on the above would be greatly appreciated.


Thanks
Vikram Kamath

--
Vikram Kamath
#401, Vinyas Renaissance,
Jnanabharati Main Road,
Bangalore 560072
Phone: 9036823513/08023390047
Email: [email protected] <mailto:[email protected]>


------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
