Re: [Scikit-learn-general] GSoc Idea

Peter Prettenhofer Sun, 11 Mar 2012 09:08:02 -0700

2012/3/10 Andreas <[email protected]>

> Hi Vikram.
> Thanks for sending in your idea.
> I am not so familiar with the differences between C4.5 and C5.0, and they
> are not very clear to me from the webpage.
> From what I understand, C5.0 features include:
> - sample weights (called case weights), for which already a pull request
> exists <https://github.com/scikit-learn/scikit-learn/pull/522>.
> - class weights
> - support for (more) different data types
>
> From that, I cannot see why the algorithm should be faster or more
> accurate.
> Can you elaborate a bit more on the differences? Also, it is not really
> clear to me whether the algorithm uses any pruning.
>



Hi everybody,

it would be great indeed if you could elaborate on the differences between
CART and C5.0 in more detail. According to the benchmarks, the major
advantage of C5.0 is efficiency (both memory and runtime) - I'd like to
note that our current implementation is rather naive and has a lot of
potential for efficiency enhancements (e.g. it's especially inefficient for
features with few potential split points).
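
To illustrate what "few potential split points" means, here is a minimal
sketch in plain Python. It is not the scikit-learn implementation:
`candidate_thresholds` is a hypothetical helper, and it assumes that candidate
thresholds are the midpoints between consecutive distinct feature values.

```python
def candidate_thresholds(values):
    """Return midpoints between consecutive distinct values of one feature.

    Hypothetical helper for illustration only.
    """
    distinct = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]


# A one-hot (0/1) column has exactly one candidate threshold ...
one_hot = [0, 1, 0, 0, 1, 1, 0, 1]
# ... while a continuous column with n distinct values has n - 1 of them.
continuous = [2.5, 0.1, 3.7, 1.2, 4.4, 0.9, 5.0, 2.6]

print(candidate_thresholds(one_hot))          # [0.5]
print(len(candidate_thresholds(continuous)))  # 7
```

A split search that sorts the column and walks every sample does far more
work than the single threshold of a one-hot feature needs.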

I find the benchmark results on the C5.0 site rather impressive: 3 seconds
for covertype is quite fast. Our current CART implementation requires 20
seconds on my 1.8GHz i7, its error is worse (7.4% vs. 6.1%), and it
creates substantially more nodes (17928 vs. 9185).



>
> Class weights would be a neat feature but that should not be _too_ much
> work.
> Supporting different data types seems a bit out of the scope of
> scikit-learn to me at the current time. I'm not sure how these could be
> implemented using numpy arrays.
>
> If you are interested in working on the tree module, there is something
> that I think would be interesting to explore, though I am not familiar
> enough with the code to be certain that this is a good idea.
>
> I have the hunch that using a different data structure in the trees could
> significantly speed things up. Currently, arrays with boolean masks are
> used, where I think lists/iterators might be more appropriate.
> That would be quite a project, though, and require some good knowledge of
> Cython.
>

I agree. Boolean masks work OK for features with a large number of
potential splits, but for features with few potential splits (e.g.
covertype has a number of categorical features in one-hot encoding, so
each has only one potential split point) we could do much better.
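
As a rough sketch of the trade-off (plain Python, hypothetical function names,
not the actual tree code): a boolean mask spans all samples, so every node
scans the full column, whereas a list of sample indices only visits the
samples that belong to the node.

```python
def split_with_mask(feature, mask, threshold):
    """Partition a node's samples using boolean masks over ALL samples."""
    left = [m and x <= threshold for x, m in zip(feature, mask)]
    right = [m and x > threshold for x, m in zip(feature, mask)]
    touched = len(feature)  # the whole column is scanned per pass
    return left, right, touched


def split_with_indices(feature, indices, threshold):
    """Partition a node's samples using an explicit list of sample indices."""
    left = [i for i in indices if feature[i] <= threshold]
    right = [i for i in indices if feature[i] > threshold]
    touched = len(indices)  # only the node's own samples are visited per pass
    return left, right, touched


feature = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]  # a one-hot column
node_indices = [1, 2, 7]                  # a small node deep in the tree
node_mask = [i in node_indices for i in range(len(feature))]

mask_left, mask_right, mask_cost = split_with_mask(feature, node_mask, 0.5)
idx_left, idx_right, idx_cost = split_with_indices(feature, node_indices, 0.5)

# Both representations describe the same partition.
assert [i for i, m in enumerate(mask_left) if m] == idx_left
assert [i for i, m in enumerate(mask_right) if m] == idx_right
print(mask_cost, idx_cost)  # 10 3
```

The gap between the two costs grows with tree depth, since nodes shrink while
the mask length stays fixed.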


>
> I'd like to hear the opinion from the "tree-people" on that, though :)
>
> So to wrap up, could you please explain what the differences between C5.0
> and the current method are and how using C5.0 would improve things?
>
>
> Cheers,
> Andy
>
>
>
>
> On 03/09/2012 02:21 PM, Vikram Kamath wrote:
>
> Hi,
>     My name is Vikram and I'm a prospective GSoC applicant. I just wanted
> the community's thoughts on my proposal and whether it would be a useful
> feature to implement. I spoke to Nelle on IRC and she suggested I propose
> it on the mailing list, so here goes:
> Currently, Scikit-Learn uses an optimised version of the CART Decision
> Tree algorithm
> <http://scikit-learn.sourceforge.net/dev/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart.>.
> C4.5 is another decision tree algorithm which creates m-ary trees, the
> advantage of which is dependent on the domain of usage. C5.0 is a logical
> successor to C4.5 and is said to be much better than C4.5
> <http://www.rulequest.com/see5-comparison.html>.
> Ross Quinlan (who developed C4.5 and C5.0) has released a C version of the
> code under the GPL. My proposal would be to implement C5.0 and hence make
> it a part of scikit-learn. My proposal would also include creating some
> documentation/examples for the same. From my correspondence with Ross
> Quinlan, I understand that the documentation of C5.0 hasn't been released
> under the GPL and hence only a link can be provided to it.
>
>        Any thoughts/advice on the above would be greatly appreciated.
>
>
> Thanks
> Vikram Kamath
>
> --
> Vikram Kamath
> #401, Vinyas Renaissance,
> Jnanabharati Main Road,
> Bangalore 560072
> Phone: 9036823513/08023390047
> Email: [email protected]
>
>
> ------------------------------------------------------------------------------
> Virtualization & Cloud Management Using Capacity Planning
> Cloud computing makes use of virtualization - but cloud computing
> also focuses on allowing computing to be delivered as a service.
> http://www.accelacomm.com/jaw/sfnl/114/51521223/
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


-- 
Peter Prettenhofer

