Hi Vikram.
Thanks for sending in your idea.
I am not so familiar with the differences between C4.5 and C5.0, and they
are not very clear to me from the webpage.
From what I understand, C5.0 features include:
- sample weights (called case weights), for which already a pull request
exists <https://github.com/scikit-learn/scikit-learn/pull/522>.
- class weights
- support for (more) different data types
From that, I cannot see why the algorithm should be faster or more
accurate.
Can you elaborate a bit more on the differences? Also, it is not really
clear to me whether the algorithm uses any pruning.
Class weights would be a neat feature but that should not be _too_ much
work.
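To illustrate why I think class weights are cheap once sample weights
exist, here is a minimal sketch. The function names are made up for this
mail and are not part of scikit-learn or of the pull request; the Gini
function only shows how a weighted impurity could use the result.

```python
import numpy as np


def class_weights_to_sample_weights(y, class_weight):
    """Expand a {label: weight} mapping into one weight per sample.

    Hypothetical helper for illustration; labels missing from the
    mapping keep a weight of 1.0.
    """
    y = np.asarray(y)
    weights = np.ones(len(y), dtype=float)
    for label, w in class_weight.items():
        weights[y == label] = w
    return weights


def weighted_gini(y, sample_weight):
    """Gini impurity in which each sample counts with its weight."""
    y = np.asarray(y)
    sample_weight = np.asarray(sample_weight, dtype=float)
    total = sample_weight.sum()
    if total == 0:
        return 0.0
    # Weighted class proportions instead of plain counts.
    p = np.array([sample_weight[y == label].sum()
                  for label in np.unique(y)]) / total
    return 1.0 - np.sum(p ** 2)


y = np.array([0, 0, 0, 1])
w = class_weights_to_sample_weights(y, {1: 3.0})
print(w)                                 # one weight per sample
print(weighted_gini(y, np.ones(len(y))))  # unweighted impurity
print(weighted_gini(y, w))               # impurity with class 1 upweighted
```

So, assuming the criterion already accepts per-sample weights, class
weights would only need this expansion step in front of it.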
Supporting different data types seems a bit out of the scope of
scikit-learn to me
at the current time. I'm not sure how these could be implemented using
numpy arrays.
If you are interested in working on the tree module, there is something that
I think would be interesting to explore, though I am not familiar enough
with
the code to be certain that this is a good idea.
I have the hunch that using a different data structure in the trees could
significantly speed things up. Currently, arrays with boolean masks are
used, where I think lists/iterators might be more appropriate.
That would be quite a project, though, and require some good knowledge of
Cython.
I'd like to hear the opinion from the "tree-people" on that, though :)
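To make the hunch a bit more concrete, here is a toy comparison in plain
NumPy. It is not the actual tree code, the function names are invented
for this mail, and whether the index-based variant is really faster in
the Cython implementation would have to be benchmarked.

```python
import numpy as np


def split_with_mask(X, mask, feature, threshold):
    """Mask-based split: each node carries a boolean mask of length
    n_samples, so the comparison touches every sample at every node."""
    go_left = X[:, feature] <= threshold
    return mask & go_left, mask & ~go_left


def split_with_indices(X, indices, feature, threshold):
    """Index-based split: each node carries only the indices of the
    samples it contains, so the work shrinks with the node size."""
    go_left = X[indices, feature] <= threshold
    return indices[go_left], indices[~go_left]


rng = np.random.RandomState(0)
X = rng.rand(1000, 5)

# Root split with a mask over all samples.
root_mask = np.ones(len(X), dtype=bool)
left_mask, right_mask = split_with_mask(X, root_mask, feature=2, threshold=0.5)

# The same root split with an index array.
root_idx = np.arange(len(X))
left_idx, right_idx = split_with_indices(X, root_idx, feature=2, threshold=0.5)

# Both representations describe the same partition.
print(np.array_equal(np.flatnonzero(left_mask), left_idx))
print(np.array_equal(np.flatnonzero(right_mask), right_idx))
```

With masks, a node deep in the tree still does work proportional to
n_samples; with index arrays the work is proportional to the number of
samples that actually reached the node.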
So to wrap up, could you please explain what the differences between
C5.0 and
the current method are and how using C5.0 would improve things?
Cheers,
Andy
On 03/09/2012 02:21 PM, Vikram Kamath wrote:
Hi,
My name is Vikram and I'm a prospective GSoc applicant. I just
wanted the community's thoughts on my proposal and whether it would be
a useful feature to implement. I spoke to Nelle on IRC and she
suggested I propose it on the mailing list, so here goes:
Currently, Scikit-Learn uses an optimised version of the CART Decision
Tree algorithm
<http://scikit-learn.sourceforge.net/dev/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart.>
. C4.5 is another decision tree algorithm which creates m-ary trees,
the advantage of which is dependent on the domain of usage. C5.0 is a
logical successor to C4.5 and is said to be much better than C4.5
<http://www.rulequest.com/see5-comparison.html>.
Ross Quinlan (who developed C4.5 and C5.0) has released a C version of
the code under the GPL. My proposal would be to implement C5.0 and
hence make it a part of scikit-learn. My proposal would also include
creating some documentation/examples for the same. From my
correspondence with Ross Quinlan, I understand that the documentation
of C5.0 hasn't been released under the GPL and hence only a link can
be provided to it.
Any thoughts/advice on the above would be greatly appreciated.
Thanks
Vikram Kamath
--
Vikram Kamath
#401, Vinyas Renaissance,
Jnanabharati Main Road,
Bangalore 560072
Phone: 9036823513/08023390047
Email: [email protected] <mailto:[email protected]>
------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general