Hi,
I have written a
proposal<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2014-Application:-Sparse-matrix-support-for-decision-tree>for
gsoc, can someone please review it .
cheers,
kaushik varanasi
On Wed, Mar 19, 2014 at 4:50 AM, vamsi kaushik
<kaushik.varana...@gmail.com>wrote:
> Hi Arnaud,
>
> I am looking forward to include these in my proposal
> http://cam.zcu.cz/~brandner/PSfiles/pooch.pdf
> http://arxiv.org/pdf/1301.6685v1.pdf
>
> as algorithms for sparse storage within the decision tree. Have your say
> on this one
>
> cheers,
> kaushik Varanasi
>
>
> On Tue, Mar 18, 2014 at 7:45 PM, vamsi kaushik <
> kaushik.varana...@gmail.com> wrote:
>
>> Hi Arnaud,
>>
>> I just saw your mail regarding the fest implementation and dropped it
>> from the proposal.
>> Im currently working on my proposal and will post it on the wiki by
>> tonight.
>> The core idea is to alter the _tree.pyx module and then extend sparse
>> support to ensemble methods and im also new to the community so i couldn't
>> fix a lot of patches.
>> I would like to know if there are other important points to mention in
>> the proposal regarding the project
>>
>> cheers,
>> kaushik Varanasi
>>
>>
>> On Mon, Mar 17, 2014 at 7:49 PM, vamsi kaushik <
>> kaushik.varana...@gmail.com> wrote:
>>
>>> Hi,
>>> And of the fest implementation i would say we just rewrite the algorithm
>>> in cython instead of writing wrappers for it, this dosn't need licence and
>>> is readable for other contributors and also easily changeable in future.
>>> This boils down to a few 500 lines of code in cython.
>>>
>>>
>>> On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik <
>>> kaushik.varana...@gmail.com> wrote:
>>>
>>>> hi arnaud,
>>>>
>>>> yeah i have already gone through that implementation. I feel at least
>>>> we should have separate splitter classes for the node splitting to be more
>>>> efficient and flexible.
>>>> And as of the choice of matrix, I don't have one in mind but mostly the
>>>> elements are sorted along the feature columns. so I would say, csr maybe.
>>>> But there can be
>>>> other efficient formats too.
>>>>
>>>> And as of the input matrices, we can follow the same procedure followed
>>>> in
>>>>
>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
>>>>
>>>> so as to not break the user code. And moreover i have read somewhere on
>>>> the internet about sparse coding format, is it possible to use a more
>>>> efficient sparse structure within the tree algorithm. But the input types
>>>> remain the same, they get converted to another sparse structure once they
>>>> are passed to the tree, since we are working with non-zero element
>>>> pointers, this should not be memory/time inefficient. But my question is,
>>>> is this acceptable?
>>>>
>>>>
>>>> On Mon, Mar 17, 2014 at 6:49 PM, Arnaud Joly <a.j...@ulg.ac.be> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> The support for sparse matrices should exploit as much as possible the
>>>>> sparsity structure of the matrix
>>>>> without blowing up memory as to obtain optimal performance. Note that
>>>>> the choice of the best sparse matrix
>>>>> representation is still open (csc, csr, coo, ... matrix?).
>>>>>
>>>>> To my knowldege, there is one implementation of sparse decision tree
>>>>> https://github.com/fest/fest. But I don't think that the license is
>>>>> BSD compatible.
>>>>>
>>>>> If input sparsity is added, adaboost, GBRT, bagging and random
>>>>> forest should support sparse input data.
>>>>>
>>>>> Best,
>>>>> Arnaud
>>>>>
>>>>>
>>>>> On 16 Mar 2014, at 02:49, vamsi kaushik <kaushik.varana...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi arnaud, Eltermann
>>>>>
>>>>> Thanks for the reply. Firstly I am a noob here so excuse me for any
>>>>> stupid questions. Eltermann Just my 2 cents, I felt your approach(in
>>>>> BestSplitter ) somewhat inefficient when you are calling the
>>>>> get_sparse_item. Firstly even though we have indices of all the non zero
>>>>> items, it is being called node_samples times, iterating over nnz times. An
>>>>> easier way of doing this would be to sort the samples in sparse matrix
>>>>> along features and then extracting if necessary. because the whole point
>>>>> in
>>>>> using Xf is to sort them which becomes difficult when they are in an n-d
>>>>> array. like for example
>>>>>
>>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
>>>>> gives the non zero elements. My solution might not be perfect but my
>>>>> point is when we are adding support for sparse matrices, why not exploit
>>>>> their structure as much as we can. Arnaud is this feasible ? Eltermann
>>>>> anything wrong in my thinking ?
>>>>>
>>>>>
>>>>> cheers,
>>>>> kaushik varanasi
>>>>> On Wed, Mar 12, 2014 at 8:29 PM, Arnaud Joly <a.j...@ulg.ac.be> wrote:
>>>>>
>>>>>> For the number of contributions, I would advise you to do as many
>>>>>> as possible to get familiar with the codebase and scikit-learn
>>>>>> practices.
>>>>>> Of course starting to think and to work on how to add sparsity
>>>>>> support to decision tree
>>>>>> is important to grasp the challenges and write a convincing proposal.
>>>>>> Thus, the right proportion is difficult to suggest.
>>>>>>
>>>>>> The goal of this GSOC is to have an efficient and a maintenable
>>>>>> implementation.
>>>>>> The overall project plan should be dictated by this goal.
>>>>>>
>>>>>> Best,
>>>>>> Arnaud
>>>>>>
>>>>>> On 12 Mar 2014, at 13:44, Gilles Louppe <g.lou...@gmail.com> wrote:
>>>>>>
>>>>>> > On 12 March 2014 13:08, Felipe Eltermann <
>>>>>> felipe.elterm...@gmail.com> wrote:
>>>>>> >> Hello Vamsi,
>>>>>> >>
>>>>>> >> "Firstly, regarding the implementation of sparse functions.
>>>>>> _tree.pxy is the
>>>>>> >> back end cython code to handle the operations Splitting, Evaluating
>>>>>> >> impurities at nodes and then constructing the tree."
>>>>>> >> That's right.
>>>>>> >>
>>>>>> >> "The basic implementation that i have in mind is that we duplicate
>>>>>> the
>>>>>> >> splitter classes and The tree architecture itself."
>>>>>> >> Duplicating the code doesn't seem to be the best approach. It will
>>>>>> be
>>>>>> >> simpler if the same splitter class should handle both dense and
>>>>>> sparse
>>>>>> >> matrices.
>>>>>> >> I started a PR that is already in progress:
>>>>>> >> https://github.com/scikit-learn/scikit-learn/pull/2848
>>>>>> >> If you want to help, you can read up the commits and help on the
>>>>>> open
>>>>>> >> issues.
>>>>>> >> (however, I don't think this apply as a GSoC project though)
>>>>>> >
>>>>>> > Just my 2 cents, but I am not sure having both dense and sparse
>>>>>> > support in the same codebase is the best choice. Both
>>>>>> implementations
>>>>>> > should in my opinion be separated, with the common code factored out
>>>>>> > to avoid code dupplication.
>>>>>> >
>>>>>> > In particular, having an *efficient* sparse splitter will require
>>>>>> > specific changes to the splitter. I am really not sure having this
>>>>>> > code intermingled with the dense support is the best option.
>>>>>> >
>>>>>> >>
>>>>>> >> On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
>>>>>> kaushik.varana...@gmail.com>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> hi Joly,
>>>>>> >>>
>>>>>> >>> I actually have a few questions:
>>>>>> >>>
>>>>>> >>> Firstly, regarding the implementation of sparse functions.
>>>>>> _tree.pxy is
>>>>>> >>> the back end cython code to handle the operations Splitting,
>>>>>> Evaluating
>>>>>> >>> impurities at nodes and then constructing the tree. The basic
>>>>>> implementation
>>>>>> >>> that i have in mind is that we duplicate the splitter classes and
>>>>>> The tree
>>>>>> >>> architecture itself. At each leaf node we might have a sparse
>>>>>> matrix
>>>>>> >>> representing the output data and at every non-leaf node,
>>>>>> splitting should be
>>>>>> >>> done on the sparse matrix at that node and produce few other
>>>>>> nodes with
>>>>>> >>> sparse output. So the project boils down to implementing good
>>>>>> access and
>>>>>> >>> construction methods for sparse matrices in cython. I just want
>>>>>> to know if
>>>>>> >>> this is a possible implementation criterion ?. And i would also
>>>>>> like to know
>>>>>> >>> the implementation plan that you have in mind ?
>>>>>> >>>
>>>>>> >>> Secondly, In my previous mail i have shown you my contributions.
>>>>>> I would
>>>>>> >>> just like to know if the patches are upto the mark for a
>>>>>> selection. If they
>>>>>> >>> are, i will turn my attention to the DT sparse problem, else i
>>>>>> will work on
>>>>>> >>> the issues a bit more.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Tue, Mar 11, 2014 at 1:12 PM, Arnaud Joly <a.j...@ulg.ac.be>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Thanks for your contribution.
>>>>>> >>>>
>>>>>> >>>> Keep up!
>>>>>> >>>>
>>>>>> >>>> Arnaud
>>>>>> >>>>
>>>>>> >>>> On 10 Mar 2014, at 23:51, vamsi kaushik <
>>>>>> kaushik.varana...@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>>
>>>>>> >>>> My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
>>>>>> >>>>
>>>>>> >>>> And i have already contributed a little.
>>>>>> >>>> Please take a look at :
>>>>>> >>>>
>>>>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2905
>>>>>> >>>>
>>>>>> >>>> and also:
>>>>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2895
>>>>>> >>>> https://github.com/scikit-learn/scikit-learn/issues/2727
>>>>>> >>>>
>>>>>> >>>> cheers,
>>>>>> >>>> kaushik
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
>>>>>> >>>> <kaushik.varana...@gmail.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>> hi Arnaud,
>>>>>> >>>>>
>>>>>> >>>>> Vamsi kaushik is actually me.
>>>>>> >>>>> Thanks for your reply, i'll get to the issue soon
>>>>>> >>>>>
>>>>>> >>>>> cheers,
>>>>>> >>>>> vamsi kaushik
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly <a.j...@ulg.ac.be>
>>>>>> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Hi,
>>>>>> >>>>>>
>>>>>> >>>>>> Anything concerning the GSOC should pass by the scikit-learn
>>>>>> >>>>>> mailing list.
>>>>>> >>>>>>
>>>>>> >>>>>> Thanks for your interest in the subject. If you intend to
>>>>>> apply for a
>>>>>> >>>>>> GSOC, I suggest you to read
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
>>>>>> >>>>>> and start contributing to scikit-learn.
>>>>>> >>>>>>
>>>>>> >>>>>> Right now, three people have shown interest for this topic:
>>>>>> Maheshakya
>>>>>> >>>>>> (who is now
>>>>>> >>>>>> applying for a GSOC about LSH), Vamsi Kaushik and you.
>>>>>> Several
>>>>>> >>>>>> candidates may apply for the
>>>>>> >>>>>> same subject. However it is likely that if a GSOC is awarded
>>>>>> for a
>>>>>> >>>>>> given subject, only the best proposal will
>>>>>> >>>>>> be selected.
>>>>>> >>>>>>
>>>>>> >>>>>> Best regards,
>>>>>> >>>>>> Arnaud
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>> On 28 Feb 2014, at 22:33, João <joaopalo...@gmail.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Hi Arnaud,
>>>>>> >>>>>>
>>>>>> >>>>>> While browsing the current GSoC projects I saw an interesting
>>>>>> one for
>>>>>> >>>>>> which you are assigned as mentor: "Add scipy.sparse matrix
>>>>>> support to the
>>>>>> >>>>>> Decision Tree".
>>>>>> >>>>>>
>>>>>> >>>>>> I am considering applying for this project as I have already
>>>>>> faced the
>>>>>> >>>>>> necessity of sparse matrices in sklearn and the outcome was
>>>>>> not totally
>>>>>> >>>>>> satisfactory.
>>>>>> >>>>>> Right now I am participating in a kaggle contest
>>>>>> >>>>>> (http://www.kaggle.com/c/lshtc/) and facing several
>>>>>> difficulties even with
>>>>>> >>>>>> algorithms that already support spase matrices (even simple
>>>>>> algorithms such
>>>>>> >>>>>> as NB). In general I get some memory error and sometimes
>>>>>> segfaults.
>>>>>> >>>>>> I would be happy if I could implement the necessary support
>>>>>> for DT (and
>>>>>> >>>>>> related algorithms such as random forest and extra trees) and,
>>>>>> if the time
>>>>>> >>>>>> constraint allows, improve as much as possible the general
>>>>>> support for
>>>>>> >>>>>> sparse matrices.
>>>>>> >>>>>>
>>>>>> >>>>>> With this email, I want to show my interest and ask you if
>>>>>> there are
>>>>>> >>>>>> already any candidates for this place.
>>>>>> >>>>>>
>>>>>> >>>>>> Best regards,
>>>>>> >>>>>> --
>>>>>> >>>>>> João
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> >>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> >>>>>> "Graph Databases" is the definitive new guide to graph
>>>>>> databases and
>>>>>> >>>>>> their
>>>>>> >>>>>> applications. Written by three acclaimed leaders in the field,
>>>>>> >>>>>> this first edition is now available. Download your free book
>>>>>> today!
>>>>>> >>>>>> http://p.sf.net/sfu/13534_NeoTech
>>>>>> >>>>>> _______________________________________________
>>>>>> >>>>>> Scikit-learn-general mailing list
>>>>>> >>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>> >>>>>>
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>> >>>>>>
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> >>>> "Graph Databases" is the definitive new guide to graph databases
>>>>>> and
>>>>>> >>>> their
>>>>>> >>>> applications. Written by three acclaimed leaders in the field,
>>>>>> >>>> this first edition is now available. Download your free book
>>>>>> today!
>>>>>> >>>>
>>>>>> >>>>
>>>>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>>>>>> >>>> Scikit-learn-general mailing list
>>>>>> >>>> Scikit-learn-general@lists.sourceforge.net
>>>>>> >>>>
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> >>>> "Graph Databases" is the definitive new guide to graph databases
>>>>>> and
>>>>>> >>>> their
>>>>>> >>>> applications. Written by three acclaimed leaders in the field,
>>>>>> >>>> this first edition is now available. Download your free book
>>>>>> today!
>>>>>> >>>> http://p.sf.net/sfu/13534_NeoTech
>>>>>> >>>> _______________________________________________
>>>>>> >>>> Scikit-learn-general mailing list
>>>>>> >>>> Scikit-learn-general@lists.sourceforge.net
>>>>>> >>>>
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> >>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> >>> "Graph Databases" is the definitive new guide to graph databases
>>>>>> and their
>>>>>> >>> applications. Written by three acclaimed leaders in the field,
>>>>>> >>> this first edition is now available. Download your free book
>>>>>> today!
>>>>>> >>> http://p.sf.net/sfu/13534_NeoTech
>>>>>> >>> _______________________________________________
>>>>>> >>> Scikit-learn-general mailing list
>>>>>> >>> Scikit-learn-general@lists.sourceforge.net
>>>>>> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>> >>>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> ------------------------------------------------------------------------------
>>>>>> >> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> >> "Graph Databases" is the definitive new guide to graph databases
>>>>>> and their
>>>>>> >> applications. Written by three acclaimed leaders in the field,
>>>>>> >> this first edition is now available. Download your free book today!
>>>>>> >> http://p.sf.net/sfu/13534_NeoTech
>>>>>> >> _______________________________________________
>>>>>> >> Scikit-learn-general mailing list
>>>>>> >> Scikit-learn-general@lists.sourceforge.net
>>>>>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>> ------------------------------------------------------------------------------
>>>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> > "Graph Databases" is the definitive new guide to graph databases
>>>>>> and their
>>>>>> > applications. Written by three acclaimed leaders in the field,
>>>>>> > this first edition is now available. Download your free book today!
>>>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>>>> > _______________________________________________
>>>>>> > Scikit-learn-general mailing list
>>>>>> > Scikit-learn-general@lists.sourceforge.net
>>>>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>>> their
>>>>>> applications. Written by three acclaimed leaders in the field,
>>>>>> this first edition is now available. Download your free book today!
>>>>>> http://p.sf.net/sfu/13534_NeoTech
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>> their
>>>>> applications. Written by three acclaimed leaders in the field,
>>>>> this first edition is now available. Download your free book today!
>>>>>
>>>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>> their
>>>>> applications. Written by three acclaimed leaders in the field,
>>>>> this first edition is now available. Download your free book today!
>>>>> http://p.sf.net/sfu/13534_NeoTech
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>
>>
>
------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general