For the number of contributions, I would advise you to do as many as possible to get familiar with the codebase and scikit-learn practices. Of course starting to think and to work on how to add sparsity support to decision tree is important to grasp the challenges and write a convincing proposal. Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation. The overall project plan should be dictated by this goal. Best, Arnaud On 12 Mar 2014, at 13:44, Gilles Louppe <g.lou...@gmail.com> wrote: > On 12 March 2014 13:08, Felipe Eltermann <felipe.elterm...@gmail.com> wrote: >> Hello Vamsi, >> >> "Firstly, regarding the implementation of sparse functions. _tree.pxy is the >> back end cython code to handle the operations Splitting, Evaluating >> impurities at nodes and then constructing the tree." >> That's right. >> >> "The basic implementation that i have in mind is that we duplicate the >> splitter classes and The tree architecture itself." >> Duplicating the code doesn't seem to be the best approach. It will be >> simpler if the same splitter class should handle both dense and sparse >> matrices. >> I started a PR that is already in progress: >> https://github.com/scikit-learn/scikit-learn/pull/2848 >> If you want to help, you can read up the commits and help on the open >> issues. >> (however, I don't think this apply as a GSoC project though) > > Just my 2 cents, but I am not sure having both dense and sparse > support in the same codebase is the best choice. Both implementations > should in my opinion be separated, with the common code factored out > to avoid code dupplication. > > In particular, having an *efficient* sparse splitter will require > specific changes to the splitter. I am really not sure having this > code intermingled with the dense support is the best option. > >> >> On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <kaushik.varana...@gmail.com> >> wrote: >>> >>> hi Joly, >>> >>> I actually have a few questions: >>> >>> Firstly, regarding the implementation of sparse functions. _tree.pxy is >>> the back end cython code to handle the operations Splitting, Evaluating >>> impurities at nodes and then constructing the tree. The basic implementation >>> that i have in mind is that we duplicate the splitter classes and The tree >>> architecture itself. At each leaf node we might have a sparse matrix >>> representing the output data and at every non-leaf node, splitting should be >>> done on the sparse matrix at that node and produce few other nodes with >>> sparse output. So the project boils down to implementing good access and >>> construction methods for sparse matrices in cython. I just want to know if >>> this is a possible implementation criterion ?. And i would also like to know >>> the implementation plan that you have in mind ? >>> >>> Secondly, In my previous mail i have shown you my contributions. I would >>> just like to know if the patches are upto the mark for a selection. If they >>> are, i will turn my attention to the DT sparse problem, else i will work on >>> the issues a bit more. >>> >>> >>> >>> On Tue, Mar 11, 2014 at 1:12 PM, Arnaud Joly <a.j...@ulg.ac.be> wrote: >>>> >>>> Thanks for your contribution. >>>> >>>> Keep up! >>>> >>>> Arnaud >>>> >>>> On 10 Mar 2014, at 23:51, vamsi kaushik <kaushik.varana...@gmail.com> >>>> wrote: >>>> >>>> My name is actually Varanasi Vamsi Kaushik(yeah its pretty long). >>>> >>>> And i have already contributed a little. >>>> Please take a look at : >>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2905 >>>> >>>> and also: >>>> https://github.com/scikit-learn/scikit-learn/pull/2895 >>>> https://github.com/scikit-learn/scikit-learn/issues/2727 >>>> >>>> cheers, >>>> kaushik >>>> >>>> >>>> On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik >>>> <kaushik.varana...@gmail.com> wrote: >>>>> >>>>> hi Arnaud, >>>>> >>>>> Vamsi kaushik is actually me. >>>>> Thanks for your reply, i'll get to the issue soon >>>>> >>>>> cheers, >>>>> vamsi kaushik >>>>> >>>>> >>>>> On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly <a.j...@ulg.ac.be> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> Anything concerning the GSOC should pass by the scikit-learn >>>>>> mailing list. >>>>>> >>>>>> Thanks for your interest in the subject. If you intend to apply for a >>>>>> GSOC, I suggest you to read >>>>>> >>>>>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014 >>>>>> and start contributing to scikit-learn. >>>>>> >>>>>> Right now, three people have shown interest for this topic: Maheshakya >>>>>> (who is now >>>>>> applying for a GSOC about LSH), Vamsi Kaushik and you. Several >>>>>> candidates may apply for the >>>>>> same subject. However it is likely that if a GSOC is awarded for a >>>>>> given subject, only the best proposal will >>>>>> be selected. >>>>>> >>>>>> Best regards, >>>>>> Arnaud >>>>>> >>>>>> >>>>>> On 28 Feb 2014, at 22:33, João <joaopalo...@gmail.com> wrote: >>>>>> >>>>>> Hi Arnaud, >>>>>> >>>>>> While browsing the current GSoC projects I saw an interesting one for >>>>>> which you are assigned as mentor: "Add scipy.sparse matrix support to the >>>>>> Decision Tree". >>>>>> >>>>>> I am considering applying for this project as I have already faced the >>>>>> necessity of sparse matrices in sklearn and the outcome was not totally >>>>>> satisfactory. >>>>>> Right now I am participating in a kaggle contest >>>>>> (http://www.kaggle.com/c/lshtc/) and facing several difficulties even >>>>>> with >>>>>> algorithms that already support spase matrices (even simple algorithms >>>>>> such >>>>>> as NB). In general I get some memory error and sometimes segfaults. >>>>>> I would be happy if I could implement the necessary support for DT (and >>>>>> related algorithms such as random forest and extra trees) and, if the >>>>>> time >>>>>> constraint allows, improve as much as possible the general support for >>>>>> sparse matrices. >>>>>> >>>>>> With this email, I want to show my interest and ask you if there are >>>>>> already any candidates for this place. >>>>>> >>>>>> Best regards, >>>>>> -- >>>>>> João >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> Learn Graph Databases - Download FREE O'Reilly Book >>>>>> "Graph Databases" is the definitive new guide to graph databases and >>>>>> their >>>>>> applications. Written by three acclaimed leaders in the field, >>>>>> this first edition is now available. Download your free book today! >>>>>> http://p.sf.net/sfu/13534_NeoTech >>>>>> _______________________________________________ >>>>>> Scikit-learn-general mailing list >>>>>> Scikit-learn-general@lists.sourceforge.net >>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >>>>>> >>>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> Learn Graph Databases - Download FREE O'Reilly Book >>>> "Graph Databases" is the definitive new guide to graph databases and >>>> their >>>> applications. Written by three acclaimed leaders in the field, >>>> this first edition is now available. Download your free book today! >>>> >>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________ >>>> Scikit-learn-general mailing list >>>> Scikit-learn-general@lists.sourceforge.net >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >>>> >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> Learn Graph Databases - Download FREE O'Reilly Book >>>> "Graph Databases" is the definitive new guide to graph databases and >>>> their >>>> applications. Written by three acclaimed leaders in the field, >>>> this first edition is now available. Download your free book today! >>>> http://p.sf.net/sfu/13534_NeoTech >>>> _______________________________________________ >>>> Scikit-learn-general mailing list >>>> Scikit-learn-general@lists.sourceforge.net >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >>>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Learn Graph Databases - Download FREE O'Reilly Book >>> "Graph Databases" is the definitive new guide to graph databases and their >>> applications. Written by three acclaimed leaders in the field, >>> this first edition is now available. Download your free book today! >>> http://p.sf.net/sfu/13534_NeoTech >>> _______________________________________________ >>> Scikit-learn-general mailing list >>> Scikit-learn-general@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >>> >> >> >> ------------------------------------------------------------------------------ >> Learn Graph Databases - Download FREE O'Reilly Book >> "Graph Databases" is the definitive new guide to graph databases and their >> applications. Written by three acclaimed leaders in the field, >> this first edition is now available. Download your free book today! >> http://p.sf.net/sfu/13534_NeoTech >> _______________________________________________ >> Scikit-learn-general mailing list >> Scikit-learn-general@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >> > > ------------------------------------------------------------------------------ > Learn Graph Databases - Download FREE O'Reilly Book > "Graph Databases" is the definitive new guide to graph databases and their > applications. Written by three acclaimed leaders in the field, > this first edition is now available. Download your free book today! > http://p.sf.net/sfu/13534_NeoTech > _______________________________________________ > Scikit-learn-general mailing list > Scikit-learn-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general ------------------------------------------------------------------------------ Learn Graph Databases - Download FREE O'Reilly Book "Graph Databases" is the definitive new guide to graph databases and their applications. Written by three acclaimed leaders in the field, this first edition is now available. Download your free book today! http://p.sf.net/sfu/13534_NeoTech _______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general