Hi,
And of the fest implementation i would say we just rewrite the algorithm in
cython instead of writing wrappers for it, this dosn't need licence and is
readable for other contributors and also easily changeable in future. This
boils down to a few 500 lines of code in cython.
On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik
<kaushik.varana...@gmail.com>wrote:
> hi arnaud,
>
> yeah i have already gone through that implementation. I feel at least we
> should have separate splitter classes for the node splitting to be more
> efficient and flexible.
> And as of the choice of matrix, I don't have one in mind but mostly the
> elements are sorted along the feature columns. so I would say, csr maybe.
> But there can be
> other efficient formats too.
>
> And as of the input matrices, we can follow the same procedure followed in
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
>
> so as to not break the user code. And moreover i have read somewhere on
> the internet about sparse coding format, is it possible to use a more
> efficient sparse structure within the tree algorithm. But the input types
> remain the same, they get converted to another sparse structure once they
> are passed to the tree, since we are working with non-zero element
> pointers, this should not be memory/time inefficient. But my question is,
> is this acceptable?
>
>
> On Mon, Mar 17, 2014 at 6:49 PM, Arnaud Joly <a.j...@ulg.ac.be> wrote:
>
>> Hi,
>>
>>
>> The support for sparse matrices should exploit as much as possible the
>> sparsity structure of the matrix
>> without blowing up memory as to obtain optimal performance. Note that the
>> choice of the best sparse matrix
>> representation is still open (csc, csr, coo, ... matrix?).
>>
>> To my knowldege, there is one implementation of sparse decision tree
>> https://github.com/fest/fest. But I don't think that the license is BSD
>> compatible.
>>
>> If input sparsity is added, adaboost, GBRT, bagging and random
>> forest should support sparse input data.
>>
>> Best,
>> Arnaud
>>
>>
>> On 16 Mar 2014, at 02:49, vamsi kaushik <kaushik.varana...@gmail.com>
>> wrote:
>>
>> Hi arnaud, Eltermann
>>
>> Thanks for the reply. Firstly I am a noob here so excuse me for any
>> stupid questions. Eltermann Just my 2 cents, I felt your approach(in
>> BestSplitter ) somewhat inefficient when you are calling the
>> get_sparse_item. Firstly even though we have indices of all the non zero
>> items, it is being called node_samples times, iterating over nnz times. An
>> easier way of doing this would be to sort the samples in sparse matrix
>> along features and then extracting if necessary. because the whole point in
>> using Xf is to sort them which becomes difficult when they are in an n-d
>> array. like for example
>>
>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
>> gives the non zero elements. My solution might not be perfect but my
>> point is when we are adding support for sparse matrices, why not exploit
>> their structure as much as we can. Arnaud is this feasible ? Eltermann
>> anything wrong in my thinking ?
>>
>>
>> cheers,
>> kaushik varanasi
>> On Wed, Mar 12, 2014 at 8:29 PM, Arnaud Joly <a.j...@ulg.ac.be> wrote:
>>
>>> For the number of contributions, I would advise you to do as many
>>> as possible to get familiar with the codebase and scikit-learn practices.
>>> Of course starting to think and to work on how to add sparsity support
>>> to decision tree
>>> is important to grasp the challenges and write a convincing proposal.
>>> Thus, the right proportion is difficult to suggest.
>>>
>>> The goal of this GSOC is to have an efficient and a maintenable
>>> implementation.
>>> The overall project plan should be dictated by this goal.
>>>
>>> Best,
>>> Arnaud
>>>
>>> On 12 Mar 2014, at 13:44, Gilles Louppe <g.lou...@gmail.com> wrote:
>>>
>>> > On 12 March 2014 13:08, Felipe Eltermann <felipe.elterm...@gmail.com>
>>> wrote:
>>> >> Hello Vamsi,
>>> >>
>>> >> "Firstly, regarding the implementation of sparse functions. _tree.pxy
>>> is the
>>> >> back end cython code to handle the operations Splitting, Evaluating
>>> >> impurities at nodes and then constructing the tree."
>>> >> That's right.
>>> >>
>>> >> "The basic implementation that i have in mind is that we duplicate the
>>> >> splitter classes and The tree architecture itself."
>>> >> Duplicating the code doesn't seem to be the best approach. It will be
>>> >> simpler if the same splitter class should handle both dense and sparse
>>> >> matrices.
>>> >> I started a PR that is already in progress:
>>> >> https://github.com/scikit-learn/scikit-learn/pull/2848
>>> >> If you want to help, you can read up the commits and help on the open
>>> >> issues.
>>> >> (however, I don't think this apply as a GSoC project though)
>>> >
>>> > Just my 2 cents, but I am not sure having both dense and sparse
>>> > support in the same codebase is the best choice. Both implementations
>>> > should in my opinion be separated, with the common code factored out
>>> > to avoid code dupplication.
>>> >
>>> > In particular, having an *efficient* sparse splitter will require
>>> > specific changes to the splitter. I am really not sure having this
>>> > code intermingled with the dense support is the best option.
>>> >
>>> >>
>>> >> On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
>>> kaushik.varana...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> hi Joly,
>>> >>>
>>> >>> I actually have a few questions:
>>> >>>
>>> >>> Firstly, regarding the implementation of sparse functions. _tree.pxy
>>> is
>>> >>> the back end cython code to handle the operations Splitting,
>>> Evaluating
>>> >>> impurities at nodes and then constructing the tree. The basic
>>> implementation
>>> >>> that i have in mind is that we duplicate the splitter classes and
>>> The tree
>>> >>> architecture itself. At each leaf node we might have a sparse matrix
>>> >>> representing the output data and at every non-leaf node, splitting
>>> should be
>>> >>> done on the sparse matrix at that node and produce few other nodes
>>> with
>>> >>> sparse output. So the project boils down to implementing good access
>>> and
>>> >>> construction methods for sparse matrices in cython. I just want to
>>> know if
>>> >>> this is a possible implementation criterion ?. And i would also like
>>> to know
>>> >>> the implementation plan that you have in mind ?
>>> >>>
>>> >>> Secondly, In my previous mail i have shown you my contributions. I
>>> would
>>> >>> just like to know if the patches are upto the mark for a selection.
>>> If they
>>> >>> are, i will turn my attention to the DT sparse problem, else i will
>>> work on
>>> >>> the issues a bit more.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Tue, Mar 11, 2014 at 1:12 PM, Arnaud Joly <a.j...@ulg.ac.be>
>>> wrote:
>>> >>>>
>>> >>>> Thanks for your contribution.
>>> >>>>
>>> >>>> Keep up!
>>> >>>>
>>> >>>> Arnaud
>>> >>>>
>>> >>>> On 10 Mar 2014, at 23:51, vamsi kaushik <
>>> kaushik.varana...@gmail.com>
>>> >>>> wrote:
>>> >>>>
>>> >>>> My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
>>> >>>>
>>> >>>> And i have already contributed a little.
>>> >>>> Please take a look at :
>>> >>>>
>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2905
>>> >>>>
>>> >>>> and also:
>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2895
>>> >>>> https://github.com/scikit-learn/scikit-learn/issues/2727
>>> >>>>
>>> >>>> cheers,
>>> >>>> kaushik
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
>>> >>>> <kaushik.varana...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> hi Arnaud,
>>> >>>>>
>>> >>>>> Vamsi kaushik is actually me.
>>> >>>>> Thanks for your reply, i'll get to the issue soon
>>> >>>>>
>>> >>>>> cheers,
>>> >>>>> vamsi kaushik
>>> >>>>>
>>> >>>>>
>>> >>>>> On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly <a.j...@ulg.ac.be>
>>> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> Anything concerning the GSOC should pass by the scikit-learn
>>> >>>>>> mailing list.
>>> >>>>>>
>>> >>>>>> Thanks for your interest in the subject. If you intend to apply
>>> for a
>>> >>>>>> GSOC, I suggest you to read
>>> >>>>>>
>>> >>>>>>
>>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
>>> >>>>>> and start contributing to scikit-learn.
>>> >>>>>>
>>> >>>>>> Right now, three people have shown interest for this topic:
>>> Maheshakya
>>> >>>>>> (who is now
>>> >>>>>> applying for a GSOC about LSH), Vamsi Kaushik and you. Several
>>> >>>>>> candidates may apply for the
>>> >>>>>> same subject. However it is likely that if a GSOC is awarded for a
>>> >>>>>> given subject, only the best proposal will
>>> >>>>>> be selected.
>>> >>>>>>
>>> >>>>>> Best regards,
>>> >>>>>> Arnaud
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On 28 Feb 2014, at 22:33, João <joaopalo...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Hi Arnaud,
>>> >>>>>>
>>> >>>>>> While browsing the current GSoC projects I saw an interesting one
>>> for
>>> >>>>>> which you are assigned as mentor: "Add scipy.sparse matrix
>>> support to the
>>> >>>>>> Decision Tree".
>>> >>>>>>
>>> >>>>>> I am considering applying for this project as I have already
>>> faced the
>>> >>>>>> necessity of sparse matrices in sklearn and the outcome was not
>>> totally
>>> >>>>>> satisfactory.
>>> >>>>>> Right now I am participating in a kaggle contest
>>> >>>>>> (http://www.kaggle.com/c/lshtc/) and facing several difficulties
>>> even with
>>> >>>>>> algorithms that already support spase matrices (even simple
>>> algorithms such
>>> >>>>>> as NB). In general I get some memory error and sometimes
>>> segfaults.
>>> >>>>>> I would be happy if I could implement the necessary support for
>>> DT (and
>>> >>>>>> related algorithms such as random forest and extra trees) and, if
>>> the time
>>> >>>>>> constraint allows, improve as much as possible the general
>>> support for
>>> >>>>>> sparse matrices.
>>> >>>>>>
>>> >>>>>> With this email, I want to show my interest and ask you if there
>>> are
>>> >>>>>> already any candidates for this place.
>>> >>>>>>
>>> >>>>>> Best regards,
>>> >>>>>> --
>>> >>>>>> João
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> ------------------------------------------------------------------------------
>>> >>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> >>>>>> "Graph Databases" is the definitive new guide to graph databases
>>> and
>>> >>>>>> their
>>> >>>>>> applications. Written by three acclaimed leaders in the field,
>>> >>>>>> this first edition is now available. Download your free book
>>> today!
>>> >>>>>> http://p.sf.net/sfu/13534_NeoTech
>>> >>>>>> _______________________________________________
>>> >>>>>> Scikit-learn-general mailing list
>>> >>>>>> Scikit-learn-general@lists.sourceforge.net
>>> >>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> ------------------------------------------------------------------------------
>>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> >>>> "Graph Databases" is the definitive new guide to graph databases and
>>> >>>> their
>>> >>>> applications. Written by three acclaimed leaders in the field,
>>> >>>> this first edition is now available. Download your free book today!
>>> >>>>
>>> >>>>
>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>>> >>>> Scikit-learn-general mailing list
>>> >>>> Scikit-learn-general@lists.sourceforge.net
>>> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> ------------------------------------------------------------------------------
>>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> >>>> "Graph Databases" is the definitive new guide to graph databases and
>>> >>>> their
>>> >>>> applications. Written by three acclaimed leaders in the field,
>>> >>>> this first edition is now available. Download your free book today!
>>> >>>> http://p.sf.net/sfu/13534_NeoTech
>>> >>>> _______________________________________________
>>> >>>> Scikit-learn-general mailing list
>>> >>>> Scikit-learn-general@lists.sourceforge.net
>>> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> ------------------------------------------------------------------------------
>>> >>> Learn Graph Databases - Download FREE O'Reilly Book
>>> >>> "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> >>> applications. Written by three acclaimed leaders in the field,
>>> >>> this first edition is now available. Download your free book today!
>>> >>> http://p.sf.net/sfu/13534_NeoTech
>>> >>> _______________________________________________
>>> >>> Scikit-learn-general mailing list
>>> >>> Scikit-learn-general@lists.sourceforge.net
>>> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>> >>>
>>> >>
>>> >>
>>> >>
>>> ------------------------------------------------------------------------------
>>> >> Learn Graph Databases - Download FREE O'Reilly Book
>>> >> "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> >> applications. Written by three acclaimed leaders in the field,
>>> >> this first edition is now available. Download your free book today!
>>> >> http://p.sf.net/sfu/13534_NeoTech
>>> >> _______________________________________________
>>> >> Scikit-learn-general mailing list
>>> >> Scikit-learn-general@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>> >>
>>> >
>>> >
>>> ------------------------------------------------------------------------------
>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>> > "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> > applications. Written by three acclaimed leaders in the field,
>>> > this first edition is now available. Download your free book today!
>>> > http://p.sf.net/sfu/13534_NeoTech
>>> > _______________________________________________
>>> > Scikit-learn-general mailing list
>>> > Scikit-learn-general@lists.sourceforge.net
>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> applications. Written by three acclaimed leaders in the field,
>>> this first edition is now available. Download your free book today!
>>> http://p.sf.net/sfu/13534_NeoTech
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>>
>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general