Waiting for some feedback.
Have done some more groundwork; I'll drop a draft proposal in a day or two.

Thanks & Regards,
Aman Pandey

On Fri, 27 Mar 2020, 8:34 pm Aman Pandey, <[email protected]> wrote:
>
> Hi Ryan/Marcus,
> I just saw the 2017 GSoC work on parallelisation by Shikhar Bhardwaj:
>
> https://www.mlpack.org/gsocblog/profiling-for-parallelization-and-parallel-stochastic-optimization-methods-summary.html
>
> He has done impressive work, and the documentation is excellent.
> ---------------------
>
> In this email I'll cover:
>
>  - which algorithms I'll be implementing;
>  - my thoughts on parallelisation;
>  - my rough plan to complete the proposed work in mlpack;
>  - a thought about adding something like *Federated Learning* to
>    mlpack (this could be very complex, though!).
>
> I want a clearer understanding of what I am going to do, so please check
> whether I am planning correctly and whether my approach is feasible.
>
> I will focus on the following algorithms during the GSoC period:
>  1) Random Forest
>  2) KNN
>  3) a few gradient boosting algorithms
>
> I can either parallelise the algorithms according to their computation
> tasks (e.g. for Random Forest, training its N trees in parallel; see the
> first sketch below) or distribute the work using MapReduce or other
> distributed computation frameworks
> (https://stackoverflow.com/a/8727235/9982106 lists a few). MapReduce only
> works well when very little data moves between machines, and only a few
> times; *this is a reason to look at better alternatives*. For example,
> after each iteration a derivative-based method has to compute the
> gradient over the complete training data, which in general means moving
> the complete data to a single machine; as the number of iterations grows,
> this becomes a bottleneck (see the second sketch below).
>
> I am in favour of working with OpenMP before trying any such thing.
>
> Something similar happens with tree-based algorithms, where the splits
> have to be computed repeatedly over the complete data.
>
> I would follow this "rough" timeline (I have tried to keep it simple and
> realistic):
>
> 1) Profiling algorithms to find their bottlenecks, training on a variety
>    of example datasets (small to large, which makes a big difference in
>    the amount of computation) - *Weeks 1-2*
> 2) Working on at least one gradient boosting algorithm, to check that my
>    approach is sound and fully in accordance with mlpack; in parallel,
>    profiling and designing parallelism for Random Forest - *Weeks 2-3*
> 3) Working on Random Forest and KNN - *Weeks 4-8*
> 4) Exploring distributed computing alternatives to MapReduce. If this
>    works well, it could turn mlpack into a genuinely *distributed*
>    library. However, working on different algorithms with different
>    distributed computation techniques in an ad-hoc way could introduce
>    inconsistency into mlpack's development. (*I still have to be sure
>    about this.*)
>
> ----------------- *An additional idea* -----------------
> I don't know if this has been discussed before, as I have been away from
> mlpack for almost a year. Have you ever thought of adding FEDERATED
> LEARNING support to mlpack? Something like PySyft
> (https://github.com/OpenMined/PySyft) support could bring a tremendous
> improvement to mlpack, and would really help people working on large
> deep learning models, as well as researchers.
>
> Please let me know if we can discuss this idea!
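>
> To make the Random Forest point concrete, here is a minimal sketch of
> the tree-level parallelism I have in mind, using plain OpenMP. The Tree
> type and TrainSingleTree() are hypothetical stand-ins of my own, not
> mlpack's actual API; the point is only that the trees are independent
> given their bootstrap samples, so the training loop can become a
> parallel for.
>
>   #include <omp.h>
>   #include <cstdio>
>   #include <vector>
>
>   // Hypothetical stand-ins for a trained tree and its training routine;
>   // in mlpack the real work would be done by the existing decision
>   // tree code.
>   struct Tree { unsigned seed; };
>
>   Tree TrainSingleTree(const std::vector<double>& data, unsigned seed)
>   {
>     // ... bootstrap-sample `data` using `seed` and grow a tree ...
>     return Tree{ seed };
>   }
>
>   std::vector<Tree> TrainForest(const std::vector<double>& data,
>                                 int numTrees)
>   {
>     std::vector<Tree> forest(numTrees);
>
>     // Each iteration only reads the shared data and writes its own
>     // slot, so no synchronisation is needed inside the loop; dynamic
>     // scheduling helps because tree training times vary with the
>     // bootstrap sample.
>     #pragma omp parallel for schedule(dynamic)
>     for (int i = 0; i < numTrees; ++i)
>       forest[i] = TrainSingleTree(data, (unsigned) i);
>
>     return forest;
>   }
>
>   int main()
>   {
>     std::vector<double> data(1000, 1.0);
>     std::vector<Tree> forest = TrainForest(data, 16);
>     std::printf("trained %zu trees\n", forest.size());
>   }
>
> (Compile with -fopenmp. A real implementation would also have to handle
> random seeds carefully so that results stay reproducible.)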
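>
> And for the gradient bottleneck mentioned above, this is the kind of
> shared-memory reduction I would start with before considering any
> distributed setup. It is only a toy squared-error gradient for a linear
> model (the function and names are mine, not mlpack's): each thread
> accumulates a private partial gradient over its share of the points,
> and the partials are then combined.
>
>   #include <cstddef>
>   #include <cstdio>
>   #include <omp.h>
>   #include <vector>
>
>   // Full-data gradient of 0.5 * sum_i (w . x_i - y_i)^2, computed with
>   // a manual OpenMP reduction over the data points.
>   std::vector<double> Gradient(const std::vector<std::vector<double>>& X,
>                                const std::vector<double>& y,
>                                const std::vector<double>& w)
>   {
>     const std::size_t d = w.size();
>     std::vector<double> grad(d, 0.0);
>
>     #pragma omp parallel
>     {
>       std::vector<double> local(d, 0.0);  // per-thread partial gradient
>
>       #pragma omp for nowait
>       for (long i = 0; i < (long) X.size(); ++i)
>       {
>         double residual = -y[i];  // accumulates w . x_i - y_i
>         for (std::size_t j = 0; j < d; ++j)
>           residual += w[j] * X[i][j];
>         for (std::size_t j = 0; j < d; ++j)
>           local[j] += residual * X[i][j];
>       }
>
>       // Combine the per-thread partials into the shared gradient.
>       #pragma omp critical
>       for (std::size_t j = 0; j < d; ++j)
>         grad[j] += local[j];
>     }
>
>     return grad;
>   }
>
>   int main()
>   {
>     std::vector<std::vector<double>> X = { { 1.0, 2.0 }, { 3.0, 4.0 } };
>     std::vector<double> y = { 1.0, 2.0 }, w = { 0.5, -0.5 };
>     for (double g : Gradient(X, y, w))
>       std::printf("%f ", g);
>     std::printf("\n");
>   }
>
> In a distributed setting the same structure becomes a map step
> (per-machine partial gradients) followed by a reduce step (summing
> them), which is exactly where the per-iteration data-movement cost
> comes from.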
>
> --------------------------------------
> The reason I chose mlpack is that I know its codebase, having applied to
> mlpack last year, and of course the team is awesome; I have always found
> good support from everyone here.
>
> Also, amid this COVID-19 situation, I *will not* be able to complete my
> earned internship at *NUS-Singapore*, so I need something of that level
> to work on and make use of this summer.
> I am very comfortable with any kind of code; as an example, I worked on
> completely unfamiliar Haskell code while working as an undergraduate
> researcher at IITK (one of the finest CSE departments in India).
> In addition, my knowledge of advanced C++ should help me be quick and
> efficient.
>
> I have started drafting a proposal. Please let me know your thoughts.
>
> I will update you soon, within the next two days.
>
> ----
> Please be safe!
> Looking forward to a wonderful experience with mlpack. :)
>
> *Aman Pandey*
> amanpandey.codes <http://amanpandey.codes>
>
> On Mon, Mar 16, 2020 at 7:52 PM Aman Pandey <[email protected]> wrote:
>
>> Hi Ryan,
>> I think that is enough information. Thanks a lot.
>> I tried for mlpack last year, on QGMM; unfortunately, I couldn't make it.
>>
>> I will try once again, with a possibly better proposal. ;)
>> In parallelisation this time.
>>
>> Thanks,
>> Aman Pandey
>> GitHub username: johnsoncarl
>>
>> On Mon, Mar 16, 2020 at 7:33 PM Ryan Curtin <[email protected]> wrote:
>>
>>> On Sun, Mar 15, 2020 at 12:38:09PM +0530, Aman Pandey wrote:
>>> > Hey Ryan/Marcus,
>>> > Are there any current coordinates to start with, in "Profiling for
>>> > Parallelization"?
>>> > I want to know if any, to avoid any redundant work.
>>>
>>> Hey Aman,
>>>
>>> I don't think that there are any particular directions. You could
>>> consider looking at previous messages from previous years in the
>>> mailing list archives (this project has been proposed in the past and
>>> there has been some discussion). My suggestion would be to find some
>>> algorithms that you think could be useful to parallelize, and spend
>>> some time thinking about the right way to do that with OpenMP. The
>>> "profiling" part may come in useful here, as when you put your
>>> proposal together it could be useful to find algorithms that have
>>> bottlenecks that could be easily resolved with parallelism. (Note
>>> that not all algorithms have bottlenecks that can be solved with
>>> parallelism, and algorithms heavy on linear algebra may already be
>>> effectively parallelized via the use of OpenBLAS at a lower level.)
>>>
>>> Thanks,
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Curtin      | "I was misinformed."
>>> [email protected] |   - Rick Blaine
>>
>> --
>> Aman Pandey
>> Junior Undergraduate, Bachelors of Technology
>> Sardar Vallabhbhai National Institute of Technology,
>> Surat, Gujarat, India. 395007
>> Webpage: https://johnsoncarl.github.io/aboutme/
>> LinkedIn: https://www.linkedin.com/in/amnpandey/
>
> --
> Aman Pandey
> Junior Undergraduate, Bachelors of Technology
> Sardar Vallabhbhai National Institute of Technology,
> Surat, Gujarat, India. 395007
> Webpage: https://johnsoncarl.github.io/aboutme/
> LinkedIn: https://www.linkedin.com/in/amnpandey/
