Hi Ryan/Marcus,

I just saw the 2017 GSoC work on parallelisation by Shikhar Bhardwaj: https://www.mlpack.org/gsocblog/profiling-for-parallelization-and-parallel-stochastic-optimization-methods-summary.html
He has done impressive work, and the documentation is excellent, I must say.

---------------------

In this email, I'll cover:
- Which algorithms I'll be implementing
- My thoughts on parallelisation
- My rough plan to complete the proposed work in mlpack
- An additional idea: adding something like federated learning to mlpack (this could be very complex, though!)

I want a clearer understanding of what I am going to do, so please check whether I am planning correctly and whether my approach is feasible.

I will focus on the following algorithms during my GSoC period:
1) Random forest
2) KNN
3) A few gradient boosting algorithms

Broadly, I can either parallelise the algorithms according to their computational tasks (e.g., in random forest, I can try training its N trees in parallel; a minimal sketch of this idea follows further below), or distribute the tasks via MapReduce or other distributed-computation frameworks (https://stackoverflow.com/a/8727235/9982106 lists a few of them well).

MapReduce only works well when small amounts of data move between machines, and only a few times; this could be a reason to look at better alternatives. For example, after each iteration, derivative-based methods have to compute the gradient over the complete training data, which in general requires moving all of the data to a single machine. As the number of iterations grows, this becomes a bottleneck. Something similar can happen with tree-based algorithms, where splits have to be computed over the complete data repeatedly. So I am in favour of working with OpenMP before trying anything like that.

I would follow this rough timeline (I have tried to keep it neither complex nor unrealistic):
1) Profiling the algorithms to find their bottlenecks, training on a variety of example datasets (small to large, which makes a big difference in the computations) - Weeks 1-2
2) Working on at least one gradient boosting algorithm, to check that my approach is sound and fully consistent with mlpack; in parallel, profiling and designing the parallelism for random forest - Weeks 2-3
3) Working on random forest and KNN - Weeks 4-8
4) Building out distributed-computing alternatives to MapReduce. If this works well, it could turn mlpack into an actual "distributed killer". However, working on different algorithms with different distributed-computation techniques in an ad hoc way may introduce randomness into mlpack's development. (I still have to be sure about this part.)

-----------------
An additional idea
-----------------

I don't know if this has been discussed before, as I have been away from mlpack for almost a year. Have you ever thought of adding federated learning support to mlpack? Something like PySyft (https://github.com/OpenMined/PySyft) could bring a tremendous improvement to mlpack; it would really help people working on large-scale deep learning, as well as researchers. Please let me know if we can discuss this idea!

--------------------------------------

The reason I chose mlpack is that I know its codebase, as I applied to mlpack last year, and of course the team is awesome; I have always found good support from everyone here. Also, amid this COVID-19 situation, I will not be able to take up the internship I earned at NUS Singapore, so I need something of that level to work on and make use of this summer. I am very comfortable with any kind of code; as an example, I worked on completely unfamiliar Haskell code while working as an undergraduate researcher at IITK (one of the finest CSE departments in India).
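To make the OpenMP idea above concrete, here is a minimal, self-contained sketch of training a random forest's trees in parallel. The Tree type and its Train() method are hypothetical stand-ins, not mlpack's actual DecisionTree API; the point is only that each tree fits an independent bootstrap sample, so the per-tree loop shares no mutable state and a single parallel-for pragma suffices:

    // forest_sketch.cpp -- compile with: g++ -fopenmp forest_sketch.cpp
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for a decision tree; NOT mlpack's API.
    struct Tree
    {
      void Train(const unsigned seed)
      {
        // Placeholder for the real work: each tree would draw its own
        // bootstrap sample (using `seed`) and fit to that sample.
        (void) seed;
      }
    };

    int main()
    {
      std::vector<Tree> forest(100);

      // Iterations are independent (no shared mutable state), so the
      // loop is embarrassingly parallel.
      #pragma omp parallel for schedule(dynamic)
      for (int i = 0; i < (int) forest.size(); ++i)
        forest[i].Train(/* seed */ (unsigned) i);

      std::printf("Trained %zu trees on up to %d threads.\n",
                  forest.size(), omp_get_max_threads());
    }

The same pattern should apply to KNN's query side as well, since each query point's neighbor search is independent of the others.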
Plus, having knowledge of advanced C++ can help me be quick and efficient. I have started drafting a proposal; please let me know your thoughts. I will send an update within the next two days.

Please be safe! Looking forward to a wonderful experience with mlpack. :)

Aman Pandey
amanpandey.codes <http://amanpandey.codes>

On Mon, Mar 16, 2020 at 7:52 PM Aman Pandey <[email protected]> wrote:
> Hi Ryan,
> I think that is enough information.
> Thanks a lot.
> I tried MLPACK, the last year, on QGMM, unfortunately, couldn't make it.
>
> Will try once again, with a possibly better proposal. ;)
> In parallelisation this time.
>
> Thanks.
> Aman Pandey
> GitHub Username: johnsoncarl
>
> On Mon, Mar 16, 2020 at 7:33 PM Ryan Curtin <[email protected]> wrote:
>
>> On Sun, Mar 15, 2020 at 12:38:09PM +0530, Aman Pandey wrote:
>> > Hey Ryan/Marcus,
>> > Are there any current coordinates to start with, in "Profiling for
>> > Parallelization"?
>> > I want to know if any, to avoid any redundant work.
>>
>> Hey Aman,
>>
>> I don't think that there are any particular directions. You could
>> consider looking at previous messages from previous years in the mailing
>> list archives (this project has been proposed in the past and there has
>> been some discussion). My suggestion would be to find some algorithms
>> that you think could be useful to parallelize, and spend some time
>> thinking about the right way to do that with OpenMP. The "profiling"
>> part may come in useful here, as when you put your proposal together it
>> could be useful to find algorithms that have bottlenecks that could be
>> easily resolved with parallelism. (Note that not all algorithms have
>> bottlenecks that can be solved with parallelism, and algorithms heavy on
>> linear algebra may already be effectively parallelized via the use of
>> OpenBLAS at a lower level.)
>>
>> Thanks,
>>
>> Ryan
>>
>> --
>> Ryan Curtin | "I was misinformed."
>> [email protected] | - Rick Blaine
>
>
> --
> Aman Pandey
> Junior Undergraduate, Bachelors of Technology
> Sardar Vallabhbhai National Institute of Technology,
> Surat, Gujarat, India. 395007
> Webpage: https://johnsoncarl.github.io/aboutme/
> LinkedIn: https://www.linkedin.com/in/amnpandey/

--
Aman Pandey
Junior Undergraduate, Bachelors of Technology
Sardar Vallabhbhai National Institute of Technology,
Surat, Gujarat, India. 395007
Webpage: https://johnsoncarl.github.io/aboutme/
LinkedIn: https://www.linkedin.com/in/amnpandey/
