On Mon, Jun 05, 2017 at 05:50:08PM +0100, Yannis Mentekidis wrote:
> Hi guys,
>
> Shikhar is working on his project to profile different mlpack
> algorithms and identify potential bottlenecks he could then
> parallelize. He's found a paper
> (https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf)
> which adapts the MapReduce paradigm for certain algorithms, including
> Naive Bayes, so he started by profiling that algorithm.
>
> However, he and I have been struggling to actually find a dataset that
> makes the algorithm take a significant amount of time. The time the
> mlpack::data::Load() function takes is 2-3 orders of magnitude larger
> than that of the Train() and Classify() functions.
>
> We were wondering:
>
>  - Has anybody come across any use cases where NBC is slow enough to
>    be worth parallelizing?
>  - Does anyone have any tips on profiling the algorithm so that data
>    loading is ignored, letting us focus on the things we can actually
>    improve?
NBC basically takes only one pass over the data, so either you have to
find a gigantic dataset that takes a long time to pass over, or maybe a
dataset with a very large number of classes (so that the model itself
takes up a large amount of space). The UCI HIGGS dataset might be
useful:

  https://archive.ics.uci.edu/ml/datasets/HIGGS

but it's only two-class.

You can save some time with loading by loading a binary matrix file;
CSV takes a long time to parse. You could write some simple code to
convert:

  arma::mat dataset;
  data::Load("big_file.csv", dataset);
  data::Save("big_file.bin", dataset);

That should reduce the amount of runtime devoted to building the
matrix.

Another option would be to just generate a very, very large random
matrix in memory; that sidesteps data::Load() entirely, so only Train()
and Classify() show up in the profile (a rough sketch follows below).

--
Ryan Curtin    |
[email protected] | "Death is the road to awe."
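A minimal, untested sketch of that random-matrix approach, assuming the
mlpack 2.x API (NaiveBayesClassifier<> trains in its constructor) and
Armadillo's wall_clock timer; the dimensions and class count here are
arbitrary placeholders to scale up as needed:

  #include <iostream>
  #include <mlpack/core.hpp>
  #include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>

  using namespace mlpack;

  int main()
  {
    const size_t dims = 100;
    const size_t points = 5000000; // ~4 GB of doubles at 100 dims.
    const size_t classes = 50;     // More classes -> a larger model.

    // Columns are points, following the Armadillo/mlpack convention.
    arma::mat dataset(dims, points, arma::fill::randu);
    arma::Row<size_t> labels = arma::randi<arma::Row<size_t>>(
        points, arma::distr_param(0, (int) classes - 1));

    arma::wall_clock timer;

    // The constructor trains the model, so time it directly.
    timer.tic();
    naive_bayes::NaiveBayesClassifier<> nbc(dataset, labels, classes);
    std::cout << "Train():    " << timer.toc() << "s" << std::endl;

    arma::Row<size_t> predictions;
    timer.tic();
    nbc.Classify(dataset, predictions);
    std::cout << "Classify(): " << timer.toc() << "s" << std::endl;

    return 0;
  }

Since no file is ever read, data::Load() never enters the picture, and
the matrix dimensions can be scaled until Train() and Classify()
dominate the profile.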
