Another variant of benchmarking is regression testing. We have a lot of algorithms that have small simple unit tests, but which have developed bugs over time.
It would be great to have a suite of tasks that download a real dataset and
expected results, run a large-scale operation, and compare the output to the
expected results.

On Thu, Mar 10, 2011 at 4:32 AM, Pararth Shah <[email protected]> wrote:

> Thanks for the feedback.
>
> On Wed, Mar 9, 2011 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:
>
>> On Mar 8, 2011, at 4:45 AM, Pararth Shah wrote:
>>
>> > Hi,
>> > I am Pararth Shah, an undergraduate student in Computer Science and
>> > Engineering at the Indian Institute of Technology, Bombay. I am
>> > planning to submit a GSOC proposal, for Mahout. Considering Grant
>> > Ingersoll's reply in a previous thread, I would like to "focus on
>> > benchmarking, examples and documentation of existing capabilities."
>> > Here is a rough list of ideas that came to my mind, while I was
>> > familiarizing myself with Mahout, through the code, documentation
>> > and wiki:
>> >
>> > 1) Build a set of benchmarking tools tailored to Mahout, similar to
>> > the Lucene benchmarking contrib[1], which benchmarks Lucene using
>> > "standard, freely available corpora".
>>
>> Awesome. See also MAHOUT-588
>
> There is a lot of benchmarking data already contributed on MAHOUT-588.
> I'll look into ways of organizing it so that it can be more meaningful.
> I am currently reading more on the Lucene contrib.
>
>> > 2) Build a profiling tool, based on Java Interactive Profiler[2], to
>> > find "hotspots" in the algorithm execution. This will help in
>> > identifying modifications to gain speedups. The modified algorithm
>> > can be retested using above benchmarking tools to quantify the
>> > speedup obtained. I believe a custom-built profiler will have
>> > advantages in terms of speed, ability to filter packages/classes
>> > profiled, and possible interactivity with the user, over the
>> > standard profilers like hprof, JProbe and Yourkit. What I understand
>> > about profilers is mostly from reading [3]. Also, I found useful
>> > information to start with building a simple profiler consisting of
>> > java agent interface coupled with the ASM library, on this page [4].
>>
>> I don't really think it makes sense to re-invent the wheel here and I
>> doubt the community has much interest in maintaining pure profiling
>> code. Instead, I think it makes sense to leverage existing profilers.
>
> OK, I'll look for possibilities of modifications in some existing
> profilers, to suit Mahout. Can you suggest any profilers that work well
> with Mahout, from experience?
>
>> > 3) Use these tools to gather detailed information about the control
>> > flow, data flow, processing time, and memory usage patterns of
>> > execution of every algorithm present in Mahout on certain standard
>> > datasets, and providing the information on the Mahout website/wiki
>> > for analysis (white box testing[5]).
>> >
>> > 4) Add functionality to import databases (MySQL) into Vectors, as
>> > input for clustering algorithms. This will allow more datasets to be
>> > directly used with the clustering algorithms.
>> >
>> > 5) Update the documentation where required. For example,
>> > "org.apache.mahout.classifier.bayes" and
>> > "org.apache.mahout.clustering.canopy" are well documented, but it
>> > took me some time to figure out
>> > "org.apache.mahout.clustering.minhash". The wiki proved to be very
>> > informative in general, and I am assuming that the pages that are
>> > incomplete (eg Hierarchical Clustering, Independant Component
>> > Analysis) correspond to algorithms that are still work-in-progress.
>> > Writing one or two more examples for each algorithm would certainly
>> > benefit newcomers starting out with Mahout (eg me).
>>
>> This would be great.
>
> MAHOUT-621 seems very interesting, its basically what I was trying to
> say in point 4 here. Do you intend to have it as a GSOC project in
> itself? I'll put more focus on it in my proposal.
>
>> > 6) (I don't know if this is feasible. Please comment) Build a tool
>> > that tracks the progress of an algorithm in real time during its
>> > execution, depicting (graphically?) what part of the dataset is
>> > already analysed, what is being currently analysed (eg. which part
>> > of training set in a classifier is being worked on); what is the
>> > current state of the learning algorithm (eg size and number of
>> > clusters in clustering algorithms). The data collected by this tool
>> > can then be further analysed (eg movement of the decision boundary
>> > over the course of a classifier algorithm, before attaining its
>> > final state). I believe this would be a great tool to:
>> > (a) gain insights about the data set
>> > (b) gain insights about the algorithm
>> > (c) introduce machine learning concepts to anyone
>>
>> You might look into the tools that are out there for Hadoop for
>> analyzing processes, etc.
>
> I had a look at the Hadoop JobTracker, but I'll have to do more reading
> before I can come up with anything concrete.
>
>> > These are just ideas, I wish to know which (if any) seem interesting
>> > enough, and what are the possible improvements. Then I can spend the
>> > next month, before submitting the proposal, working on the
>> > specifics, figuring out how I may go about doing it. I am hoping
>> > I'll get enough pointers along the way, to refine and prioritize
>> > these tasks to suit the community.
>> >
>> > My motivation is simple: I am looking forward to either pursuing
>> > graduate study in, or working on solving problems that require a
>> > knowledge of, the field of machine learning. I have a fair idea of
>> > the basic concepts and algorithms. Spending a summer closely
>> > scrutinising, documenting and testing the implementations of the
>> > many ML algorithms currently present in Mahout, will be a great
>> > opportunity for me to gain a solid, breadth-first understanding of a
>> > majority of ML algorithms, plus it should be fun too :)
>>
>> I think this is a good start. Generally speaking, many people fail to
>> get selected because they bite off too much. I would encourage you to
>> focus in on a few areas that you think you can do a really good job in
>> and propose along those lines.
>
> Thanks, I'll keep this in mind. I'll look around for more information
> on how I can implement these ideas before selecting a subset to go
> ahead with.
>
>> > Any feedback is appreciated.
>> >
>> > Thanks and regards,
>> > Pararth
>> >
>> > References:
>> > [1] "Lucene Javadocs"
>> > http://lucene.apache.org/java/2_9_4/api/contrib-benchmark/index.html
>> > [2] "Java Interactive Profiler"
>> > http://sourceforge.net/projects/jiprof/
>> > [3] "Profiling Tools"
>> > http://vast.uccs.edu/~tboult/CS330/NOTES/profilers.ppt
>> > [4] "Build Your Own Profiler"
>> > http://www.ibm.com/developerworks/java/library/j-jip/
>> > [5] "White Box Testing"
>> > http://en.wikipedia.org/wiki/White-box_testing
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com

--
Lance Norskog
[email protected]
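
P.S. To make the regression-testing idea at the top of this message a bit more concrete, here is a rough sketch in Python. It is not Mahout code. It only illustrates the three steps: fetch a dataset and its stored expected results, run an operation, and compare the output within a tolerance. The function names (`fetch`, `read_vectors`, `matches`, `run_regression`) and the one-vector-per-line text format are assumptions made for illustration, and a real suite would presumably call the actual Mahout jobs instead of a Python callable.

```python
import math
import urllib.request
from pathlib import Path


def fetch(url, dest):
    """Download url to dest unless a cached copy already exists (hypothetical helper)."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


def read_vectors(path):
    """Parse one whitespace-separated numeric vector per non-empty line.

    The file format is an assumption for this sketch, not a Mahout format.
    """
    with Path(path).open() as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def matches(actual, expected, tol=1e-6):
    """Return True when both results have the same shape and agree within tol."""
    if len(actual) != len(expected):
        return False
    for a_row, e_row in zip(actual, expected):
        if len(a_row) != len(e_row):
            return False
        for a, e in zip(a_row, e_row):
            if not math.isclose(a, e, rel_tol=tol, abs_tol=tol):
                return False
    return True


def run_regression(operation, data_path, expected_path, tol=1e-6):
    """Run operation on the dataset and compare it with the stored expected output."""
    actual = operation(read_vectors(data_path))
    return matches(actual, read_vectors(expected_path), tol)
```

Comparing within a tolerance rather than byte-for-byte is a deliberate choice in this sketch, since floating-point results from a distributed run can differ slightly between runs. Algorithms with random initialization would also need a fixed seed or a looser comparison.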
