GSOC 2011: Benchmarking, Profiling and Documentation

Pararth Shah Tue, 08 Mar 2011 01:46:13 -0800

Hi,
I am Pararth Shah, an undergraduate student in Computer Science and
Engineering at the Indian Institute of Technology, Bombay. I am planning to
submit a GSOC proposal for Mahout. Considering Grant Ingersoll's reply in a
previous thread, I would like to "focus on benchmarking, examples and
documentation of existing capabilities." Here is a rough list of ideas that
came to mind while I was familiarizing myself with Mahout through the code,
documentation and wiki:


1) Build a set of benchmarking tools tailored to Mahout, similar to the
Lucene benchmarking contrib[1], which benchmarks Lucene using "standard,
freely available corpora".

2) Build a profiling tool, based on the Java Interactive Profiler [2], to
find "hotspots" in the algorithm execution. This will help in identifying
modifications that yield speedups. The modified algorithm can be retested
using the benchmarking tools above to quantify the speedup obtained. I
believe a custom-built profiler will have advantages in terms of speed,
ability to filter the packages/classes profiled, and possible interactivity
with the user, over standard profilers like hprof, JProbe and YourKit. What
I understand about profilers is mostly from reading [3]. I also found useful
information on building a simple profiler, consisting of a Java agent
interface coupled with the ASM library, on this page [4].

3) Use these tools to gather detailed information about the control flow,
data flow, processing time, and memory usage patterns of every algorithm
present in Mahout when run on certain standard datasets, and publish that
information on the Mahout website/wiki for analysis (white box testing [5]).

4) Add functionality to import databases (MySQL) into Vectors, as input for
clustering algorithms. This will allow more datasets to be directly used
with the clustering algorithms.

5) Update the documentation where required. For example,
"org.apache.mahout.classifier.bayes" and
"org.apache.mahout.clustering.canopy" are well documented, but it took me
some time to figure out "org.apache.mahout.clustering.minhash". The wiki
proved to be very informative in general, and I am assuming that the pages
that are incomplete (e.g. Hierarchical Clustering, Independent Component
Analysis) correspond to algorithms that are still work in progress. Writing
one or two more examples for each algorithm would certainly benefit
newcomers starting out with Mahout (e.g. me).

6) (I don't know if this is feasible. Please comment.) Build a tool that
tracks the progress of an algorithm in real time during its execution,
depicting (graphically?) what part of the dataset has already been analysed
and what is currently being analysed (e.g. which part of the training set a
classifier is working on), as well as the current state of the learning
algorithm (e.g. size and number of clusters in clustering algorithms). The
data collected by this tool can then be analysed further (e.g. movement of
the decision boundary over the course of a classifier algorithm, before it
attains its final state). I believe this would be a great tool to:
    (a) gain insights about the data set
    (b) gain insights about the algorithm
    (c) introduce machine learning concepts to anyone
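
To make the "benchmark, modify, retest" loop from ideas 1 and 2 concrete,
here is a minimal sketch of a timing harness. It is only an illustration
under stated assumptions: the class MicroBench and its methods are
hypothetical names, not part of Mahout or of the Lucene benchmark contrib,
and the task being timed is any Runnable standing in for an algorithm run.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not an existing Mahout class. */
final class MicroBench {
    private MicroBench() {}

    /**
     * Runs the task `warmup` times untimed (to let the JIT settle), then
     * returns the wall-clock nanoseconds of each of `runs` timed executions.
     */
    static long[] timeRuns(Runnable task, int warmup, int runs) {
        if (warmup < 0 || runs <= 0) {
            throw new IllegalArgumentException("warmup must be >= 0 and runs > 0");
        }
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Median of the samples; less sensitive to outliers than the mean. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    /** Ratio of median baseline time to median candidate time (> 1 means faster). */
    static double speedup(long[] baseline, long[] candidate) {
        return (double) median(baseline) / median(candidate);
    }
}
```

A modified algorithm would then be compared against the original with
something like MicroBench.speedup(MicroBench.timeRuns(original, 5, 20),
MicroBench.timeRuns(modified, 5, 20)). A real tool would also need fixed
datasets and would have to account for Hadoop job overhead, which this
sketch ignores.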


These are just ideas; I would like to know which (if any) seem interesting
enough, and how they could be improved. I can then spend the next month,
before submitting the proposal, working out the specifics and figuring out
how I might go about implementing them. I am hoping to get enough pointers
along the way to refine and prioritize these tasks to suit the community.

My motivation is simple: I am looking forward to either pursuing graduate
study in machine learning or working on problems that require knowledge of
the field. I have a fair idea of the basic concepts and algorithms. Spending
a summer closely scrutinising, documenting and testing the implementations
of the many ML algorithms currently present in Mahout will be a great
opportunity for me to gain a solid, breadth-first understanding of a
majority of ML algorithms, plus it should be fun too :)

Any feedback is appreciated.

Thanks and regards,
Pararth

References:
[1] "Lucene Javadocs"
http://lucene.apache.org/java/2_9_4/api/contrib-benchmark/index.html
[2] "Java Interactive Profiler" http://sourceforge.net/projects/jiprof/
[3] "Profiling Tools" http://vast.uccs.edu/~tboult/CS330/NOTES/profilers.ppt
[4] "Build Your Own Profiler"
http://www.ibm.com/developerworks/java/library/j-jip/
[5] "White Box Testing" http://en.wikipedia.org/wiki/White-box_testing
