Re: GSOC 2011: Benchmarking, Profiling and Documentation

Pararth Shah Thu, 10 Mar 2011 04:33:01 -0800

Thanks for the feedback.

On Wed, Mar 9, 2011 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:


>
> On Mar 8, 2011, at 4:45 AM, Pararth Shah wrote:
>
> > Hi,
> > I am Pararth Shah, an undergraduate student in Computer Science and
> > Engineering at the Indian Institute of Technology, Bombay. I am planning to
> > submit a GSOC proposal, for Mahout. Considering Grant Ingersoll's reply in a
> > previous thread, I would like to "focus on benchmarking, examples and
> > documentation of existing capabilities." Here is a rough list of ideas that
> > came to my mind, while I was familiarizing myself with Mahout, through the
> > code, documentation and wiki:
> >
> > 1) Build a set of benchmarking tools tailored to Mahout, similar to the
> > Lucene benchmarking contrib[1], which benchmarks Lucene using "standard,
> > freely available corpora".
>
> Awesome.  See also MAHOUT-588
>
>
There is a lot of benchmarking data already contributed on MAHOUT-588. I'll
look into ways of organizing it so that it can be more meaningful. I am
currently reading more on the Lucene contrib.


>
> >
> > 2) Build a profiling tool, based on Java Interactive Profiler[2], to find
> > "hotspots" in the algorithm execution. This will help in identifying
> > modifications to gain speedups. The modified algorithm can be retested using
> > above benchmarking tools to quantify the speedup obtained. I believe a
> > custom-built profiler will have advantages in terms of speed, ability to
> > filter packages/classes profiled, and possible interactivity with the user,
> > over the standard profilers like hprof, JProbe and Yourkit. What I
> > understand about profilers is mostly from reading [3]. Also, I found useful
> > information to start with building a simple profiler consisting of java
> > agent interface coupled with the ASM library, on this page [4].
>
> I don't really think it makes sense to re-invent the wheel here and I doubt
> the community has much interest in maintaining pure profiling code.
> Instead, I think it makes sense to leverage existing profilers.
>
>
OK, I'll look for possibilities of modifications in some existing profilers,
to suit Mahout. Can you suggest any profilers that work well with Mahout,
from experience?


>
> >
> > 3) Use these tools to gather detailed information about the control flow,
> > data flow, processing time, and memory usage patterns of execution of every
> > algorithm present in Mahout on certain standard datasets, and provide the
> > information on the Mahout website/wiki for analysis (white box testing[5]).
> >
> > 4) Add functionality to import databases (MySQL) into Vectors, as input for
> > clustering algorithms. This will allow more datasets to be directly used
> > with the clustering algorithms.
> >
> > 5) Update the documentation where required. For example,
> > "org.apache.mahout.classifier.bayes" and
> > "org.apache.mahout.clustering.canopy" are well documented, but it took me
> > some time to figure out "org.apache.mahout.clustering.minhash". The wiki
> > proved to be very informative in general, and I am assuming that the pages
> > that are incomplete (eg Hierarchical Clustering, Independent Component
> > Analysis) correspond to algorithms that are still work-in-progress. Writing
> > one or two more examples for each algorithm would certainly benefit
> > newcomers starting out with Mahout (eg me).
>
> This would be great.
>
>
MAHOUT-621 seems very interesting, it's basically what I was trying to say in
point 4 here. Do you intend to have it as a GSOC project in itself? I'll put
more focus on it in my proposal.


>
> >
> > 6) (I don't know if this is feasible. Please comment) Build a tool that
> > tracks the progress of an algorithm in real time during its execution,
> > depicting (graphically?) what part of the dataset is already analysed, what
> > is being currently analysed (eg. which part of training set in a classifier
> > is being worked on); what is the current state of the learning algorithm (eg
> > size and number of clusters in clustering algorithms). The data collected by
> > this tool can then be further analysed (eg movement of the decision boundary
> > over the course of a classifier algorithm, before attaining its final
> > state). I believe this would be a great tool to:
> >    (a) gain insights about the data set
> >    (b) gain insights about the algorithm
> >    (c) introduce machine learning concepts to anyone
>
>
> You might look into the tools that are out there for Hadoop for analyzing
> processes, etc.
>
>
I had a look at the Hadoop JobTracker, but I'll have to do more reading
before I can come up with anything concrete.


>
> >
> > These are just ideas, I wish to know which (if any) seem interesting enough,
> > and what are the possible improvements. Then I can spend the next month,
> > before submitting the proposal, working on the specifics, figuring out how I
> > may go about doing it. I am hoping I'll get enough pointers along the way,
> > to refine and prioritize these tasks to suit the community.
> >
> > My motivation is simple: I am looking forward to either pursuing graduate
> > study in, or working on solving problems that require a knowledge of, the
> > field of machine learning. I have a fair idea of the basic concepts and
> > algorithms. Spending a summer closely scrutinising, documenting and testing
> > the implementations of the many ML algorithms currently present in Mahout,
> > will be a great opportunity for me to gain a solid, breadth-first
> > understanding of a majority of ML algorithms, plus it should be fun too :)
>
> I think this is a good start.  Generally speaking, many people fail to get
> selected because they bite off too much.  I would encourage you to focus in
> on a few areas that you think you can do a really good job in and propose
> along those lines.
>
>
Thanks, I'll keep this in mind. I'll look around for more information on how
I can implement these ideas before selecting a subset to go ahead with.


>
> > Any feedback is appreciated.
> >
> > Thanks and regards,
> > Pararth
> >
> > References:
> > [1] "Lucene Javadocs"
> > http://lucene.apache.org/java/2_9_4/api/contrib-benchmark/index.html
> > [2] "Java Interactive Profiler" http://sourceforge.net/projects/jiprof/
> > [3] "Profiling Tools"
> > http://vast.uccs.edu/~tboult/CS330/NOTES/profilers.ppt
> > [4] "Build Your Own Profiler"
> > http://www.ibm.com/developerworks/java/library/j-jip/
> > [5] "White Box Testing" http://en.wikipedia.org/wiki/White-box_testing
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
