Re: GSOC 2011: Benchmarking, Profiling and Documentation

Lance Norskog Thu, 10 Mar 2011 17:15:38 -0800

Another variant of benchmarking is regression testing. Many of our
algorithms have small, simple unit tests but have still developed
bugs over time.


It would be great to have a suite of tasks that download a real
dataset and expected results, run a large-scale operation, and compare
the output to the expected results.
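
As a rough illustration of the comparison step only, here is a minimal
Java sketch. The class name RegressionCheck, its compare method, the
per-key numeric result format, and the tolerance are all assumptions
made for this example; none of them are existing Mahout classes, and
the download and job-run steps are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of Mahout. */
public class RegressionCheck {

    /**
     * Compares actual results against expected results, key by key.
     * Returns one line per problem; an empty list means the run passed.
     */
    public static List<String> compare(Map<String, Double> expected,
                                       Map<String, Double> actual,
                                       double tolerance) {
        List<String> failures = new ArrayList<>();
        // Sorted copy so the report order is stable between runs.
        for (Map.Entry<String, Double> e : new TreeMap<>(expected).entrySet()) {
            Double got = actual.get(e.getKey());
            if (got == null) {
                failures.add("missing: " + e.getKey());
            } else if (!(Math.abs(got - e.getValue()) <= tolerance)) {
                // Written as a negated <= so that NaN also counts as a mismatch.
                failures.add("mismatch: " + e.getKey()
                        + " expected " + e.getValue() + " got " + got);
            }
        }
        // Keys the job produced that the expected results do not mention.
        for (String key : new TreeMap<>(actual).keySet()) {
            if (!expected.containsKey(key)) {
                failures.add("unexpected: " + key);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Made-up values standing in for a stored baseline and a fresh run.
        Map<String, Double> expected = Map.of("cluster-0", 1.50, "cluster-1", 3.25);
        Map<String, Double> actual = Map.of("cluster-0", 1.50, "cluster-1", 3.40);
        compare(expected, actual, 0.01).forEach(System.out::println);
    }
}
```

A real suite would presumably load the expected map from the
downloaded results file and the actual map from the job output, and
fail the build whenever the returned list is non-empty.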

On Thu, Mar 10, 2011 at 4:32 AM, Pararth Shah <[email protected]> wrote:
> Thanks for the feedback.
>
> On Wed, Mar 9, 2011 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On Mar 8, 2011, at 4:45 AM, Pararth Shah wrote:
>>
>> > Hi,
>> > I am Pararth Shah, an undergraduate student in Computer Science and
>> > Engineering at the Indian Institute of Technology, Bombay. I am
>> > planning to submit a GSOC proposal, for Mahout. Considering Grant
>> > Ingersoll's reply in a previous thread, I would like to "focus on
>> > benchmarking, examples and documentation of existing capabilities."
>> > Here is a rough list of ideas that came to my mind, while I was
>> > familiarizing myself with Mahout, through the code, documentation
>> > and wiki:
>> >
>> > 1) Build a set of benchmarking tools tailored to Mahout, similar to the
>> > Lucene benchmarking contrib[1], which benchmarks Lucene using "standard,
>> > freely available corpora".
>>
>> Awesome.  See also MAHOUT-588
>>
>>
> There is a lot of benchmarking data already contributed on MAHOUT-588. I'll
> look into ways of organizing it so that it can be more meaningful. I am
> currently reading more on the Lucene contrib.
>
>
>>
>> >
>> > 2) Build a profiling tool, based on Java Interactive Profiler[2], to
>> > find "hotspots" in the algorithm execution. This will help in
>> > identifying modifications to gain speedups. The modified algorithm
>> > can be retested using above benchmarking tools to quantify the
>> > speedup obtained. I believe a custom-built profiler will have
>> > advantages in terms of speed, ability to filter packages/classes
>> > profiled, and possible interactivity with the user, over the
>> > standard profilers like hprof, JProbe and Yourkit. What I understand
>> > about profilers is mostly from reading [3]. Also, I found useful
>> > information to start with building a simple profiler consisting of
>> > java agent interface coupled with the ASM library, on this page [4].
>>
>> I don't really think it makes sense to re-invent the wheel here and I doubt
>> the community has much interest in maintaining pure profiling code.
>>  Instead, I think it makes sense to leverage existing profilers.
>>
>>
> OK, I'll look for possibilities of modifications in some existing profilers,
> to suit Mahout. Can you suggest any profilers that work well with Mahout,
> from experience?
>
>
>>
>> >
>> > 3) Use these tools to gather detailed information about the control
>> > flow, data flow, processing time, and memory usage patterns of
>> > execution of every algorithm present in Mahout on certain standard
>> > datasets, and providing the information on the Mahout website/wiki
>> > for analysis (white box testing[5]).
>> >
>> > 4) Add functionality to import databases (MySQL) into Vectors, as
>> > input for clustering algorithms. This will allow more datasets to be
>> > directly used with the clustering algorithms.
>> >
>> > 5) Update the documentation where required. For example,
>> > "org.apache.mahout.classifier.bayes" and
>> > "org.apache.mahout.clustering.canopy" are well documented, but it
>> > took me some time to figure out
>> > "org.apache.mahout.clustering.minhash". The wiki proved to be very
>> > informative in general, and I am assuming that the pages that are
>> > incomplete (eg Hierarchical Clustering, Independent Component
>> > Analysis) correspond to algorithms that are still work-in-progress.
>> > Writing one or two more examples for each algorithm would certainly
>> > benefit newcomers starting out with Mahout (eg me).
>>
>> This would be great.
>>
>>
> MAHOUT-621 seems very interesting, it's basically what I was trying to say in
> point 4 here. Do you intend to have it as a GSOC project in itself? I'll put
> more focus on it in my proposal.
>
>
>>
>> >
>> > 6) (I don't know if this is feasible. Please comment) Build a tool
>> > that tracks the progress of an algorithm in real time during its
>> > execution, depicting (graphically?) what part of the dataset is
>> > already analysed, what is being currently analysed (eg. which part
>> > of training set in a classifier is being worked on); what is the
>> > current state of the learning algorithm (eg size and number of
>> > clusters in clustering algorithms). The data collected by this tool
>> > can then be further analysed (eg movement of the decision boundary
>> > over the course of a classifier algorithm, before attaining its
>> > final state). I believe this would be a great tool to:
>> >    (a) gain insights about the data set
>> >    (b) gain insights about the algorithm
>> >    (c) introduce machine learning concepts to anyone
>>
>>
>> You might look into the tools that are out there for Hadoop for analyzing
>> processes, etc.
>>
>>
> I had a look at the Hadoop JobTracker, but I'll have to do more reading
> before I can come up with anything concrete.
>
>
>>  >
>> >
>> > These are just ideas, I wish to know which (if any) seem interesting
>> > enough, and what are the possible improvements. Then I can spend the
>> > next month, before submitting the proposal, working on the
>> > specifics, figuring out how I may go about doing it. I am hoping
>> > I'll get enough pointers along the way, to refine and prioritize
>> > these tasks to suit the community.
>> >
>> > My motivation is simple: I am looking forward to either pursuing
>> > graduate study in, or working on solving problems that require a
>> > knowledge of, the field of machine learning. I have a fair idea of
>> > the basic concepts and algorithms. Spending a summer closely
>> > scrutinising, documenting and testing the implementations of the
>> > many ML algorithms currently present in Mahout, will be a great
>> > opportunity for me to gain a solid, breadth-first understanding of a
>> > majority of ML algorithms, plus it should be fun too :)
>>
>> I think this is a good start.  Generally speaking, many people fail to get
>> selected because they bite off too much.  I would encourage you to focus in
>> on a few areas that you think you can do a really good job in and propose
>> along those lines.
>>
>>
> Thanks, I'll keep this in mind. I'll look around for more information on how
> I can implement these ideas before selecting a subset to go ahead with.
>
>
>>  >
>> > Any feedback is appreciated.
>> >
>> > Thanks and regards,
>> > Pararth
>> >
>> > References:
>> > [1] "Lucene Javadocs"
>> > http://lucene.apache.org/java/2_9_4/api/contrib-benchmark/index.html
>> > [2] "Java Interactive Profiler" http://sourceforge.net/projects/jiprof/
>> > [3] "Profiling Tools"
>> > http://vast.uccs.edu/~tboult/CS330/NOTES/profilers.ppt
>> > [4] "Build Your Own Profiler"
>> > http://www.ibm.com/developerworks/java/library/j-jip/
>> > [5] "White Box Testing" http://en.wikipedia.org/wiki/White-box_testing
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>>
>



-- 
Lance Norskog
[email protected]
