Thanks for the feedback.

On Wed, Mar 9, 2011 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Mar 8, 2011, at 4:45 AM, Pararth Shah wrote:
>
> > Hi,
> > I am Pararth Shah, an undergraduate student in Computer Science and
> > Engineering at the Indian Institute of Technology, Bombay. I am planning to
> > submit a GSOC proposal, for Mahout. Considering Grant Ingersoll's reply in a
> > previous thread, I would like to "focus on benchmarking, examples and
> > documentation of existing capabilities." Here is a rough list of ideas that
> > came to my mind, while I was familiarizing myself with Mahout, through the
> > code, documentation and wiki:
> >
> > 1) Build a set of benchmarking tools tailored to Mahout, similar to the
> > Lucene benchmarking contrib[1], which benchmarks Lucene using "standard,
> > freely available corpora".
>
> Awesome. See also MAHOUT-588
>

There is a lot of benchmarking data already contributed on MAHOUT-588. I'll
look into ways of organizing it so that it can be more meaningful. I am
currently reading more on the Lucene contrib.

> >
> > 2) Build a profiling tool, based on Java Interactive Profiler[2], to find
> > "hotspots" in the algorithm execution. This will help in identifying
> > modifications to gain speedups. The modified algorithm can be retested using
> > above benchmarking tools to quantify the speedup obtained. I believe a
> > custom-built profiler will have advantages in terms of speed, ability to
> > filter packages/classes profiled, and possible interactivity with the user,
> > over the standard profilers like hprof, JProbe and Yourkit. What I
> > understand about profilers is mostly from reading [3]. Also, I found useful
> > information to start with building a simple profiler consisting of java
> > agent interface coupled with the ASM library, on this page [4].
>
> I don't really think it makes sense to re-invent the wheel here and I doubt
> the community has much interest in maintaining pure profiling code.
> Instead, I think it makes sense to leverage existing profilers.
>

OK, I'll look for possibilities of modifications in some existing profilers,
to suit Mahout. Can you suggest any profilers that work well with Mahout,
from experience?

> >
> > 3) Use these tools to gather detailed information about the control flow,
> > data flow, processing time, and memory usage patterns of execution of every
> > algorithm present in Mahout on certain standard datasets, and provide the
> > information on the Mahout website/wiki for analysis (white box testing[5]).
> >
> > 4) Add functionality to import databases (MySQL) into Vectors, as input for
> > clustering algorithms. This will allow more datasets to be directly used
> > with the clustering algorithms.
> >
> > 5) Update the documentation where required. For example,
> > "org.apache.mahout.classifier.bayes" and
> > "org.apache.mahout.clustering.canopy" are well documented, but it took me
> > some time to figure out "org.apache.mahout.clustering.minhash". The wiki
> > proved to be very informative in general, and I am assuming that the pages
> > that are incomplete (eg Hierarchical Clustering, Independent Component
> > Analysis) correspond to algorithms that are still work-in-progress. Writing
> > one or two more examples for each algorithm would certainly benefit
> > newcomers starting out with Mahout (eg me).
>
> This would be great.
>

MAHOUT-621 seems very interesting, it's basically what I was trying to say in
point 4 here. Do you intend to have it as a GSOC project in itself? I'll put
more focus on it in my proposal.

> >
> > 6) (I don't know if this is feasible. Please comment) Build a tool that
> > tracks the progress of an algorithm in real time during its execution,
> > depicting (graphically?) what part of the dataset is already analysed, what
> > is being currently analysed (eg.
> > which part of training set in a classifier
> > is being worked on); what is the current state of the learning algorithm (eg
> > size and number of clusters in clustering algorithms). The data collected by
> > this tool can then be further analysed (eg movement of the decision boundary
> > over the course of a classifier algorithm, before attaining its final
> > state). I believe this would be a great tool to:
> > (a) gain insights about the data set
> > (b) gain insights about the algorithm
> > (c) introduce machine learning concepts to anyone
>
> You might look into the tools that are out there for Hadoop for analyzing
> processes, etc.
>

I had a look at the Hadoop JobTracker, but I'll have to do more reading
before I can come up with anything concrete.

> >
> > These are just ideas, I wish to know which (if any) seem interesting enough,
> > and what are the possible improvements. Then I can spend the next month,
> > before submitting the proposal, working on the specifics, figuring out how I
> > may go about doing it. I am hoping I'll get enough pointers along the way,
> > to refine and prioritize these tasks to suit the community.
> >
> > My motivation is simple: I am looking forward to either pursuing graduate
> > study in, or working on solving problems that require a knowledge of, the
> > field of machine learning. I have a fair idea of the basic concepts and
> > algorithms. Spending a summer closely scrutinising, documenting and testing
> > the implementations of the many ML algorithms currently present in Mahout,
> > will be a great opportunity for me to gain a solid, breadth-first
> > understanding of a majority of ML algorithms, plus it should be fun too :)
>
> I think this is a good start. Generally speaking, many people fail to get
> selected because they bite off too much. I would encourage you to focus in
> on a few areas that you think you can do a really good job in and propose
> along those lines.
>

Thanks, I'll keep this in mind. I'll look around for more information on how
I can implement these ideas before selecting a subset to go ahead with.

> >
> > Any feedback is appreciated.
> >
> > Thanks and regards,
> > Pararth
> >
> > References:
> > [1] "Lucene Javadocs"
> > http://lucene.apache.org/java/2_9_4/api/contrib-benchmark/index.html
> > [2] "Java Interactive Profiler" http://sourceforge.net/projects/jiprof/
> > [3] "Profiling Tools" http://vast.uccs.edu/~tboult/CS330/NOTES/profilers.ppt
> > [4] "Build Your Own Profiler"
> > http://www.ibm.com/developerworks/java/library/j-jip/
> > [5] "White Box Testing" http://en.wikipedia.org/wiki/White-box_testing
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
