Deneche, I don't see your application on the GSOC web site. Nor on the apache wiki.
Time is running out and I would hate to not see you in the program. Is it just that I can't see the application yet? On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim <a_dene...@yahoo.fr>wrote: > > Here is a draft of my proposal > > ************************************************** > Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests > > Student: AbdelHakim Deneche > Student e-mail: ... > > Student Major: Phd in Computer Science > Student Degree: Master in Computer Science > Student Graduation: Spring 2011 > > Organization: The Apache Software Foundation > Assigned Mentor: > > > Abstract: > > My goal is to add the power of random/regression forests to Mahout. At the > end of this summer one should be able to build random/regression forests for > large, possibly, distributed datasets, store the forest and reuse it to > classify new data. In addition, a demo on EC2 is planned. > > > Detailed Description: > > This project is all about random/regression forests. The core component is > the tree building algorithm from a random bootstrap from the whole dataset. > I already wrote a detailed description on Mahout Wiki [RandomForests]. Given > the size of the dataset, two distributed implementation are possible: > > 1. The most straightforward one deals with relatively small datasets. By > small, I mean a dataset that can be replicated on every node of the cluster. > Basically, each mapper has access to the whole dataset, so if the forest > contains N trees and we have M mappers, each mapper runs the core building > algorithm N/M times. This implementation is, relatively, easy because each > mapper runs the basic building algorithm "as it is". It is also of great > interest if the user wants to "try" different parameters when building the > forest. An out-of-core implementation is also possible to deal with datasets > that cannot fit into the node memory. > > 2. The second implementation, which is the most difficult, is concerned > with very large datasets that cannot fit in every machine of the cluster. In > this case the mappers work differently, each mapper has access to a subset > from the dataset, thus all the mappers collaborate to build each tree of the > forest. The core building algorithm must thus be rewritten in a map-reduce > form. This implementation can deal with datasets of any size, as long as > they are on the cluster. > > Although the first implementation is easier to implement, the CPU and IO > overhead of the out-of-core implementation are still unknown. A reference, > non-parallel, implementation should thus be built to better understand the > effects of the out-of-core implementation, especially for large datasets. > This reference implementation is also usefull to asses the correctness of > the distributed implementation. > > > Working Plan and list of deliverables > > Must-Have: > 1. reference implementation of Random/Regression Forests Building > Algorithm: > . Build a forest of trees, the basic algorithm (described in the wiki) > takes a subset from the dataset as a training set and builds a decision > tree. This algorithm is repeated for each tree of the forest. > . The forest is stored in a file, this way it can be re-used, at any time, > to classify new cases > . At this step, the necessary changes to Mahout's Classifier interface are > made to extend its use to more than Text datasets. > > 2. Study the effects of large datasets on the reference implementation > . This step should guide our choice of the proper parallel implementation > > 3. Parallel implementation, choose one of the following: > 3a. Parallel implementation A > . When the dataset can be replicated to all computing nodes. > . Each mapper has access to the whole dataset, if the forest contains N > trees and we have M mappers, each mapper runs the basic building algorithm > N/M times. The mapper if also responsible of computing the out-of-bag error > estimation. > . The reducer store the trees in the RF file, and merges the oob error > estimations. > 3b. Parallel implementation B: > . When the dataset is so big that it can no longer fit on every computing > node, it must be distributed over the cluster. > . Each mapper has access to a subset from the dataset, thus all the > mappers collaborate to build each tree of the forest. > . In this case, the basic algorithm must be rewritten to fit in the > map-reduce paradigm. > > Should-Have: > 4. Run the Random Forest with a real dataset on EC2: > . This step is important, because running the RF on a local dual core > machine is different from running it on a real cluster with a real dataset. > . This can make a good demo for Mahout > . Amazon has put some interesting datasets to play with [PublicDatasets]. > The US Census dataset comes in various sizes ranging from 2Go to 200Go, > and should make a very good example. > . At this stage it may be useful to implement [MAHOUT-71] (Dataset to > Matrix Reader). > > Wanna-Have: > 5. If there is still time, implement one or two other important features of > RFs such as Variable importance and Proximity estimation > > > Additional Information: > I am a PhD student at the University Mentouri of Constantine. My primary > research goal is a framework to help build Intelligent Adaptive Systems. For > the purpose of my Master, I worked on Artificial Immune Systems. I applied > them to handwritten digits recognition [PatternRecognition] and Muliple > Sequence Alignement (bioinformatics) [BioInformatics]. > Last year I participated in the GSoC as an Apache student, I integrated an > Evolutionary Computing Framework to Mahout [Mahout-56]. > > References > > [BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by > Immune Artificial System", > ACS/IEEE International Conference on Computer Systems and Applications > AICCSA’07, Jordan 2007. > > [Mahout-56] https://issues.apache.org/jira/browse/MAHOUT-56 > > [Mahout-71] http://issues.apache.org/jira/browse/MAHOUT-71 > > [PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an > Artificial Immune System with a Clustering Method for Effective Pattern > Recognition", > International Conference on Machine Intelligence ICMI’05, pp. 782-787, > Tunis 2005. > > [PublicDatasets] http://aws.amazon.com/publicdatasets/ > > [RandomForests] http://cwiki.apache.org/MAHOUT/random-forests.html > > *************************************************************************** > > > > -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 858-414-0013 (m) 408-773-0220 (fax)