Re: [gsoc] random forests

Ted Dunning Tue, 31 Mar 2009 18:30:52 -0700

Deneche,

I don't see your application on the GSOC web site.  Nor on the apache wiki.


Time is running out and I would hate to not see you in the program.  Is it
just that I can't see the application yet?

On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim <a_dene...@yahoo.fr>wrote:

>
> Here is a draft of my proposal
>
> **************************************************
> Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests
>
> Student: AbdelHakim Deneche
> Student e-mail: ...
>
> Student Major: Phd in Computer Science
> Student Degree: Master in Computer Science
> Student Graduation: Spring 2011
>
> Organization: The Apache Software Foundation
> Assigned Mentor:
>
>
> Abstract:
>
> My goal is to add the power of random/regression forests to Mahout. At the
> end of this summer one should be able to build random/regression forests for
> large, possibly, distributed datasets, store the forest and reuse it to
> classify new data. In addition, a demo on EC2 is planned.
>
>
> Detailed Description:
>
> This project is all about random/regression forests. The core component is
> the tree building algorithm from a random bootstrap from the whole dataset.
> I already wrote a detailed description on Mahout Wiki [RandomForests]. Given
> the size of the dataset, two distributed implementation are possible:
>
> 1. The most straightforward one deals with relatively small datasets. By
> small, I mean a dataset that can be replicated on every node of the cluster.
> Basically, each mapper has access to the whole dataset, so if the forest
> contains N trees and we have M mappers, each mapper runs the core building
> algorithm N/M times. This implementation is, relatively, easy because each
> mapper runs the basic building algorithm "as it is". It is also of great
> interest if the user wants to "try" different parameters when building the
> forest. An out-of-core implementation is also possible to deal with datasets
> that cannot fit into the node memory.
>
> 2. The second implementation, which is the most difficult, is concerned
> with very large datasets that cannot fit in every machine of the cluster. In
> this case the mappers work differently, each mapper has access to a subset
> from the dataset, thus all the mappers collaborate to build each tree of the
> forest. The core building algorithm must thus be rewritten in a map-reduce
> form. This implementation can deal with datasets of any size, as long as
> they are on the cluster.
>
> Although the first implementation is easier to implement, the CPU and IO
> overhead of the out-of-core implementation are still unknown. A reference,
> non-parallel, implementation should thus be built to better understand the
> effects of the out-of-core implementation, especially for large datasets.
> This reference implementation is also usefull to asses the correctness of
> the distributed implementation.
>
>
> Working Plan and list of deliverables
>
> Must-Have:
> 1. reference implementation of Random/Regression Forests Building
> Algorithm:
>  . Build a forest of trees, the basic algorithm (described in the wiki)
> takes a subset from the dataset as a training set and builds a decision
> tree. This algorithm is repeated for each tree of the forest.
>  . The forest is stored in a file, this way it can be re-used, at any time,
> to classify new cases
>  . At this step, the necessary changes to Mahout's Classifier interface are
> made to extend its use to more than Text datasets.
>
> 2. Study the effects of large datasets on the reference implementation
>  . This step should guide our choice of the proper parallel implementation
>
> 3. Parallel implementation, choose one of the following:
>  3a. Parallel implementation A
>  . When the dataset can be replicated to all computing nodes.
>  . Each mapper has access to the whole dataset, if the forest contains N
> trees and we have M mappers, each mapper runs the basic building algorithm
> N/M times. The mapper if also responsible of computing the out-of-bag error
> estimation.
>  . The reducer store the trees in the RF file, and merges the oob error
> estimations.
>  3b. Parallel implementation B:
>  . When the dataset is so big that it can no longer fit on every computing
> node, it must be distributed over the cluster.
>  . Each mapper has access to a subset from the dataset, thus all the
> mappers collaborate to build each tree of the forest.
>  . In this case, the basic algorithm must be rewritten to fit in the
> map-reduce paradigm.
>
> Should-Have:
> 4. Run the Random Forest with a real dataset on EC2:
>  . This step is important, because running the RF on a local dual core
> machine is different from running it on a real cluster with a real dataset.
>  . This can make a good demo for Mahout
>  . Amazon has put some interesting datasets to play with [PublicDatasets].
>   The US Census dataset comes in various sizes ranging from 2Go to 200Go,
> and should make a very good example.
>  . At this stage it may be useful to implement [MAHOUT-71] (Dataset to
> Matrix Reader).
>
> Wanna-Have:
> 5. If there is still time, implement one or two other important features of
> RFs such as Variable importance and Proximity estimation
>
>
> Additional Information:
> I am a PhD student at the University Mentouri of Constantine. My primary
> research goal is a framework to help build Intelligent Adaptive Systems. For
> the purpose of my Master, I worked on Artificial Immune Systems. I applied
> them to handwritten digits recognition [PatternRecognition] and Muliple
> Sequence Alignement (bioinformatics) [BioInformatics].
> Last year I participated in the GSoC as an Apache student, I integrated an
> Evolutionary Computing Framework to Mahout [Mahout-56].
>
> References
>
> [BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by
> Immune Artificial System",
> ACS/IEEE International Conference on Computer Systems and Applications
> AICCSA’07, Jordan 2007.
>
> [Mahout-56] https://issues.apache.org/jira/browse/MAHOUT-56
>
> [Mahout-71] http://issues.apache.org/jira/browse/MAHOUT-71
>
> [PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an
> Artificial Immune System with a Clustering Method for Effective Pattern
> Recognition",
> International Conference on Machine Intelligence ICMI’05, pp. 782-787,
> Tunis 2005.
>
> [PublicDatasets] http://aws.amazon.com/publicdatasets/
>
> [RandomForests] http://cwiki.apache.org/MAHOUT/random-forests.html
>
> ***************************************************************************
>
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: [gsoc] random forests

Reply via email to