Re: Out-of-core random forest implementation

Andy Twigg Fri, 22 Feb 2013 00:50:43 -0800

Hi Marty,

I would suggest doing each tree independently, one after the other.
Have each node select a random sample w/replacement of its data, and
let the master choose the random subset of features to split with. I'm
sure there are more efficient ways, but we can think of them later.


-Andy


On 22 February 2013 01:47, Marty Kube <[email protected]> wrote:
> On the algorithm choice, Parallel Boosted Regression Trees looks interesting
> as well.
>
> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
>
> After reading that paper I had a question about the Streaming Parallel
> Decision Trees.  The SPDT paper is for building one decision tree.  I was
> thinking about how to extend that to a decision forest.  I could see how one
> can build many trees in the master and use a random choice of the splitting
> variables (as opposed to checking all variables and using the best
> variable). However, it seems like the idea of sampling the dataset with
> replacement for each tree gets lost.  Is that okay?
>
>
> On 02/21/2013 03:19 AM, Ted Dunning wrote:
>>
>> For quantile estimation, consider also streamlib at
>> https://github.com/clearspring/stream-lib
>>
>> The bigmlcom implementation looks more directly applicable, actually.
>>
>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     Even better, there is already a good implementation of the histograms:
>>     https://github.com/bigmlcom/histogram
>>
>>     -Andy
>>
>>
>>     On 20 February 2013 22:50, Marty Kube <[email protected]
>>     <mailto:[email protected]>> wrote:
>>     > That's a winner...
>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>>     looks most
>>     > likely.  In batch mode it uses one pass over the data set, it
>>     can be used in
>>     > a streaming mode, and has constant space and time requirements.
>>      That seems
>>     > like the kind of scalable algorithm we're after.
>>     > I'm in!
>>     >
>>     >
>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>     >>
>>     >> Alternatively, the algorithm described in [1] is more
>>     straightforward,
>>     >> efficient, hadoop-compatible (using only mappers communicating to a
>>     >> master) and satisfies all our requirements so far. I would like to
>>     >> take a pass at implementing that, if anyone else is interested?
>>     >>
>>     >> [1]
>>     http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>     >>
>>     >>
>>     >> On 20 February 2013 14:27, Andy Twigg <[email protected]
>>     <mailto:[email protected]>> wrote:
>>     >>>
>>     >>> Why don't we start from
>>     >>>
>>     >>> https://github.com/ashenfad/hadooptree ?
>>     >>>
>>     >>> On 20 February 2013 13:25, Marty Kube
>>     <[email protected] <mailto:[email protected]>>
>>
>>     >>> wrote:
>>     >>>>
>>     >>>> Hi Lorenz,
>>     >>>>
>>     >>>> Very interesting, that's what I was asking for when I
>>     mentioned non-MR
>>     >>>> implementations :-)
>>     >>>>
>>     >>>> I have not looked at spark before, interesting that it uses
>>     Mesos for
>>     >>>> clustering.   I'll check it out.
>>     >>>>
>>     >>>>
>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>     >>>>>
>>     >>>>> Hi Marty,
>>     >>>>>
>>     >>>>> i am currently working on a PLANET-like implementation on
>>     top of spark:
>>     >>>>> http://spark-project.org
>>     >>>>>
>>     >>>>> I think this framework is a nice fit for the problem.
>>     >>>>> If the input data fits into the "total cluster memory" you
>>     benefit from
>>     >>>>> the caching of the RDD's.
>>     >>>>>
>>     >>>>> regards,
>>     >>>>>
>>     >>>>> lorenz
>>     >>>>>
>>     >>>>>
>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>>     <[email protected] <mailto:[email protected]>>
>>
>>     >>>>> wrote:
>>     >>>>>
>>     >>>>>> You had mentioned other "resource management" platforms
>>     like Giraph or
>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think
>>     of other
>>     >>>>>> parallelization frameworks.
>>     >>>>>>
>>     >>>>>> It's interesting that the planet folks thought it was really
>>     >>>>>> worthwhile
>>     >>>>>> working on top of map reduce for all of the resource
>>     management that
>>     >>>>>> is
>>     >>>>>> built in.
>>     >>>>>>
>>     >>>>>>
>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>     >>>>>>>
>>     >>>>>>> If non-MR means map-only job with communicating mappers
>>     and a state
>>     >>>>>>> store,
>>     >>>>>>> I am down with that.
>>     >>>>>>>
>>     >>>>>>> What did you mean?
>>     >>>>>>>
>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>     >>>>>>> [email protected]
>>     <mailto:[email protected]>> wrote:
>>     >>>>>>>
>>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
>>     non-MR
>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>>     solution?
>>     >>>>>>>>
>>     >>>
>>     >>>
>>     >>> --
>>     >>> Dr Andy Twigg
>>     >>> Junior Research Fellow, St Johns College, Oxford
>>     >>> Room 351, Department of Computer Science
>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>     >>> [email protected] <mailto:[email protected]> |
>>     +447799647538 <tel:%2B447799647538>
>>
>>     >>
>>     >>
>>     >>
>>     >> --
>>     >> Dr Andy Twigg
>>     >> Junior Research Fellow, St Johns College, Oxford
>>     >> Room 351, Department of Computer Science
>>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
>>     >> [email protected] <mailto:[email protected]> |
>>     +447799647538 <tel:%2B447799647538>
>>
>>     >
>>     >
>>
>>
>>
>>     --
>>     Dr Andy Twigg
>>     Junior Research Fellow, St Johns College, Oxford
>>     Room 351, Department of Computer Science
>>     http://www.cs.ox.ac.uk/people/andy.twigg/
>>     [email protected] <mailto:[email protected]> |
>>     +447799647538 <tel:%2B447799647538>
>>
>>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538

Re: Out-of-core random forest implementation

Reply via email to