Hi guys,

I've created a JIRA issue for this now -
https://issues.apache.org/jira/browse/MAHOUT-1153

Right now we're working on my fork but will make an early patch once
we can. Hopefully that will provide a basis for others who are
interested to add to it.

Andy


On 22 February 2013 08:49, Andy Twigg <[email protected]> wrote:
> Hi Marty,
>
> I would suggest doing each tree independently, one after the other.
> Have each node select a random sample w/replacement of its data, and
> let the master choose the random subset of features to split with. I'm
> sure there are more efficient ways, but we can think of them later.
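In sketch form, the two ingredients — a per-tree sample with replacement and a per-split random feature subset — might look like this (plain Python with made-up helper names, not the eventual Mahout API):

```python
import random

def bootstrap_sample(rows, rng):
    # Bagging step: draw len(rows) rows with replacement.
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    # At each split, consider only k of the n_features candidate features.
    return rng.sample(range(n_features), k)

rng = random.Random(42)
rows = [([i, i * 2], i % 2) for i in range(10)]  # toy (features, label) pairs
sample = bootstrap_sample(rows, rng)             # data for one tree
features = random_feature_subset(8, 3, rng)      # features for one split
```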
>
> -Andy
>
>
> On 22 February 2013 01:47, Marty Kube <[email protected]> wrote:
>> On the algorithm choice, Parallel Boosted Regression Trees looks interesting
>> as well.
>>
>> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
>>
>> After reading that paper I had a question about the Streaming Parallel
>> Decision Trees.  The SPDT paper is for building one decision tree.  I was
>> thinking about how to extend that to a decision forest.  I could see how one
>> can build many trees in the master and use a random choice of the splitting
>> variables (as opposed to checking all variables and using the best
>> variable). However, it seems like the idea of sampling the dataset with
>> replacement for each tree gets lost.  Is that okay?
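One standard way to keep per-tree sampling with replacement in a single streaming pass is Oza and Russell's online bagging trick: give each incoming record an independent Poisson(1) weight per tree, which approximates the bootstrap without rereading the data. A sketch (plain Python, illustrative names only):

```python
import math
import random

def poisson1(rng):
    # Draw from Poisson(lambda=1) via Knuth's multiplication method.
    limit = math.exp(-1.0)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def online_bagging_weights(stream, n_trees, rng):
    # One pass: each record gets an independent Poisson(1) weight per tree,
    # approximating a separate bootstrap sample for every tree.
    for record in stream:
        yield record, [poisson1(rng) for _ in range(n_trees)]
```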
>>
>>
>> On 02/21/2013 03:19 AM, Ted Dunning wrote:
>>>
>>> For quantile estimation, consider also stream-lib at
>>> https://github.com/clearspring/stream-lib
>>>
>>> The bigmlcom implementation looks more directly applicable, actually.
>>>
>>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]> wrote:
>>>
>>>     Even better, there is already a good implementation of the histograms:
>>>     https://github.com/bigmlcom/histogram
>>>
>>>     -Andy
>>>
>>>
>>>     On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
>>>     > That's a winner...
>>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>>>     looks most
>>>     > likely.  In batch mode it uses one pass over the data set, it
>>>     can be used in
>>>     > a streaming mode, and has constant space and time requirements.
>>>      That seems
>>>     > like the kind of scalable algorithm we're after.
>>>     > I'm in!
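For reference, the core of the Ben-Haim/Yom-Tov data structure is small: keep at most B (centroid, count) bins, add each point as a singleton bin, and merge the two closest bins whenever the budget is exceeded. A toy sketch (plain Python, not the bigml implementation):

```python
import bisect

class StreamingHistogram:
    """Fixed-size histogram in the style of Ben-Haim & Yom-Tov (2010)."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, x):
        # Each point enters as its own bin, then we shrink back to budget.
        bisect.insort(self.bins, [float(x), 1])
        self._trim()

    def merge(self, other):
        # Combining two histograms (e.g. from two mappers) is the same trick.
        for b in other.bins:
            bisect.insort(self.bins, list(b))
        self._trim()

    def _trim(self):
        while len(self.bins) > self.max_bins:
            # Merge the adjacent pair with the smallest centroid gap,
            # replacing it with the count-weighted average.
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (c1, m1), (c2, m2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * m1 + c2 * m2) / (m1 + m2), m1 + m2]]
```

Space and per-update time depend only on the bin budget, which is where the one-pass, constant-memory behaviour comes from.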
>>>     >
>>>     >
>>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>>     >>
>>>     >> Alternatively, the algorithm described in [1] is more
>>>     straightforward,
>>>     >> efficient, hadoop-compatible (using only mappers communicating to a
>>>     >> master) and satisfies all our requirements so far. I would like to
>>>     >> take a pass at implementing that, if anyone else is interested?
>>>     >>
>>>     >> [1]
>>>     http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
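The communication pattern in [1] — mappers each compress their shard into a small summary and send only that to the master — can be caricatured in a few lines. Here a rounded count table stands in for the paper's fixed-size histograms; all names are made up:

```python
from collections import Counter

def mapper_summary(shard, ndigits=1):
    # Each mapper reduces its shard to a compact value -> count table.
    return Counter(round(x, ndigits) for x in shard)

def master_merge(summaries):
    # The master sees only the small summaries, never the raw records,
    # so its work is independent of the total data size.
    total = Counter()
    for s in summaries:
        total.update(s)
    return total

shards = [[0.11, 0.12, 0.53], [0.49, 0.52]]
merged = master_merge(mapper_summary(s) for s in shards)
```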
>>>     >>
>>>     >>
>>>     >> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
>>>     >>>
>>>     >>> Why don't we start from
>>>     >>>
>>>     >>> https://github.com/ashenfad/hadooptree ?
>>>     >>>
>>>     >>> On 20 February 2013 13:25, Marty Kube <[email protected]> wrote:
>>>     >>>>
>>>     >>>> Hi Lorenz,
>>>     >>>>
>>>     >>>> Very interesting, that's what I was asking for when I
>>>     mentioned non-MR
>>>     >>>> implementations :-)
>>>     >>>>
>>>     >>>> I have not looked at spark before, interesting that it uses
>>>     Mesos for
>>>     >>>> clustering.   I'll check it out.
>>>     >>>>
>>>     >>>>
>>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>     >>>>>
>>>     >>>>> Hi Marty,
>>>     >>>>>
>>>     >>>>> I am currently working on a PLANET-like implementation on
>>>     top of spark:
>>>     >>>>> http://spark-project.org
>>>     >>>>>
>>>     >>>>> I think this framework is a nice fit for the problem.
>>>     >>>>> If the input data fits into the "total cluster memory" you
>>>     benefit from
>>>     >>>>> the caching of the RDDs.
>>>     >>>>>
>>>     >>>>> regards,
>>>     >>>>>
>>>     >>>>> lorenz
>>>     >>>>>
>>>     >>>>>
>>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]> wrote:
>>>     >>>>>
>>>     >>>>>> You had mentioned other "resource management" platforms
>>>     like Giraph or
>>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was thinking of other
>>>     >>>>>> parallelization frameworks.
>>>     >>>>>>
>>>     >>>>>> It's interesting that the PLANET folks thought it was really
>>>     >>>>>> worthwhile
>>>     >>>>>> working on top of map reduce for all of the resource
>>>     management that
>>>     >>>>>> is
>>>     >>>>>> built in.
>>>     >>>>>>
>>>     >>>>>>
>>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>     >>>>>>>
>>>     >>>>>>> If non-MR means map-only job with communicating mappers
>>>     and a state
>>>     >>>>>>> store,
>>>     >>>>>>> I am down with that.
>>>     >>>>>>>
>>>     >>>>>>> What did you mean?
>>>     >>>>>>>
>>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <[email protected]> wrote:
>>>     >>>>>>>
>>>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
>>>     non-MR
>>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>>>     solution?
>>>     >>>>>>>>
>>>     >>>
>>>     >>>
>>>     >>> --
>>>     >>> Dr Andy Twigg
>>>     >>> Junior Research Fellow, St Johns College, Oxford
>>>     >>> Room 351, Department of Computer Science
>>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>     >>> [email protected] <mailto:[email protected]> |
>>>     +447799647538 <tel:%2B447799647538>
>>>
>>>     >>
>>>     >>
>>>     >>
>>>
>>>     >
>>>     >
>>>
>>>
>>>
>>>
>>>
>>
>
>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
