Re: Out-of-core random forest implementation

Ted Dunning Thu, 21 Feb 2013 00:20:38 -0800

For quantile estimation, consider also stream-lib at
https://github.com/clearspring/stream-lib


The bigmlcom implementation looks more directly applicable, actually.
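
For anyone skimming the thread, here is a minimal Python sketch of the
fixed-size streaming histogram idea from the Ben-Haim paper linked as [1] in
the quoted messages: keep at most B (centroid, count) bins, add each new point
as its own bin, and collapse the two closest bins whenever the limit is
exceeded. The names (StreamingHistogram, update, merge, max_bins) are made up
for illustration; this is not the API of the bigmlcom or stream-lib libraries,
and it leaves out the paper's quantile/split-point procedures.

```python
import bisect


class StreamingHistogram:
    """Fixed-size histogram sketch: at most max_bins [centroid, count] pairs."""

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # kept sorted by centroid

    def update(self, x):
        """Add one point: insert it as a bin of count 1, then shrink."""
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1          # exact hit on an existing centroid
        else:
            self.bins.insert(i, [x, 1])   # new singleton bin
        self._shrink()

    def merge(self, other):
        """Combine two histograms (e.g. from two mappers) into a new one."""
        merged = StreamingHistogram(self.max_bins)
        merged.bins = sorted([list(b) for b in self.bins + other.bins])
        merged._shrink()
        return merged

    def total(self):
        """Number of points summarized so far."""
        return sum(count for _, count in self.bins)

    def _shrink(self):
        """Collapse the closest adjacent pair until at most max_bins remain."""
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            # Replace the pair with one bin at their count-weighted mean.
            self.bins[i:i + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]
```

In the map-only setup discussed below, each mapper would build one such
histogram per feature over its shard in a single pass and send it to the
master, which combines them with merge(). Memory per histogram stays bounded
by max_bins regardless of how many points are seen.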

On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]> wrote:

> Even better, there is already a good implementation of the histograms:
> https://github.com/bigmlcom/histogram
>
> -Andy
>
>
> On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
> > That's a winner...
> > Out of all of the algorithms I've looked at, the Ben-Haim/SPDT looks most
> > likely.  In batch mode it uses one pass over the data set, it can be used
> > in a streaming mode, and it has constant space and time requirements.  That
> > seems like the kind of scalable algorithm we're after.
> > I'm in!
> >
> >
> > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>
> >> Alternatively, the algorithm described in [1] is more straightforward,
> >> efficient, hadoop-compatible (using only mappers communicating to a
> >> master) and satisfies all our requirements so far. I would like to
> >> take a pass at implementing that, if anyone else is interested?
> >>
> >> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>
> >>
> >> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
> >>>
> >>> Why don't we start from
> >>>
> >>> https://github.com/ashenfad/hadooptree ?
> >>>
> >>> On 20 February 2013 13:25, Marty Kube <[email protected]>
> >>> wrote:
> >>>>
> >>>> Hi Lorenz,
> >>>>
> >>>> Very interesting, that's what I was asking for when I mentioned non-MR
> >>>> implementations :-)
> >>>>
> >>>> I have not looked at Spark before; interesting that it uses Mesos for
> >>>> clustering.  I'll check it out.
> >>>>
> >>>>
> >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>>>>
> >>>>> Hi Marty,
> >>>>>
> >>>>> I am currently working on a PLANET-like implementation on top of Spark:
> >>>>> http://spark-project.org
> >>>>>
> >>>>> I think this framework is a nice fit for the problem.
> >>>>> If the input data fits into the "total cluster memory" you benefit
> >>>>> from the caching of the RDDs.
> >>>>>
> >>>>> regards,
> >>>>>
> >>>>> lorenz
> >>>>>
> >>>>>
> >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> You had mentioned other "resource management" platforms like Giraph
> >>>>>> or Mesos.  I haven't looked at those yet.  I guess I was thinking of
> >>>>>> other parallelization frameworks.
> >>>>>>
> >>>>>> It's interesting that the PLANET folks thought it was really
> >>>>>> worthwhile working on top of MapReduce for all of the resource
> >>>>>> management that is built in.
> >>>>>>
> >>>>>>
> >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>>>>>>
> >>>>>>> If non-MR means map-only job with communicating mappers and a state
> >>>>>>> store,
> >>>>>>> I am down with that.
> >>>>>>>
> >>>>>>> What did you mean?
> >>>>>>>
> >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Right now I'd lean towards the PLANET model, or maybe a non-MR
> >>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
> >>>>>>>>
> >>>
> >>>
> >>> --
> >>> Dr Andy Twigg
> >>> Junior Research Fellow, St Johns College, Oxford
> >>> Room 351, Department of Computer Science
> >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>> [email protected] | +447799647538
> >>
> >>
> >>
> >
> >
>
>
>
>
