On the algorithm choice, Parallel Boosted Regression Trees looks interesting as well.

http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf

After reading that paper I had a question about the Streaming Parallel Decision Trees. The SPDT paper is for building one decision tree. I was thinking about how to extend that to a decision forest. I could see how one can build many trees in the master and use a random choice of the splitting variables (as opposed to checking all variables and using the best variable). However, it seems like the idea of sampling the dataset with replacement for each tree gets lost. Is that okay?

On 02/21/2013 03:19 AM, Ted Dunning wrote:
For quantile estimation, consider also streamlib at https://github.com/clearspring/stream-lib

The bigmlcom implementation looks more directly applicable, actually.

On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected] <mailto:[email protected]>> wrote:

    Even better, there is already a good implementation of the histograms:
    https://github.com/bigmlcom/histogram

    -Andy


    On 20 February 2013 22:50, Marty Kube <[email protected]
    <mailto:[email protected]>> wrote:
    > That's a winner...
    > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
    looks most
    > likely.  In batch mode it uses one pass over the data set, it
    can be used in
    > a streaming mode, and has constant space and time requirements.
     That seems
    > like the kind of scalable algorithm we're after.
    > I'm in!
    >
    >
    > On 02/20/2013 10:09 AM, Andy Twigg wrote:
    >>
    >> Alternatively, the algorithm described in [1] is more
    straightforward,
    >> efficient, hadoop-compatible (using only mappers communicating to a
    >> master) and satisfies all our requirements so far. I would like to
    >> take a pass at implementing that, if anyone else is interested?
    >>
    >> [1]
    http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
    >>
    >>
    >> On 20 February 2013 14:27, Andy Twigg <[email protected]
    <mailto:[email protected]>> wrote:
    >>>
    >>> Why don't we start from
    >>>
    >>> https://github.com/ashenfad/hadooptree ?
    >>>
    >>> On 20 February 2013 13:25, Marty Kube
    <[email protected] <mailto:[email protected]>>
    >>> wrote:
    >>>>
    >>>> Hi Lorenz,
    >>>>
    >>>> Very interesting, that's what I was asking for when I
    mentioned non-MR
    >>>> implementations :-)
    >>>>
    >>>> I have not looked at spark before, interesting that it uses
    Mesos for
    >>>> clustering.   I'll check it out.
    >>>>
    >>>>
    >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
    >>>>>
    >>>>> Hi Marty,
    >>>>>
    >>>>> i am currently working on a PLANET-like implementation on
    top of spark:
    >>>>> http://spark-project.org
    >>>>>
    >>>>> I think this framework is a nice fit for the problem.
    >>>>> If the input data fits into the "total cluster memory" you
    benefit from
    >>>>> the caching of the RDD's.
    >>>>>
    >>>>> regards,
    >>>>>
    >>>>> lorenz
    >>>>>
    >>>>>
    >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
    <[email protected] <mailto:[email protected]>>
    >>>>> wrote:
    >>>>>
    >>>>>> You had mentioned other "resource management" platforms
    like Giraph or
    >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think
    of other
    >>>>>> parallelization frameworks.
    >>>>>>
    >>>>>> It's interesting that the planet folks thought it was really
    >>>>>> worthwhile
    >>>>>> working on top of map reduce for all of the resource
    management that
    >>>>>> is
    >>>>>> built in.
    >>>>>>
    >>>>>>
    >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
    >>>>>>>
    >>>>>>> If non-MR means map-only job with communicating mappers
    and a state
    >>>>>>> store,
    >>>>>>> I am down with that.
    >>>>>>>
    >>>>>>> What did you mean?
    >>>>>>>
    >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
    >>>>>>> [email protected]
    <mailto:[email protected]>> wrote:
    >>>>>>>
    >>>>>>>> Right now I'd lean towards the planet model, or maybe a
    non-MR
    >>>>>>>> implementation.  Anyone have a good idea for a non-MR
    solution?
    >>>>>>>>
    >>>
    >>>
    >>> --
    >>> Dr Andy Twigg
    >>> Junior Research Fellow, St Johns College, Oxford
    >>> Room 351, Department of Computer Science
    >>> http://www.cs.ox.ac.uk/people/andy.twigg/
    >>> [email protected] <mailto:[email protected]> |
    +447799647538 <tel:%2B447799647538>
    >>
    >>
    >>
    >> --
    >> Dr Andy Twigg
    >> Junior Research Fellow, St Johns College, Oxford
    >> Room 351, Department of Computer Science
    >> http://www.cs.ox.ac.uk/people/andy.twigg/
    >> [email protected] <mailto:[email protected]> |
    +447799647538 <tel:%2B447799647538>
    >
    >



    --
    Dr Andy Twigg
    Junior Research Fellow, St Johns College, Oxford
    Room 351, Department of Computer Science
    http://www.cs.ox.ac.uk/people/andy.twigg/
    [email protected] <mailto:[email protected]> |
    +447799647538 <tel:%2B447799647538>



Reply via email to