On the algorithm choice, Parallel Boosted Regression Trees looks
interesting as well.
http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
After reading that paper I had a question about the Streaming Parallel
Decision Trees. The SPDT paper is for building one decision tree. I
was thinking about how to extend that to a decision forest. I could see
how one can build many trees in the master and use a random choice of
the splitting variables (as opposed to checking all variables and using
the best variable). However, it seems like the idea of sampling the
dataset with replacement for each tree gets lost. Is that okay?
On 02/21/2013 03:19 AM, Ted Dunning wrote:
For quantile estimation, consider also streamlib at
https://github.com/clearspring/stream-lib
The bigmlcom implementation looks more directly applicable, actually.
On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]
<mailto:[email protected]>> wrote:
Even better, there is already a good implementation of the histograms:
https://github.com/bigmlcom/histogram
-Andy
On 20 February 2013 22:50, Marty Kube <[email protected]
<mailto:[email protected]>> wrote:
> That's a winner...
> Out of all of the algorithms I've looked at the Ben-Haim/SPDT
looks most
> likely. In batch mode it uses one pass over the data set, it
can be used in
> a streaming mode, and has constant space and time requirements.
That seems
> like the kind of scalable algorithm we're after.
> I'm in!
>
>
> On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>
>> Alternatively, the algorithm described in [1] is more
straightforward,
>> efficient, hadoop-compatible (using only mappers communicating to a
>> master) and satisfies all our requirements so far. I would like to
>> take a pass at implementing that, if anyone else is interested?
>>
>> [1]
http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>
>>
>> On 20 February 2013 14:27, Andy Twigg <[email protected]
<mailto:[email protected]>> wrote:
>>>
>>> Why don't we start from
>>>
>>> https://github.com/ashenfad/hadooptree ?
>>>
>>> On 20 February 2013 13:25, Marty Kube
<[email protected] <mailto:[email protected]>>
>>> wrote:
>>>>
>>>> Hi Lorenz,
>>>>
>>>> Very interesting, that's what I was asking for when I
mentioned non-MR
>>>> implementations :-)
>>>>
>>>> I have not looked at spark before, interesting that it uses
Mesos for
>>>> clustering. I'll check it out.
>>>>
>>>>
>>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>>>
>>>>> Hi Marty,
>>>>>
>>>>> i am currently working on a PLANET-like implementation on
top of spark:
>>>>> http://spark-project.org
>>>>>
>>>>> I think this framework is a nice fit for the problem.
>>>>> If the input data fits into the "total cluster memory" you
benefit from
>>>>> the caching of the RDD's.
>>>>>
>>>>> regards,
>>>>>
>>>>> lorenz
>>>>>
>>>>>
>>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
<[email protected] <mailto:[email protected]>>
>>>>> wrote:
>>>>>
>>>>>> You had mentioned other "resource management" platforms
like Giraph or
>>>>>> Mesos. I haven't looked at those yet. I guess I was think
of other
>>>>>> parallelization frameworks.
>>>>>>
>>>>>> It's interesting that the planet folks thought it was really
>>>>>> worthwhile
>>>>>> working on top of map reduce for all of the resource
management that
>>>>>> is
>>>>>> built in.
>>>>>>
>>>>>>
>>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>>>
>>>>>>> If non-MR means map-only job with communicating mappers
and a state
>>>>>>> store,
>>>>>>> I am down with that.
>>>>>>>
>>>>>>> What did you mean?
>>>>>>>
>>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>>>>> [email protected]
<mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>>> Right now I'd lean towards the planet model, or maybe a
non-MR
>>>>>>>> implementation. Anyone have a good idea for a non-MR
solution?
>>>>>>>>
>>>
>>>
>>> --
>>> Dr Andy Twigg
>>> Junior Research Fellow, St Johns College, Oxford
>>> Room 351, Department of Computer Science
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>> [email protected] <mailto:[email protected]> |
+447799647538 <tel:%2B447799647538>
>>
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> [email protected] <mailto:[email protected]> |
+447799647538 <tel:%2B447799647538>
>
>
--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] <mailto:[email protected]> |
+447799647538 <tel:%2B447799647538>