For quantile estimation, also consider stream-lib: https://github.com/clearspring/stream-lib
The bigmlcom implementation looks more directly applicable, actually.

On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]> wrote:

> Even better, there is already a good implementation of the histograms:
> https://github.com/bigmlcom/histogram
>
> -Andy
>
>
> On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
> > That's a winner...
> > Out of all of the algorithms I've looked at the Ben-Haim/SPDT looks most
> > likely. In batch mode it uses one pass over the data set, it can be used in
> > a streaming mode, and has constant space and time requirements. That seems
> > like the kind of scalable algorithm we're after.
> > I'm in!
> >
> > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>
> >> Alternatively, the algorithm described in [1] is more straightforward,
> >> efficient, hadoop-compatible (using only mappers communicating to a
> >> master) and satisfies all our requirements so far. I would like to
> >> take a pass at implementing that, if anyone else is interested?
> >>
> >> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>
> >> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
> >>>
> >>> Why don't we start from
> >>>
> >>> https://github.com/ashenfad/hadooptree ?
> >>>
> >>> On 20 February 2013 13:25, Marty Kube <[email protected]> wrote:
> >>>>
> >>>> Hi Lorenz,
> >>>>
> >>>> Very interesting, that's what I was asking for when I mentioned non-MR
> >>>> implementations :-)
> >>>>
> >>>> I have not looked at spark before, interesting that it uses Mesos for
> >>>> clustering. I'll check it out.
> >>>>
> >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>>>>
> >>>>> Hi Marty,
> >>>>>
> >>>>> I am currently working on a PLANET-like implementation on top of spark:
> >>>>> http://spark-project.org
> >>>>>
> >>>>> I think this framework is a nice fit for the problem.
> >>>>> If the input data fits into the "total cluster memory" you benefit from
> >>>>> the caching of the RDDs.
> >>>>>
> >>>>> regards,
> >>>>>
> >>>>> lorenz
> >>>>>
> >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]> wrote:
> >>>>>
> >>>>>> You had mentioned other "resource management" platforms like Giraph or
> >>>>>> Mesos. I haven't looked at those yet. I guess I was thinking of other
> >>>>>> parallelization frameworks.
> >>>>>>
> >>>>>> It's interesting that the planet folks thought it was really worthwhile
> >>>>>> working on top of map reduce for all of the resource management that is
> >>>>>> built in.
> >>>>>>
> >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>>>>>>
> >>>>>>> If non-MR means map-only job with communicating mappers and a state
> >>>>>>> store, I am down with that.
> >>>>>>>
> >>>>>>> What did you mean?
> >>>>>>>
> >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
> >>>>>>>> implementation. Anyone have a good idea for a non-MR solution?
> >>>>>>>>
> >>>
> >>> --
> >>> Dr Andy Twigg
> >>> Junior Research Fellow, St Johns College, Oxford
> >>> Room 351, Department of Computer Science
> >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>> [email protected] | +447799647538
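
For anyone skimming the thread, here is a rough sketch of the update and merge steps of the streaming histogram from [1], as I understand them. It is plain Python rather than anything from bigmlcom/histogram or stream-lib; the class name, method names, and the max_bins parameter are made up for illustration, and the paper's sum/uniform procedures (needed for split candidates and quantiles) are left out. Exact duplicates are simply inserted as separate bins and merged later, which is a simplification.

```python
from bisect import insort


class StreamingHistogram:
    """Fixed-size list of [centroid, count] bins (hypothetical sketch)."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # kept sorted by centroid

    def update(self, value, count=1):
        # Add the point as its own bin, then shrink back to max_bins.
        insort(self.bins, [float(value), count])
        self._shrink()

    def merge(self, other):
        # Union the bins of both histograms, then shrink once.
        for centroid, count in other.bins:
            insort(self.bins, [centroid, count])
        self._shrink()

    def _shrink(self):
        while len(self.bins) > self.max_bins:
            # Find the adjacent pair of bins with the smallest gap.
            i = min(
                range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
            )
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            # Replace the pair with one bin at the count-weighted mean.
            merged = [(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]
            self.bins[i:i + 2] = [merged]

    def total(self):
        return sum(count for _, count in self.bins)


# Two "mappers" each summarise their share; the "master" merges them.
mapper_a = StreamingHistogram(max_bins=4)
mapper_b = StreamingHistogram(max_bins=4)
for x in [1, 2, 3, 10, 11, 12]:
    mapper_a.update(x)
for x in [20, 21, 22, 30, 31, 32]:
    mapper_b.update(x)
mapper_a.merge(mapper_b)
```

Each mapper holds at most max_bins bins no matter how many points it sees, which is where the constant space claim comes from, and only those bins need to travel to the master.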
