The basic idea is to extend OnlineSummarizer to report more quantiles, then combine the per-partition OnlineSummarizer estimates, weighting each by how much data it represents. This won't work if the data is perversely ordered, for example if each partition sees a different range of values. Hector's suggestion (reservoir sampling) will give you lower accuracy on randomly ordered data, but better accuracy in the worst case.
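To make the combining step concrete, here is a rough Python sketch. It is not Mahout code: `summarize` is a hypothetical stand-in for whatever a per-partition OnlineSummarizer would report, computed here exactly with the standard library, and `combine` is just one plausible reading of "weighted by how much data they represent".

```python
import statistics


def summarize(values):
    """Hypothetical stand-in for one partition's summarizer.

    Returns (count, [q1, median, q3]). The quartiles are computed exactly
    here; a real online summarizer would only approximate them.
    """
    return len(values), statistics.quantiles(values, n=4)


def combine(summaries):
    """Average each quantile across partitions, weighted by partition size."""
    total = sum(count for count, _ in summaries)
    width = len(summaries[0][1])
    return [
        sum(count * quantiles[i] for count, quantiles in summaries) / total
        for i in range(width)
    ]


# Perversely ordered input: each partition sees a different range of values,
# so the combined quartiles get pulled toward the middle of the data.
data = list(range(100))
combined = combine([summarize(data[:50]), summarize(data[50:])])
exact = statistics.quantiles(data, n=4)
print(combined, exact)
```

With randomly ordered input every partition looks like the whole data set, so the weighted average lands close to the true quartiles. In the sorted case above it does not, which is the failure mode mentioned earlier.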

On Fri, Apr 20, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Thank you, Ted.
>
> On Fri, Apr 20, 2012 at 2:30 PM, Ted Dunning <[email protected]> wrote:
> > Look at our OnlineSummarizer. This should be roughly parallelizable.
> >
> > On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> Thank you, sir. Let me consider this.
> >>
> >> On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee <[email protected]> wrote:
> >> > how about this
> >> >
> >> > http://en.wikipedia.org/wiki/Reservoir_sampling
> >> >
> >> > On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> There should be some way to compile quartiles in a map/reduce fashion
> >> >> (i.e. with api similar to Pig's Arithmetic custom function) without
> >> >> keeping enormous count hash?
> >> >> There's this countsketch thing that i implemented before on map
> >> >> reduce, but it is sort of like bloom filter: if it gives a wrong
> >> >> result, the error is fairly huge (in case of bloom filter, 100%) and
> >> >> to get good results it still requires quite a bit of memory
> >> >>
> >> >
> >> > --
> >> > Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
> >> > Professional Profile <http://www.linkedin.com/in/yeehector>
> >> > http://hectorgon.blogspot.com/ (tech + travel)
> >> > http://hectorgon.com (book reviews)
