One way to do this is to have the intermediate data for the aggregate function be a byte array operated on using Unsafe -- that plays very nicely with the binary data processing we are doing (i.e. fast serialization, no GC). The downside is that we'd need to re-implement whatever algorithms are out there in order for them to be usable with Unsafe/byte arrays.
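A minimal sketch of what such an Unsafe-backed aggregation buffer could look like (the UnsafeSumBuffer name and the count/sum layout are made up purely for illustration; sun.misc.Unsafe is obtained via the usual reflective trick):

import sun.misc.Unsafe

object UnsafeSumBuffer {
  // The usual reflective trick to obtain the Unsafe instance.
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }
  private val base: Long = unsafe.arrayBaseOffset(classOf[Array[Byte]]).toLong
}

// Hypothetical intermediate state for a count/sum aggregate held in a raw
// byte array: bytes [0, 8) store the count, bytes [8, 16) store the sum.
// The array itself is the serialized form, so shipping it between stages
// needs no Java serialization and allocates no per-row objects.
class UnsafeSumBuffer {
  import UnsafeSumBuffer._
  private val buf = new Array[Byte](16)

  def update(value: Double): Unit = {
    unsafe.putLong(buf, base, unsafe.getLong(buf, base) + 1L)
    unsafe.putDouble(buf, base + 8, unsafe.getDouble(buf, base + 8) + value)
  }

  def count: Long = unsafe.getLong(buf, base)
  def sum: Double = unsafe.getDouble(buf, base + 8)
  def bytes: Array[Byte] = buf
}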
On Thu, Jun 18, 2015 at 1:22 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> If it's going into the DataFrame API (which it probably should, rather
> than into RDD itself), then it could become a UDT (similar to
> HyperLogLogUDT), which would mean it doesn't have to implement
> Serializable, as it appears that serialization is taken care of in the
> UDT def (e.g.
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
> )
>
> If I understand UDT SerDe correctly?
>
> On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas <rorti...@linkedin.com.invalid> wrote:
>
>> Hi Grega and Reynold,
>>
>> Grega, if you still want to use t-digest, I filed this PR because I
>> thought your t-digest suggestion was a good idea.
>>
>> https://github.com/tdunning/t-digest/pull/56
>>
>> If it is helpful, feel free to do whatever with it.
>>
>> Regards,
>> Ray
>>
>> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> This email is good. Just one note -- a lot of people are swamped right
>>> before Spark Summit, so you might not get prompt responses this week.
>>>
>>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>>> I have some time to work on it now. What's a good way to continue the
>>>> discussions before coding it?
>>>>
>>>> This e-mail list, JIRA, or something else?
>>>>
>>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> I think those are great to have. I would put them in the DataFrame
>>>>> API, though, since this applies to structured data. Many of the
>>>>> advanced functions on PairRDDFunctions should really go into the
>>>>> DataFrame API now that we have it.
>>>>>
>>>>> One thing that would be great to understand is what state-of-the-art
>>>>> alternatives are out there. I did a quick Google Scholar search using
>>>>> the keyword "approximate quantile" and found some older papers. Just
>>>>> the first few I found:
>>>>>
>>>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by Bell Labs
>>>>>
>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>>> by Bruce Lindsay, IBM
>>>>>
>>>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>>>
>>>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I'd like to get the community's opinion on implementing a generic
>>>>>> quantile approximation algorithm for Spark that is O(n) and requires
>>>>>> limited memory. I would find it useful and I haven't found any
>>>>>> existing implementation. The plan was basically to wrap t-digest
>>>>>> <https://github.com/tdunning/t-digest>, implement the
>>>>>> serialization/deserialization boilerplate, and provide
>>>>>>
>>>>>> def cdf(x: Double): Double
>>>>>> def quantile(q: Double): Double
>>>>>>
>>>>>> on RDD[Double] and RDD[(K, Double)].
>>>>>>
>>>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>>>
>>>>>> Best,
>>>>>> Grega
>>>>>> --
>>>>>> *Grega Kešpret*
>>>>>> Senior Software Engineer, Analytics
>>>>>>
>>>>>> Skype: gregakespret
>>>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>>>> <http://www.twitter.com/celtramobile>
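For concreteness, a minimal sketch of the wrapper Grega proposes, assuming t-digest's TDigest.createMergingDigest, add, cdf, and quantile methods and a digest that is Serializable (the subject of the PR linked above); the QuantileDigest name and the treeAggregate plumbing are illustrative, not an existing Spark API:

import org.apache.spark.rdd.RDD
import com.tdunning.math.stats.TDigest

// One digest per partition, merged pairwise: a single pass over the data
// with bounded memory per digest, matching the stated O(n) goal.
class QuantileDigest(digest: TDigest) extends Serializable {
  def cdf(x: Double): Double = digest.cdf(x)
  def quantile(q: Double): Double = digest.quantile(q)
}

object QuantileDigest {
  def of(data: RDD[Double], compression: Double = 100.0): QuantileDigest = {
    val merged = data.treeAggregate(TDigest.createMergingDigest(compression))(
      (d, x) => { d.add(x); d }, // fold each partition's values into its digest
      (a, b) => { a.add(b); a }  // merge the per-partition digests
    )
    new QuantileDigest(merged)
  }
}

The RDD[(K, Double)] variant would be the same idea with aggregateByKey in place of treeAggregate.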