One way to do this is to have the intermediate data for the aggregate function be a byte array operated on using Unsafe -- that plays very nicely with the binary data processing we are doing (i.e. fast serialization, no GC). The downside is that we'd need to re-implement whatever algorithms are out there in order for them to be usable with Unsafe/byte arrays.
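A minimal sketch of what such an Unsafe-backed aggregation buffer could look like (the UnsafeSumBuffer name and the count/sum layout are made up purely for illustration; sun.misc.Unsafe is obtained via the usual reflective trick):

import sun.misc.Unsafe

object UnsafeSumBuffer {
  // The usual reflective trick to obtain the Unsafe instance.
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }
  private val base: Long = unsafe.arrayBaseOffset(classOf[Array[Byte]]).toLong
}

// Hypothetical intermediate state for a count/sum aggregate held in a raw
// byte array: bytes [0, 8) store the count, bytes [8, 16) store the sum.
// The array itself is the serialized form, so shipping it between stages
// needs no Java serialization and allocates no per-row objects.
class UnsafeSumBuffer {
  import UnsafeSumBuffer._
  private val buf = new Array[Byte](16)

  def update(value: Double): Unit = {
    unsafe.putLong(buf, base, unsafe.getLong(buf, base) + 1L)
    unsafe.putDouble(buf, base + 8, unsafe.getDouble(buf, base + 8) + value)
  }

  def count: Long = unsafe.getLong(buf, base)
  def sum: Double = unsafe.getDouble(buf, base + 8)
  def bytes: Array[Byte] = buf
}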
On Thu, Jun 18, 2015 at 1:22 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> If it's going into the DataFrame API (which it probably should, rather
> than into RDD itself), then it could become a UDT (similar to
> HyperLogLogUDT), which would mean it doesn't have to implement
> Serializable, as it appears that serialization is taken care of in the
> UDT def (e.g.
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
> )
>
> If I understand UDT SerDe correctly?
>
> On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas <rorti...@linkedin.com.invalid> wrote:
>
>> Hi Grega and Reynold,
>>
>> Grega, if you still want to use t-digest, I filed this PR because I
>> thought your t-digest suggestion was a good idea.
>>
>> https://github.com/tdunning/t-digest/pull/56
>>
>> If it is helpful, feel free to do whatever with it.
>>
>> Regards,
>> Ray
>>
>> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> This email is good. Just one note -- a lot of people are swamped right
>>> before Spark Summit, so you might not get prompt responses this week.
>>>
>>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>>> I have some time to work on it now. What's a good way to continue the
>>>> discussions before coding it?
>>>>
>>>> This e-mail list, JIRA, or something else?
>>>>
>>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> I think those are great to have. I would put them in the DataFrame
>>>>> API, though, since this applies to structured data. Many of the
>>>>> advanced functions on PairRDDFunctions should really go into the
>>>>> DataFrame API now that we have it.
>>>>>
>>>>> One thing that would be great to understand is what state-of-the-art
>>>>> alternatives are out there. I did a quick Google Scholar search using
>>>>> the keyword "approximate quantile" and found some older papers. Just
>>>>> the first few I found:
>>>>>
>>>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by Bell Labs
>>>>>
>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>>> by Bruce Lindsay, IBM
>>>>>
>>>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>>>
>>>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I'd like to get the community's opinion on implementing a generic
>>>>>> quantile approximation algorithm for Spark that is O(n) and requires
>>>>>> limited memory. I would find it useful and I haven't found any
>>>>>> existing implementation. The plan was basically to wrap t-digest
>>>>>> <https://github.com/tdunning/t-digest>, implement the
>>>>>> serialization/deserialization boilerplate, and provide
>>>>>>
>>>>>> def cdf(x: Double): Double
>>>>>> def quantile(q: Double): Double
>>>>>>
>>>>>> on RDD[Double] and RDD[(K, Double)].
>>>>>>
>>>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>>>
>>>>>> Best,
>>>>>> Grega
>>>>>> --
>>>>>> *Grega Kešpret*
>>>>>> Senior Software Engineer, Analytics
>>>>>>
>>>>>> Skype: gregakespret
>>>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>>>> <http://www.twitter.com/celtramobile>
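For concreteness, a minimal sketch of the wrapper Grega proposes, assuming t-digest's TDigest.createMergingDigest, add, cdf, and quantile methods and a digest that is Serializable (the subject of the PR linked above); the QuantileDigest name and the treeAggregate plumbing are illustrative, not an existing Spark API:

import org.apache.spark.rdd.RDD
import com.tdunning.math.stats.TDigest

// One digest per partition, merged pairwise: a single pass over the data
// with bounded memory per digest, matching the stated O(n) goal.
class QuantileDigest(digest: TDigest) extends Serializable {
  def cdf(x: Double): Double = digest.cdf(x)
  def quantile(q: Double): Double = digest.quantile(q)
}

object QuantileDigest {
  def of(data: RDD[Double], compression: Double = 100.0): QuantileDigest = {
    val merged = data.treeAggregate(TDigest.createMergingDigest(compression))(
      (d, x) => { d.add(x); d }, // fold each partition's values into its digest
      (a, b) => { a.add(b); a }  // merge the per-partition digests
    )
    new QuantileDigest(merged)
  }
}

The RDD[(K, Double)] variant would be the same idea with aggregateByKey in place of treeAggregate.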