It's already possible to just copy the code from countApproxDistinct
<https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153>
and
access the HLL directly, or do anything you like.

On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Any thoughts?
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <nick.pentre...@gmail.com
> > wrote:
>
>> Hey Spark devs
>>
>> I've been looking at DF UDFs and UDAFs. The approx distinct is using
>> hyperloglog,
>> but there is only an option to return the count as a Long.
>>
>> It can be useful to be able to return and store the actual data structure
>> (ie serialized HLL). This effectively allows one to do aggregation /
>> rollups over columns while still preserving the ability to get distinct
>> counts.
>>
>> For example, one can store daily aggregates of events, grouped by various
>> columns, while storing for each grouping the HLL of say unique users. So
>> you can get the uniques per day directly but could also very easily do
>> arbitrary aggregates (say monthly, annually) and still be able to get a
>> unique count for that period by merging the daily HLLS.
>>
>> I did this a while back as a Hive UDAF (
>> https://github.com/MLnick/hive-udf) which returns a Struct field
>> containing a "cardinality" field and a "binary" field containing the
>> serialized HLL.
>>
>> I was wondering if there would be interest in something like this? I am
>> not so clear on how UDTs work with regards to SerDe - so could one adapt
>> the HyperLogLogUDT to be a Struct with the serialized HLL as a field as
>> well as count as a field? Then I assume this would automatically play
>> nicely with DataFrame I/O etc. The gotcha is one needs to then call
>> "approx_count_field.count" (or is there a concept of a "default field" for
>> a Struct?).
>>
>> Also, being able to provide the bitsize parameter may be useful...
>>
>> The same thinking would apply potentially to other approximate (and
>> mergeable) data structures like T-Digest and maybe CMS.
>>
>> Nick
>>
>
>

Reply via email to