It's already possible to just copy the code from countApproxDistinct <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153> and access the HLL directly, or do anything you like.
On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: > Any thoughts? > > — > Sent from Mailbox <https://www.dropbox.com/mailbox> > > > On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <nick.pentre...@gmail.com > > wrote: > >> Hey Spark devs >> >> I've been looking at DF UDFs and UDAFs. The approx distinct is using >> hyperloglog, >> but there is only an option to return the count as a Long. >> >> It can be useful to be able to return and store the actual data structure >> (ie serialized HLL). This effectively allows one to do aggregation / >> rollups over columns while still preserving the ability to get distinct >> counts. >> >> For example, one can store daily aggregates of events, grouped by various >> columns, while storing for each grouping the HLL of say unique users. So >> you can get the uniques per day directly but could also very easily do >> arbitrary aggregates (say monthly, annually) and still be able to get a >> unique count for that period by merging the daily HLLS. >> >> I did this a while back as a Hive UDAF ( >> https://github.com/MLnick/hive-udf) which returns a Struct field >> containing a "cardinality" field and a "binary" field containing the >> serialized HLL. >> >> I was wondering if there would be interest in something like this? I am >> not so clear on how UDTs work with regards to SerDe - so could one adapt >> the HyperLogLogUDT to be a Struct with the serialized HLL as a field as >> well as count as a field? Then I assume this would automatically play >> nicely with DataFrame I/O etc. The gotcha is one needs to then call >> "approx_count_field.count" (or is there a concept of a "default field" for >> a Struct?). >> >> Also, being able to provide the bitsize parameter may be useful... >> >> The same thinking would apply potentially to other approximate (and >> mergeable) data structures like T-Digest and maybe CMS. >> >> Nick >> > >