Any thoughts?
On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Hey Spark devs
>
> I've been looking at DF UDFs and UDAFs. The approx distinct is using
> HyperLogLog, but there is only an option to return the count as a Long.
>
> It can be useful to be able to return and store the actual data structure
> (i.e. the serialized HLL). This effectively allows one to do aggregations /
> roll-ups over columns while still preserving the ability to get distinct
> counts.
>
> For example, one can store daily aggregates of events, grouped by various
> columns, while storing for each grouping the HLL of, say, unique users. So
> you can get the uniques per day directly, but you could also very easily do
> arbitrary aggregates (say monthly or annually) and still get a unique count
> for that period by merging the daily HLLs.
>
> I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf),
> which returns a Struct field containing a "cardinality" field and a
> "binary" field containing the serialized HLL.
>
> I was wondering if there would be interest in something like this? I am not
> so clear on how UDTs work with regard to SerDe - so could one adapt the
> HyperLogLogUDT to be a Struct with the serialized HLL as one field and the
> count as another? Then I assume this would automatically play nicely with
> DataFrame I/O etc. The gotcha is that one then needs to call
> "approx_count_field.count" (or is there a concept of a "default field" for
> a Struct?).
>
> Also, being able to provide the bit-size parameter may be useful...
>
> The same thinking would potentially apply to other approximate (and
> mergeable) data structures like T-Digest and maybe CMS.
>
> Nick
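
For concreteness, here is a rough sketch of the daily-to-monthly rollup idea, assuming stream-lib's HyperLogLogPlus (stream-lib being the library the current approx distinct is built on). The helper names and the precision default are purely illustrative:

    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

    // Build a sketch over one day's user ids. p is the precision (register
    // bits): it controls both the error bound and the serialized size.
    def dailySketch(userIds: Iterator[String], p: Int = 12): Array[Byte] = {
      val hll = new HyperLogLogPlus(p)
      userIds.foreach(id => hll.offer(id))
      hll.getBytes // serialized HLL, suitable for storing in a binary column
    }

    // Merge previously stored daily sketches into a single rollup count.
    // All sketches must have been built with the same precision p.
    def mergedUniques(dailyBytes: Seq[Array[Byte]]): Long = {
      val merged = dailyBytes
        .map(b => HyperLogLogPlus.Builder.build(b)) // deserialize each HLL
        .reduce { (a, b) => a.addAll(b); a }        // in-place merge
      merged.cardinality()
    }

So the raw events never need to be re-scanned: a month's unique count is just mergedUniques over that month's stored daily sketches.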
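
And a very rough, untested sketch of the UDT question, written against the current UserDefinedType developer API: sqlType as a Struct carrying both the count and the serialized bytes. Exactly what catalyst representation serialize() should hand back is the SerDe detail in question, so treat this as a strawman (the class and field layout are mine):

    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Strawman UDT: both fields would be visible to DataFrame I/O, and the
    // binary field stays mergeable across rollups.
    class HyperLogLogStructUDT extends UserDefinedType[HyperLogLogPlus] {

      override def sqlType: DataType = StructType(Seq(
        StructField("cardinality", LongType, nullable = false),
        StructField("binary", BinaryType, nullable = false)))

      override def serialize(obj: Any): Any = obj match {
        case hll: HyperLogLogPlus => Row(hll.cardinality(), hll.getBytes)
      }

      override def deserialize(datum: Any): HyperLogLogPlus = datum match {
        case row: Row =>
          HyperLogLogPlus.Builder.build(row.getAs[Array[Byte]](1))
      }

      override def userClass: Class[HyperLogLogPlus] =
        classOf[HyperLogLogPlus]
    }

If something like this works, selecting "approx_count_field.cardinality" would give the count directly while the binary field remains available for further merging - modulo how struct fields inside a UDT are actually surfaced, which is part of the question.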