Any thoughts?



On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Hey Spark devs
> I've been looking at DF UDFs and UDAFs. The approximate distinct count
> uses HyperLogLog, but there is only an option to return the count as a
> Long.
> It can be useful to be able to return and store the actual data structure
> (i.e. the serialized HLL). This effectively allows one to do aggregation /
> rollups over columns while still preserving the ability to get distinct
> counts.
> For example, one can store daily aggregates of events, grouped by various
> columns, while storing for each grouping the HLL of, say, unique users. You
> can then get the uniques per day directly, but could also very easily do
> arbitrary rollups (say monthly or annually) and still get a unique count
> for that period by merging the daily HLLs, as sketched below.
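>
> To make the merging concrete, here is a minimal sketch assuming the
> stream-lib HyperLogLogPlus (which I believe Spark uses under the hood for
> approxCountDistinct); the names day1/day2 and the precision value are
> illustrative, not a proposed API:
>
>     import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
>
>     // One sketch per day, each fed that day's user ids.
>     val day1 = new HyperLogLogPlus(14)  // precision p = 14
>     val day2 = new HyperLogLogPlus(14)
>     Seq("user-a", "user-b").foreach(u => day1.offer(u))
>     Seq("user-b", "user-c").foreach(u => day2.offer(u))
>
>     day1.cardinality()                  // uniques for day 1
>     day2.cardinality()                  // uniques for day 2
>
>     // Merging the daily sketches yields uniques for the whole period.
>     val period = day1.merge(day2)
>     period.cardinality()                // ~3 distinct users across both days
>
>     // The sketch itself serializes for storage alongside the count.
>     val bytes: Array[Byte] = day1.getBytes
>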
> I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf)
> which returns a Struct containing a "cardinality" field and a "binary"
> field holding the serialized HLL.
> I was wondering if there would be interest in something like this? I am not
> so clear on how UDTs work with regard to SerDe, so could one adapt the
> HyperLogLogUDT to be a Struct with the serialized HLL as one field and the
> count as another? Then I assume this would automatically play nicely with
> DataFrame I/O etc. The gotcha is that one then needs to call
> "approx_count_field.count" (or is there a concept of a "default field" for
> a Struct?).
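>
> To illustrate that gotcha, assuming a hypothetical approx_distinct_hll
> aggregate returning the struct above, usage would look something like:
>
>     // approx_distinct_hll is hypothetical; it does not exist in Spark.
>     val daily = events
>       .groupBy("day")
>       .agg(approx_distinct_hll($"user_id").as("uniques"))
>
>     // The extra ".cardinality" step is the awkward part.
>     daily.select($"day", $"uniques.cardinality")
>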
> Also, being able to provide the bit-size (precision) parameter may be useful...
> The same thinking could potentially apply to other approximate (and
> mergeable) data structures such as T-Digest and maybe Count-Min Sketch (CMS).
> Nick
