Hello Nick,

I have been working on a (UDT-less) implementation of HLL++. You can find the PR here: https://github.com/apache/spark/pull/8362. This currently implements the dense version of HLL++, which is a further development of HLL. It returns a Long, but it shouldn't be too hard to return a Row containing the cardinality and/or the HLL registers (the binary data).
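For illustration - this is not the code in the PR, just a rough sketch of that shape against the public 1.5 UDAF interface, using the stream-lib HyperLogLogPlus that Spark already bundles; the class name and default precision are made up, and re-serializing the registers on every update is obviously not how you would do it for real:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Illustrative sketch: keep the serialized HLL+ registers in the aggregation
// buffer and return a struct of (cardinality, registers) instead of a bare Long.
class HllStructUDAF(p: Int = 12) extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def bufferSchema: StructType =
    StructType(StructField("registers", BinaryType) :: Nil)

  // Return Row(cardinality, registers) so the merged sketch can be stored.
  override def dataType: DataType = StructType(
    StructField("cardinality", LongType) ::
    StructField("registers", BinaryType) :: Nil)

  override def deterministic: Boolean = true

  private def fromBytes(bytes: Array[Byte]): HyperLogLogPlus =
    HyperLogLogPlus.Builder.build(bytes)

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = new HyperLogLogPlus(p).getBytes

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      // Not efficient: deserializes and re-serializes the registers per row.
      val hll = fromBytes(buffer.getAs[Array[Byte]](0))
      hll.offer(input.getString(0))
      buffer(0) = hll.getBytes
    }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val merged = fromBytes(buffer1.getAs[Array[Byte]](0))
    merged.addAll(fromBytes(buffer2.getAs[Array[Byte]](0)))
    buffer1(0) = merged.getBytes
  }

  override def evaluate(buffer: Row): Any = {
    val hll = fromBytes(buffer.getAs[Array[Byte]](0))
    Row(hll.cardinality(), hll.getBytes)
  }
}

Something like that could then be registered with sqlContext.udf.register("hll_struct", new HllStructUDAF()) and used in a groupBy/agg like any other aggregate.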
I am curious what the stance is on using UDTs in the new UDAF interface. Is this still viable? It wouldn't work with UnsafeRow, for instance. The OpenHashSetUDT, for example, would be a nice building block for CollectSet and all Distinct Aggregate operators. Are there any opinions on this?

Kind regards,

Herman van Hövell tot Westerflier
QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
[email protected]
+599 9 521 4402

2015-09-12 10:07 GMT+02:00 Nick Pentreath <[email protected]>:

> Inspired by this post: http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/, I've started putting together something based on the Spark 1.5 UDAF interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>
> Some questions -
>
> 1. How do I get the UDAF to accept input arguments of different types? We can hash basically anything for HLL - Int, Long, String, Object, raw bytes, etc. Right now it seems we'd need to build a new UDAF for each input type, which seems strange - I should be able to use one UDAF that can handle raw input of different types, as well as handle existing HLLs that can be merged/aggregated (e.g. for grouped data).
> 2. @Reynold, how would I ensure this works for Tungsten (i.e. against raw bytes in memory)? Or does the new Aggregate2 stuff automatically do that? Where should I look for examples of how this works internally?
> 3. I've based this on the Sum and Avg examples for the new UDAF interface - if there are any suggestions or issues, please advise. Is the intermediate buffer efficient?
> 4. The current HyperLogLogUDT is private - so I've had to make my own one, which is a bit pointless as it's copy-pasted. Any thoughts on exposing that type? Or I need to make the package spark.sql ...
>
> Nick
>
> On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin <[email protected]> wrote:
>
>> Yes - it's very interesting. However, ideally we should have a version of hyperloglog that can work directly against some raw bytes in memory (rather than Java objects), in order for this to fit the Tungsten execution model where everything operates directly against some memory address.
>>
>> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <[email protected]> wrote:
>>
>>> Sure, I can copy the code, but my aim was more to understand:
>>>
>>> (a) if this is broadly interesting enough to folks to think about updating / extending the existing UDAF within Spark
>>> (b) how to register one's own custom UDAF - in which case it could be a Spark package, for example
>>>
>>> All the examples deal with registering a UDF, but there is nothing about UDAFs.
>>>
>>> On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos <[email protected]> wrote:
>>>
>>>> It's already possible to just copy the code from countApproxDistinct <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153> and access the HLL directly, or do anything you like.
>>>>
>>>> On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath <[email protected]> wrote:
>>>>
>>>>> Any thoughts?
>>>>>
>>>>> On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <[email protected]> wrote:
>>>>>
>>>>>> Hey Spark devs,
>>>>>>
>>>>>> I've been looking at DF UDFs and UDAFs. The approx distinct is using hyperloglog, but there is only an option to return the count as a Long.
>>>>>>
>>>>>> It can be useful to be able to return and store the actual data structure (i.e. the serialized HLL). This effectively allows one to do aggregation / rollups over columns while still preserving the ability to get distinct counts.
>>>>>>
>>>>>> For example, one can store daily aggregates of events, grouped by various columns, while storing for each grouping the HLL of, say, unique users. So you can get the uniques per day directly, but you could also very easily do arbitrary aggregates (say monthly or annually) and still be able to get a unique count for that period by merging the daily HLLs.
>>>>>>
>>>>>> I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf) which returns a Struct field containing a "cardinality" field and a "binary" field containing the serialized HLL.
>>>>>>
>>>>>> I was wondering if there would be interest in something like this? I am not so clear on how UDTs work with regard to SerDe - so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as the count as a field? Then I assume this would automatically play nicely with DataFrame I/O etc. The gotcha is that one then needs to call "approx_count_field.count" (or is there a concept of a "default field" for a Struct?).
>>>>>>
>>>>>> Also, being able to provide the bitsize parameter may be useful...
>>>>>>
>>>>>> The same thinking would apply potentially to other approximate (and mergeable) data structures like T-Digest and maybe CMS.
>>>>>>
>>>>>> Nick
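On the rollup use case quoted above (daily HLLs merged up to monthly or annual uniques): assuming the daily aggregates keep the serialized HLL+ bytes - the "binary" field of the Hive UDAF, or the registers field of a struct-returning Spark aggregate - the larger-period count is just deserialize, merge, cardinality. A rough sketch with stream-lib; the helper name is made up and it assumes all sketches were built with the same precision:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// Hypothetical helper: merge pre-aggregated (e.g. daily) serialized HLL+ sketches
// and read off the distinct count for the larger period (month, year, ...).
def mergeSerializedHlls(sketches: Seq[Array[Byte]]): Long = {
  require(sketches.nonEmpty, "need at least one sketch to merge")
  val merged = sketches
    .map(bytes => HyperLogLogPlus.Builder.build(bytes))
    .reduce { (a, b) =>
      a.addAll(b) // register-wise merge, same result as sketching the union of the inputs
      a
    }
  merged.cardinality()
}

Wrapping that in another small merge-only aggregate that takes the binary column as input would give the arbitrary-period rollups directly in a groupBy.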
