Re: Integration of DataSketches into Flink

Seth Wiesman Mon, 27 Apr 2020 12:28:43 -0700

Hi Lee,

I really like this project; I used it with Flink a few years ago, when it
was still Yahoo DataSketches. The projects clearly complement each other.
As Arvid mentioned, the Flink community is trying to foster an ecosystem
larger than what is in the main Flink repository. The reason is that the
project has grown to such a scale that it cannot reasonably maintain
everything. To encourage that sort of growth, Flink is extensively
pluggable, which means that components do not need to live within the main
repository to be treated as first-class.


I'd like to outline some things the DataSketches community could do to
integrate with Flink.

1) Create a page on the Flink Packages website.

The Flink community hosts a website called Flink Packages to increase the
visibility of ecosystem projects among the Flink user base[1]. DataSketches
is usable from Flink today, so I'd encourage you to create a page right
away.

2) Implement TypeInformation for DataSketches

TypeInformation is Flink's internal type system and is used as a factory
for creating serializers for different types. These serializers are what
Flink uses when shuffling data around the cluster and when storing records
in state backends. You would provide TypeInformation instances for the
different sketch types; these would just be wrappers around the existing
serializers in the DataSketches codebase, so this should be relatively
straightforward. There is no DataStream aggregation API in the way you are
describing, so this is the *only* step you would need to take to provide
first-class support for the Flink DataStream API[2][3].
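
To make the wrapping idea in step 2 a bit more concrete, here is a minimal
sketch in plain Java. It deliberately does not use the real Flink or
DataSketches classes: ToySketch, ByteArrayWrappingSerializer, and all of
their methods are hypothetical stand-ins. The only assumption is that a
sketch can turn itself into a byte array and be rebuilt from one. A real
integration would implement Flink's own TypeInformation and serializer
classes instead, but the core would probably be a length-prefixed byte copy
much like this one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.function.Function;

// Hypothetical stand-in for a sketch: it only needs to convert to and from bytes.
final class ToySketch {
    final long count;
    final long maxSeen;

    ToySketch(long count, long maxSeen) {
        this.count = count;
        this.maxSeen = maxSeen;
    }

    byte[] toByteArray() {
        return ByteBuffer.allocate(2 * Long.BYTES).putLong(count).putLong(maxSeen).array();
    }

    static ToySketch fromByteArray(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new ToySketch(buf.getLong(), buf.getLong());
    }
}

// Hypothetical wrapper: delegates to the sketch's own byte format and adds a
// length prefix so the reader knows how many bytes belong to this record.
final class ByteArrayWrappingSerializer<T> {
    private final Function<T, byte[]> toBytes;
    private final Function<byte[], T> fromBytes;

    ByteArrayWrappingSerializer(Function<T, byte[]> toBytes, Function<byte[], T> fromBytes) {
        this.toBytes = toBytes;
        this.fromBytes = fromBytes;
    }

    void serialize(T sketch, DataOutput target) throws IOException {
        byte[] bytes = toBytes.apply(sketch);
        target.writeInt(bytes.length);
        target.write(bytes);
    }

    T deserialize(DataInput source) throws IOException {
        byte[] bytes = new byte[source.readInt()];
        source.readFully(bytes);
        return fromBytes.apply(bytes);
    }

    // Round trip through an in-memory buffer, as a stand-in for a network shuffle.
    T copy(T sketch) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            serialize(sketch, new DataOutputStream(buffer));
            return deserialize(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, supporting another sketch type would only mean passing a
different pair of to-bytes and from-bytes functions to the wrapper.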

3) Implement sketch UDFs

Along with its Java API, Flink also offers a relational API and UDFs. The
community could provide UDFs for DataSketches, as was done for Hive. Doing
so only requires implementing the aggregation function interface[4]. Flink
SQL offers the concept of modules, which are collections of SQL UDFs that
can easily be loaded into the system[5]. A DataSketches SQL module would
provide a simple way for users to get started and expose these UDFs as if
they were native to Flink.
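
As a rough illustration of what such an aggregation function boils down to,
here is a self-contained sketch in plain Java. Nothing in it is the real
Flink or DataSketches API: ToyKmvSketch is a hypothetical, simplified
K-Minimum-Values distinct counter, and DistinctCountAggregate with its
method names is an assumed shape that loosely follows the
initialize/update/merge/result lifecycle Lee describes further down the
thread. A real UDF would delegate to an actual sketch from the library.

```java
import java.util.TreeSet;

// Hypothetical, simplified K-Minimum-Values sketch for approximate distinct counts.
final class ToyKmvSketch {
    private final int k;
    // The k smallest hash values seen so far, compared as unsigned 64-bit numbers.
    private final TreeSet<Long> smallest = new TreeSet<Long>(Long::compareUnsigned);

    ToyKmvSketch(int k) {
        this.k = k;
    }

    void update(long item) {
        insert(mix(item));
    }

    void merge(ToyKmvSketch other) {
        for (long hash : other.smallest) {
            insert(hash);
        }
    }

    double estimate() {
        if (smallest.size() < k) {
            return smallest.size(); // fewer than k distinct hashes: the count is exact
        }
        // Map the k-th smallest hash to [0, 1) and apply the usual KMV estimator.
        double kth = (smallest.last() >>> 11) / (double) (1L << 53);
        return (k - 1) / kth;
    }

    private void insert(long hash) {
        smallest.add(hash);
        if (smallest.size() > k) {
            smallest.pollLast(); // drop the largest so only k values are kept
        }
    }

    // splitmix64 finalizer: a bijective 64-bit mixing function used as the hash.
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}

// Hypothetical aggregation wrapper: one method per lifecycle step.
final class DistinctCountAggregate {
    ToyKmvSketch createAccumulator() {
        return new ToyKmvSketch(256);
    }

    ToyKmvSketch add(long value, ToyKmvSketch acc) {
        acc.update(value);
        return acc;
    }

    ToyKmvSketch merge(ToyKmvSketch a, ToyKmvSketch b) {
        a.merge(b);
        return a;
    }

    double getResult(ToyKmvSketch acc) {
        return acc.estimate();
    }
}
```

The wrapper itself holds no state. Everything lives in the accumulator,
which is what allows partial results from parallel tasks to be merged
before the final estimate is read.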

I hope this helps. I look forward to watching the DataSketches community grow!

Seth

[1] https://flink-packages.org/
[2]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
[3]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
[4]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
[5]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html


On Mon, Apr 27, 2020 at 12:57 PM Flavio Pompermaier <[email protected]>
wrote:

> If this can encourage Lee, I'm one of the Flink users that already use
> datasketches and I find it an amazing library.
> When I was trying it out (last year) I tried to stimulate some discussion[1]
> but at that time it was probably too early.
> I really hope that now things are mature for both communities!
>
> [1]
>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html
>
> Best,
> Flavio
>
> On Mon, Apr 27, 2020 at 7:37 PM leerho <[email protected]> wrote:
>
> > Hi Arvid,
> >
> > Note: I am dual listing this thread on both dev lists for better
> > tracking.
> >
> > >    1. I'm curious on how you would estimate the effort to port
> > >    datasketches to Flink? It already has a Java API, but how difficult
> > >    would it be to subdivide the tasks into parallel chunks of work?
> > >    Since it's already ported on Pig, I think we could use this port as
> > >    a baseline
> >
> >
> > Most systems (including Druid, Hive, Pig, Spark, PostgreSQL, other
> > databases, streaming platforms, Map-Reduce platforms, etc.) have some sort
> > of aggregation API, which allows users to plug in custom aggregation
> > functions.  Typical functions found in these APIs are Initialize(),
> > Update() (or Add()), Merge(), and getResult().  How these are named and
> > how they operate varies considerably from system to system.  These APIs
> > are sometimes called User Defined Functions (UDFs) or User Defined
> > Aggregation Functions (UDAFs).
> >
> > DataSketches is a library of sketching (streaming) aggregation functions,
> > each of which performs a specific type of aggregation: for example,
> > counting unique items, determining quantiles and histograms of unknown
> > distributions, identifying the most frequent items (heavy hitters) in a
> > stream, etc.  The advantage of using DataSketches is that the sketches are
> > extremely fast, small in size, and have well-defined error properties
> > established by published scientific papers that define the underlying
> > mathematics.
> >
> > The task of porting DataSketches is usually developing a thin wrapping
> > layer that translates the specific UDAF API of Flink to the equivalent API
> > methods of the targeted sketches in the library.  This is best done by
> > someone with deep knowledge of the UDAF code of the targeted system.  We
> > are certainly available to answer questions about the DataSketches APIs.
> > Although we did write the UDAF layers for Hive and Pig, we did that as a
> > proof of concept and an example of how to write such layers.  We are a
> > small team and are not in a position to support these integration layers
> > for every system out there.
> >
> > 2. Do you have any idea who is usually driving the adoptions?
> >
> >
> > To start, you only need to write the UDAF layer for the sketches that you
> > think would be in most demand by your users.  The big 4 categories are
> > distinct (unique) counting, quantiles, frequent-items, and sampling.  This
> > is a natural way of subdividing the task: choose the sketches you want to
> > adapt and in what order.  Each sketch is independent, so it can be adapted
> > whenever it is needed.
> >
> > Please let us know if you have any further questions :)
> >
> > Lee.
> >
> >
> >
> >
> > On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <[email protected]> wrote:
> >
> > > Hi Lee,
> > >
> > > I must admit that I also heard of data sketches for the first time
> > > (there are really many Apache projects).
> > >
> > > Datasketches sounds really exciting. As a (former) data engineer, I can
> > > 100% say that this is something that (end-)users want and need, and it
> > > would make so much sense to have it in Flink from the get-go.
> > > Flink, however, is already quite an old project, which grew at a strong
> > > pace, leading to some 150 modules in the core. We are currently in the
> > > process of restructuring that and reducing the number of things in the
> > > core, such that build times and stability improve.
> > >
> > > To counter that, we created Flink packages [1], which includes
> > > everything new that we deem not to be essential. I'd propose to
> > > incorporate a Flink datasketch package there. If it seems like it's
> > > becoming essential, we can still move it to core at a later point.
> > >
> > > As I have seen on the page, there are already plenty of adoptions. That
> > > leaves me with a few questions.
> > >
> > >    1. I'm curious on how you would estimate the effort to port
> > >    datasketches to Flink? It already has a Java API, but how difficult
> > >    would it be to subdivide the tasks into parallel chunks of work?
> > >    Since it's already ported on Pig, I think we could use this port as
> > >    a baseline.
> > >    2. Do you have any idea who is usually driving the adoptions?
> > >
> > >
> > > [1] https://flink-packages.org/
> > >
> > > On Sun, Apr 26, 2020 at 8:07 AM leerho <[email protected]> wrote:
> > >
> > > > Hello All,
> > > >
> > > > I am a committer on DataSketches.apache.org
> > > > <http://datasketches.apache.org/> and am just learning about Flink.
> > > > Since Flink is designed for stateful stream processing, I would think
> > > > it would make sense to have the DataSketches library integrated into
> > > > its core so all users of Flink could take advantage of these advanced
> > > > streaming algorithms.  If there is interest in the Flink community for
> > > > this capability, please contact us at [email protected] or on our
> > > > datasketches-dev Slack channel.
> > > > Cheers,
> > > > Lee.
> > > >
> > >
> > >
> > > --
> > >
> > > Arvid Heise | Senior Java Developer
> > >
> > > <https://www.ververica.com/>
> > >
> > > Follow us @VervericaData
> > >
> > > --
> > >
> > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> > > Conference
> > >
> > > Stream Processing | Event Driven | Real Time
> > >
> > > --
> > >
> > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> > >
> > > --
> > > Ververica GmbH
> > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> > > (Toni) Cheng
> > >
>
