Re: Integration of DataSketches into Flink

Gergo Bakos Tue, 05 May 2020 04:34:34 -0700

Hi,

I'm an experienced software developer who is looking for a project (MSc 
computer science). Robert Metzger suggested this post to me. I would be more 
than happy to investigate, design, and probably implement Type Information for 
DataSketches and Sketch UDFs.


Thanks,
Gergo


On 2020/04/29 22:33:55, leerho <[email protected]> wrote: 
> Seth,
> Thanks for the enthusiastic reply.
> 
> However, I have some questions ... and concerns :)
> 
> 1) Create a page on the flink packages website.
> 
> 
> I looked at this website and it raises a number of red flags for me:
> 
>    - There are no instructions anywhere on the site on how to add a listing.
>    - The "Login with Github" button raises security concerns, without any
>    explanation:
>       - Why would I want or need to authorize this site to have "access to
>       my email account"!  Whoa!
>       - This site has registered fewer than 100 GitHub users.  That is a
>       very small number. It seems a lot of GitHub users have the same concerns
>       that I have.
>    - The packages listed are "not endorsed by Apache Flink project or
>    Ververica.  This site is not affiliated with or released by Apache Flink".
>    There is no verification of licensing.
>    - In other words, this site carries zero or even negative weight.  Why
>    would I want to add a listing for our very high quality and properly
>    licensed Apache DataSketches product alongside other listings that are
>    possibly junk?
> 
> 
> 2) Implement Type Information for DataSketches
> 
> 
> In terms of serialization and deserialization, the sketches in our library
> have their own serialization: to and from a byte array, which is also
> language independent across Java, C++ and Python.  How to transport bytes
> from one system to another is system dependent and external to the
> DataSketches library.  Some systems use Base64, or ProtoBuf, or Kryo, or
> Kafka, or whatever.  As long as we can deserialize (or wrap) the same byte
> array that was serialized we are fine.
> 
> If you are asking for metadata about a specific blob of bytes, such as
> which sketch created the blob of bytes, we can perhaps do that, but the
> documentation is not clear about how much metadata is really required,
> because our library does not need it.  So we could use some help here in
> defining what is really required.  Be aware that metadata also increases
> the storage for an object, and we have worked very hard to keep the stored
> size of our sketches very small, because that is one of the key advantages
> of using sketches.  This is also why we don't use Java serialization: it is
> way too heavy!
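
As a minimal sketch of the point above, assuming the sketch's own serializer has already produced an opaque byte array, the transport layer only has to frame those bytes. The one-byte type tag, the `TYPE_TAGS` table, and the `write_framed`/`read_framed` helpers are hypothetical and are not part of DataSketches or Flink; the tag only illustrates how little metadata might be needed to record which sketch created a blob.

```python
import io
import struct

# Hypothetical one-byte type tags; not defined by DataSketches or Flink.
TYPE_TAGS = {"theta": 1, "kll": 2}
TAG_NAMES = {tag: name for name, tag in TYPE_TAGS.items()}


def write_framed(stream, sketch_type, sketch_bytes):
    """Write <1-byte type tag><4-byte big-endian length><opaque sketch bytes>."""
    stream.write(struct.pack(">BI", TYPE_TAGS[sketch_type], len(sketch_bytes)))
    stream.write(sketch_bytes)


def read_framed(stream):
    """Read one framed record and return (sketch_type, sketch_bytes)."""
    header = stream.read(5)
    if len(header) != 5:
        raise ValueError("truncated frame header")
    tag, length = struct.unpack(">BI", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("truncated sketch payload")
    return TAG_NAMES[tag], payload


# Usage: the payload stands in for bytes produced by a sketch's own serializer.
buffer = io.BytesIO()
write_framed(buffer, "theta", b"\x0a\x0b\x0c")
buffer.seek(0)
sketch_type, payload = read_framed(buffer)
```

In this framing the overhead is a fixed five bytes per sketch, and the payload is handed back to the library untouched for deserializing or wrapping.
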
> 
> 3) Implementing Sketch UDFs
> 
> 
> Thanks for the references, but this was getting way too deep into the weeds
> for me right now.  I would suggest we start simple and then build these
> UDF's later, as they seem optional, if I understand your comments correctly.
> 
> I would suggest we set up a video call with a couple of your key developers
> that could steer us quickly through the options.
> 
> Please be aware that we are *extremely* resource limited (Flink is at least
> 10 times our size), so we could use some help in getting started.  What
> would be ideal would be for someone in your community who is interested in
> seeing DataSketches integrated into Flink to work with us on making it
> happen.
> 
> I am looking forward to working with Flink to make this happen.
> 
> Cheers,
> 
> Lee.
> 
> 
> On Mon, Apr 27, 2020 at 2:15 PM Seth Wiesman <[email protected]> wrote:
> 
> > One more point I forgot to mention.
> >
> > Flink SQL supports Hive UDF's[1]. I haven't tested it, but the datasketch
> > hive package should just work out of the box.
> >
> > Seth
> >
> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html
> >
> > On Mon, Apr 27, 2020 at 2:27 PM Seth Wiesman <[email protected]> wrote:
> >
> > > Hi Lee,
> > >
> > > I really like this project, I used it with Flink a few years ago when it
> > > was still Yahoo DataSketches. The projects clearly complement each other.
> > > As Arvid mentioned, the Flink community is trying to foster an ecosystem
> > > larger than what is in the main Flink repository. The reason is that the
> > > project has grown to such a scale that it cannot reasonably maintain
> > > everything. To encourage that sort of growth, Flink is extensively
> > > pluggable which means that components do not need to live within the main
> > > repository to be treated first-class.
> > >
> > > I'd like to outline some things the DataSketch community could do to
> > > integrate with Flink.
> > >
> > > 1) Create a page on the flink packages website.
> > >
> > > The Flink community hosts a website called Flink Packages to increase the
> > > visibility of ecosystem projects with the Flink user base[1]. Datasketches
> > > are usable from Flink today, so I'd encourage you to create a page right
> > > away.
> > >
> > > 2) Implement TypeInformation for DataSketches
> > >
> > > TypeInformation is Flink's internal type system and is used as a factory
> > > for creating serializers for different types. These serializers are what
> > > Flink uses when shuffling data around the cluster and when storing records
> > > in state backends as state. You would provide type information instances
> > > for the different sketch types, which would just be wrappers around the
> > > existing serializers in the data sketch codebase. This should be relatively
> > > straightforward. There is no DataStream aggregation API in the way you are
> > > describing, so this is the *only* step you would need to take to provide
> > > first-class support for the Flink DataStream API[2][3].
> > >
> > > 3) Implement sketch UDFs
> > >
> > > Along with its Java API, Flink also offers a relational API and UDFs. The
> > > community could provide UDFs for datasketches, as was done for Hive. Doing
> > > so only requires implementing the aggregation function interface[4]. Flink
> > > SQL offers the concept of modules, which are collections of SQL UDFs that
> > > can easily be loaded into the system[5]. A DataSketch SQL module would
> > > provide a simple way for users to get started and expose these UDFs as if
> > > they were native to Flink.
> > >
> > > I hope this helps, I look forward to watching the DataSketch community
> > > grow!
> > >
> > > Seth
> > >
> > > [1] https://flink-packages.org/
> > > [2]
> > >
> > https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
> > > [3]
> > >
> > https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
> > > [4]
> > >
> > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
> > > [5]
> > >
> > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html
> > >
> > >
> > > On Mon, Apr 27, 2020 at 12:57 PM Flavio Pompermaier <[email protected]> wrote:
> > >
> > > >> If this can encourage Lee, I'm one of the Flink users that already use
> > > >> datasketches and I found it an amazing library.
> > > >> When I was trying it out (last year) I tried to stimulate some
> > > >> discussion[1], but at that time it was probably too early.
> > > >> I really hope that now things are mature for both communities!
> > > >>
> > > >> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html
> > >>
> > >> Best,
> > >> Flavio
> > >>
> > >> On Mon, Apr 27, 2020 at 7:37 PM leerho <[email protected]> wrote:
> > >>
> > >> > Hi Arvid,
> > >> >
> > > >> > Note: I am dual listing this thread on both dev lists for better
> > > >> > tracking.
> > > >> >
> > > >> > >    1. I'm curious on how you would estimate the effort to port
> > > >> > >    datasketches to Flink? It already has a Java API, but how difficult
> > > >> > >    would it be to subdivide the tasks into parallel chunks of work?
> > > >> > >    Since it's already ported on Pig, I think we could use this port as
> > > >> > >    a baseline
> > >> >
> > >> >
> > > >> > Most systems (including systems like Druid, Hive, Pig, Spark, PostgreSQL,
> > > >> > Databases, Streaming Platforms, Map-Reduce Platforms, etc) have some sort
> > > >> > of aggregation API, which allows users to plug in custom aggregation
> > > >> > functions.  Typical API functions found in these APIs are Initialize(),
> > > >> > Update() (or Add()), Merge(), and getResult().  How these are named and
> > > >> > how they operate varies considerably from system to system.  These APIs
> > > >> > are sometimes called User Defined Functions (UDFs) or User Defined
> > > >> > Aggregation Functions (UDAFs).
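
The lifecycle described above can be illustrated with a toy example. This is a simplified K-Minimum-Values distinct counter written only for illustration; it is not a DataSketches class, and the name `ToyKmvSketch` and its method names are assumptions chosen to mirror Initialize(), Update(), Merge(), and getResult().

```python
import bisect
import hashlib


class ToyKmvSketch:
    """Toy distinct-count sketch that retains the k smallest hash values."""

    def __init__(self, k=256):
        # Initialize(): create an empty aggregation state.
        self.k = k
        self.mins = []  # sorted list of at most k distinct hash values

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1].
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _insert(self, h):
        i = bisect.bisect_left(self.mins, h)
        if i < len(self.mins) and self.mins[i] == h:
            return  # already retained (duplicate item)
        if i >= self.k:
            return  # sketch is full and h is larger than every retained value
        self.mins.insert(i, h)
        if len(self.mins) > self.k:
            self.mins.pop()  # drop the largest so only k values remain

    def update(self, item):
        # Update() / Add(): fold one input item into the state.
        self._insert(self._hash(item))

    def merge(self, other):
        # Merge(): combine a partial state computed on another partition.
        for h in other.mins:
            self._insert(h)
        return self

    def get_result(self):
        # getResult(): exact count while under capacity, estimate afterwards.
        if len(self.mins) < self.k:
            return float(len(self.mins))
        return (self.k - 1) / self.mins[-1]


# Usage: two partitions are aggregated independently and then merged.
left, right = ToyKmvSketch(), ToyKmvSketch()
for user_id in range(0, 1000):
    left.update(user_id)
for user_id in range(500, 1500):
    right.update(user_id)
estimate = left.merge(right).get_result()
```

A UDAF wrapper for a given system would map that system's own callbacks onto these four operations. Merging two partial sketches leaves the same state as feeding every item to a single sketch, which is what makes the aggregation safe to split across parallel tasks.
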
> > >> >
> > > >> > DataSketches is a library of Sketching (streaming) aggregation functions,
> > > >> > each of which performs a specific type of aggregation. For example,
> > > >> > counting unique items, determining quantiles and histograms of unknown
> > > >> > distributions, identifying most frequent items (heavy hitters) from a
> > > >> > stream, etc.   The advantage of using DataSketches is that they are
> > > >> > extremely fast, small in size, and have well-defined error properties
> > > >> > established in published scientific papers that describe the underlying
> > > >> > mathematics.
> > >> >
> > > >> > The task of porting DataSketches is usually developing a thin wrapping
> > > >> > layer that translates the specific UDAF API of Flink to the equivalent API
> > > >> > methods of the targeted sketches in the library.  This is best done by
> > > >> > someone with deep knowledge of the UDAF code of the targeted system.  We
> > > >> > are certainly available to answer questions about the DataSketches APIs.
> > > >> > Although we did write the UDAF layers for Hive and Pig, we did that as a
> > > >> > proof of concept and example on how to write such layers.  We are a small
> > > >> > team and are not in a position to support these integration layers for
> > > >> > every system out there.
> > >> >
> > > >> > >    2. Do you have any idea who is usually driving the adoptions?
> > >> >
> > >> >
> > > >> > To start, you only need to write the UDAF layer for the sketches that you
> > > >> > think would be in most demand by your users.  The big 4 categories are
> > > >> > distinct (unique) counting, quantiles, frequent-items, and sampling.  This
> > > >> > is a natural way of subdividing the task: choose the sketches you want to
> > > >> > adapt and in what order.  Each sketch is independent so it can be adapted
> > > >> > whenever it is needed.
> > >> >
> > >> > Please let us know if you have any further questions :)
> > >> >
> > >> > Lee.
> > >> >
> > >> >
> > >> >
> > >> >
> > > >> > On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <[email protected]> wrote:
> > >> >
> > >> > > Hi Lee,
> > >> > >
> > > >> > > I must admit that I also heard of data sketches for the first time (there
> > > >> > > are really many Apache projects).
> > > >> > >
> > > >> > > Datasketches sounds really exciting. As a (former) data engineer, I can
> > > >> > > 100% say that this is something that (end-)users want and need and it would
> > > >> > > make so much sense to have it in Flink from the get-go.
> > > >> > > Flink, however, is a quite old project already, which grew at a strong pace
> > > >> > > leading to some 150 modules in the core. We are currently in the process of
> > > >> > > restructuring that and reducing the number of things in the core, such that
> > > >> > > build times and stability improve.
> > > >> > >
> > > >> > > To counter that we created Flink packages [1], which includes everything
> > > >> > > new that we deem to not be essential. I'd propose to incorporate a Flink
> > > >> > > datasketch package there. If it seems like it's becoming essential, we can
> > > >> > > still move it to core at a later point.
> > > >> > >
> > > >> > > As I have seen on the page, there are already plenty of adoptions. That
> > > >> > > leaves a few questions to me.
> > > >> > >
> > > >> > >    1. I'm curious on how you would estimate the effort to port datasketches
> > > >> > >    to Flink? It already has a Java API, but how difficult would it be to
> > > >> > >    subdivide the tasks into parallel chunks of work? Since it's already
> > > >> > >    ported to Pig, I think we could use this port as a baseline.
> > > >> > >    2. Do you have any idea who is usually driving the adoptions?
> > >> > >
> > >> > >
> > >> > > [1] https://flink-packages.org/
> > >> > >
> > >> > > On Sun, Apr 26, 2020 at 8:07 AM leerho <[email protected]> wrote:
> > >> > >
> > >> > > > Hello All,
> > >> > > >
> > > >> > > > I am a committer on DataSketches.apache.org
> > > >> > > > <http://datasketches.apache.org/> and just learning about Flink.  Since
> > > >> > > > Flink is designed for stateful stream processing I would think it would
> > > >> > > > make sense to have the DataSketches library integrated into its core so all
> > > >> > > > users of Flink could take advantage of these advanced streaming
> > > >> > > > algorithms.  If there is interest in the Flink community for this
> > > >> > > > capability, please contact us at [email protected] or on our
> > > >> > > > datasketches-dev Slack channel.
> > >> > > > Cheers,
> > >> > > > Lee.
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Arvid Heise | Senior Java Developer
> > >> > >
> > >> > > <https://www.ververica.com/>
> > >> > >
> > >> > > Follow us @VervericaData
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> > >> > > Conference
> > >> > >
> > >> > > Stream Processing | Event Driven | Real Time
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> > >> > >
> > >> > > --
> > >> > > Ververica GmbH
> > >> > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > >> > > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
> > >> Ji
> > >> > > (Toni) Cheng
> > >> > >
> > >>
> > >
> >
> 
