Hi,

I'm an experienced software developer who is looking for a project (MSc in computer science). Robert Metzger pointed me to this thread. I would be more than happy to investigate, design, and probably implement Type Information for DataSketches and Sketch UDFs.
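
The two pieces named above (a serializer wrapper for Type Information and an aggregation wrapper for Sketch UDFs) can be illustrated with a rough, language-neutral sketch in standard-library Python. This is not the DataSketches or Flink API: `KmvSketch`, `SketchSerializer`, and `DistinctCountAgg` are hypothetical names, and the toy K-Minimum-Values sketch only stands in for a real sketch. The sketch assumes that the serializer does nothing more than delegate to the sketch's own byte-array form, and that the aggregation wrapper follows the initialize / update / merge / get-result shape discussed further down the thread.

```python
import hashlib
import struct


class KmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (hypothetical stand-in)."""

    def __init__(self, k=64, hashes=()):
        self.k = k
        self.hashes = sorted(set(hashes))[:k]  # the k smallest 64-bit hashes

    def update(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = struct.unpack(">Q", digest[:8])[0]
        self.hashes = sorted(set(self.hashes) | {h})[: self.k]

    def merge(self, other):
        # Union of two sketches: keep the k smallest hashes of both.
        self.hashes = sorted(set(self.hashes) | set(other.hashes))[: self.k]

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # below capacity the count is exact
        return (self.k - 1) / (self.hashes[-1] / 2.0**64)

    def to_bytes(self):
        n = len(self.hashes)
        return struct.pack(f">II{n}Q", self.k, n, *self.hashes)

    @classmethod
    def from_bytes(cls, blob):
        k, n = struct.unpack_from(">II", blob, 0)
        return cls(k, struct.unpack_from(f">{n}Q", blob, 8))


class SketchSerializer:
    """Hypothetical serializer wrapper: delegates to the sketch's own bytes."""

    @staticmethod
    def serialize(sketch):
        return sketch.to_bytes()

    @staticmethod
    def deserialize(blob):
        return KmvSketch.from_bytes(blob)


class DistinctCountAgg:
    """Hypothetical UDAF-style wrapper: initialize, update, merge, get result."""

    def create_accumulator(self):
        return KmvSketch()

    def accumulate(self, acc, value):
        acc.update(value)

    def merge(self, acc, others):
        for other in others:
            acc.merge(other)

    def get_value(self, acc):
        return acc.estimate()


agg = DistinctCountAgg()
left, right = agg.create_accumulator(), agg.create_accumulator()
for user in ["ann", "bob", "cat"]:
    agg.accumulate(left, user)
for user in ["cat", "dan"]:
    agg.accumulate(right, user)

shipped = SketchSerializer.serialize(right)  # only bytes cross the wire
agg.merge(left, [SketchSerializer.deserialize(shipped)])
print(agg.get_value(left))  # distinct users seen across both partitions
```

In this toy layout the serialized form is an 8-byte header plus 8 bytes per retained hash, so its size is bounded by `k` and carries no type metadata. How much metadata a real Flink integration would need is exactly the open question raised in the quoted reply below.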
Thanks,
Gergo

On 2020/04/29 22:33:55, leerho <[email protected]> wrote:
> Seth,
> Thanks for the enthusiastic reply.
>
> However, I have some questions ... and concerns :)
>
> 1) Create a page on the flink packages website.
>
> I looked at this website and it raises a number of red flags for me:
>
>    - There are no instructions anywhere on the site on how to add a listing.
>    - The "Login with Github" raises security concerns, and without any
>    explanation:
>    - Why would I want or need to authorize this site to have "access to
>    my email account"! Whoa!
>    - This site has registered fewer than 100 GitHub users. That is a
>    very small number. It seems a lot of GitHub users have the same concerns
>    that I have.
>    - The packages listed are "not endorsed by Apache Flink project or
>    Ververica. This site is not affiliated with or released by Apache Flink".
>    There is no verification of licensing.
>    - In other words, this site carries zero or even negative weight. Why
>    would I want to add a listing for our very high quality and properly
>    licensed Apache DataSketches product alongside other listings that are
>    possibly junk?
>
> 2) Implement Type Information for DataSketches
>
> In terms of serialization and deserialization, the sketches in our library
> have their own serialization: to and from a byte array, which is also
> language independent across Java, C++ and Python. How to transport bytes
> from one system to another is system dependent and external to the
> DataSketches library. Some systems use Base64, or ProtoBuf, or Kryo, or
> Kafka, or whatever. As long as we can deserialize (or wrap) the same byte
> array that was serialized we are fine.
>
> If you are asking for metadata about a specific blob of bytes, such as
> which sketch created the blob of bytes, we can perhaps do that, but the
> documentation is not clear about how much metadata is really required,
> because our library does not need it.
> So we could use some help here in
> defining what is really required. Be aware that metadata also increases
> the storage for an object, and we have worked very hard to keep the stored
> size of our sketches very small, because that is one of the key advantages
> of using sketches. This is also why we don't use Java serialization; it is
> way too heavy!
>
> 3) Implementing Sketch UDFs
>
> Thanks for the references, but this was getting way too deep into the weeds
> for me right now. I would suggest we start simple and then build these
> UDFs later, as they seem optional, if I understand your comments correctly.
>
> I would suggest we set up a video call with a couple of your key developers
> that could steer us quickly through the options.
>
> Please be aware that we are *extremely* resource limited; Flink is at least
> 10 times our size, so we could use some help in getting started. What
> would be ideal would be for someone in your community that is interested in
> seeing DataSketches integrated into Flink to work with us on making it
> happen.
>
> I am looking forward to working with Flink to make this happen.
>
> Cheers,
>
> Lee.
>
>
> On Mon, Apr 27, 2020 at 2:15 PM Seth Wiesman <[email protected]> wrote:
>
> > One more point I forgot to mention.
> >
> > Flink SQL supports Hive UDFs[1]. I haven't tested it, but the datasketch
> > hive package should just work out of the box.
> >
> > Seth
> >
> > [1]
> > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html
> >
> > On Mon, Apr 27, 2020 at 2:27 PM Seth Wiesman <[email protected]> wrote:
> >
> > > Hi Lee,
> > >
> > > I really like this project, I used it with Flink a few years ago when it
> > > was still Yahoo DataSketches. The projects clearly complement each other.
> > > As Arvid mentioned, the Flink community is trying to foster an ecosystem
> > > larger than what is in the main Flink repository.
> > > The reason is that the
> > > project has grown to such a scale that it cannot reasonably maintain
> > > everything. To encourage that sort of growth, Flink is extensively
> > > pluggable, which means that components do not need to live within the main
> > > repository to be treated as first-class.
> > >
> > > I'd like to outline some things the DataSketch community could do to
> > > integrate with Flink.
> > >
> > > 1) Create a page on the flink packages website.
> > >
> > > The Flink community hosts a website called flink packages to increase the
> > > visibility of ecosystem projects with the Flink user base[1]. Datasketches
> > > are usable from Flink today so I'd encourage you to create a page right
> > > away.
> > >
> > > 2) Implement TypeInformation for DataSketches
> > >
> > > TypeInformation is Flink's internal type system and is used as a factory
> > > for creating serializers for different types. These serializers are what
> > > Flink uses when shuffling data around the cluster and when storing records
> > > in state backends as state. Providing type information instances for the
> > > different sketch types, which would just be wrappers around existing
> > > serializers in the data sketch codebase, should be relatively
> > > straightforward. There is no DataStream aggregation API in the way you are
> > > describing so this is the *only* step you would need to take to provide
> > > first-class support for Flink DataStream API[2][3].
> > >
> > > 3) Implement sketch UDFs
> > >
> > > Along with its Java API, Flink also offers a relational API and UDFs. The
> > > community could provide UDFs for datasketches like Hive. To do so only
> > > requires implementing the aggregation function interface[4]. Flink SQL
> > > offers the concept of modules, which are a collection of SQL UDFs that can
> > > easily be loaded in the system[5].
> > > A DataSketch SQL module would provide a
> > > simple way for users to get started and expose these UDFs as if they were
> > > native to Flink.
> > >
> > > I hope this helps, I look forward to watching the DataSketch community
> > > grow!
> > >
> > > Seth
> > >
> > > [1] https://flink-packages.org/
> > > [2]
> > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
> > > [3]
> > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
> > > [4]
> > > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
> > > [5]
> > > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html
> > >
> > >
> > > On Mon, Apr 27, 2020 at 12:57 PM Flavio Pompermaier <[email protected]>
> > > wrote:
> > >
> > >> If this can encourage Lee I'm one of the Flink users that already use
> > >> datasketches and I found it an amazing library.
> > >> When I was trying it out (last year) I tried to stimulate some
> > >> discussion[1]
> > >> but at that time it was probably too early..
> > >> I really hope that now things are mature for both communities!
> > >>
> > >> [1]
> > >> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html
> > >>
> > >> Best,
> > >> Flavio
> > >>
> > >> On Mon, Apr 27, 2020 at 7:37 PM leerho <[email protected]> wrote:
> > >>
> > >> > Hi Arvid,
> > >> >
> > >> > Note: I am dual listing this thread on both dev lists for better
> > >> > tracking.
> > >> >
> > >> > >    1. I'm curious on how you would estimate the effort to port datasketches
> > >> > >    to Flink? It already has a Java API, but how difficult would it be to
> > >> > >    subdivide the tasks into parallel chunks of work?
> > >> > >    Since it's already ported
> > >> > >    on Pig, I think we could use this port as a baseline
> > >> >
> > >> >
> > >> > Most systems (including systems like Druid, Hive, Pig, Spark, PostgreSQL,
> > >> > Databases, Streaming Platforms, Map-Reduce Platforms, etc) have some sort
> > >> > of aggregation API, which allows users to plug in custom aggregation
> > >> > functions. Typical API functions found in these APIs are Initialize(),
> > >> > Update() (or Add()), Merge(), and getResult(). How these are named and
> > >> > operate varies considerably from system to system. These APIs are sometimes
> > >> > called User Defined Functions (UDFs) or User Defined Aggregation Functions
> > >> > (UDAFs).
> > >> >
> > >> > DataSketches is a library of Sketching (streaming) aggregation functions,
> > >> > each of which performs specific types of aggregation. For example, counting
> > >> > unique items, determining quantiles and histograms of unknown
> > >> > distributions, identifying most frequent items (heavy hitters) from a
> > >> > stream, etc. The advantage of using DataSketches is that they are
> > >> > extremely fast, small in size, and have well defined error properties
> > >> > defined by published scientific papers that define the underlying
> > >> > mathematics.
> > >> >
> > >> > The task of porting DataSketches is usually developing a thin wrapping
> > >> > layer that translates the specific UDAF API of Flink to the equivalent API
> > >> > methods of the targeted sketches in the library. This is best done by
> > >> > someone with deep knowledge of the UDAF code of the targeted system. We
> > >> > are certainly available to answer questions about the DataSketches APIs.
> > >> > Although we did write the UDAF layers for Hive and Pig, we did that as a
> > >> > proof of concept and example on how to write such layers.
> > >> > We are a small
> > >> > team and are not in a position to support these integration layers for
> > >> > every system out there.
> > >> >
> > >> > >    2. Do you have any idea who is usually driving the adoptions?
> > >> >
> > >> >
> > >> > To start, you only need to write the UDAF layer for the sketches that you
> > >> > think would be in most demand by your users. The big 4 categories are
> > >> > distinct (unique) counting, quantiles, frequent-items, and sampling. This
> > >> > is a natural way of subdividing the task: choose the sketches you want to
> > >> > adapt and in what order. Each sketch is independent so it can be adapted
> > >> > whenever it is needed.
> > >> >
> > >> > Please let us know if you have any further questions :)
> > >> >
> > >> > Lee.
> > >> >
> > >> >
> > >> > On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <[email protected]> wrote:
> > >> >
> > >> > > Hi Lee,
> > >> > >
> > >> > > I must admit that I also heard of data sketches for the first time (there
> > >> > > are really many Apache projects).
> > >> > >
> > >> > > Datasketches sounds really exciting. As a (former) data engineer, I can
> > >> > > 100% say that this is something that (end-)users want and need and it would
> > >> > > make so much sense to have it in Flink from the get-go.
> > >> > > Flink, however, is a quite old project already, which grew at a strong pace
> > >> > > leading to some 150 modules in the core. We are currently in the process of
> > >> > > restructuring that and reducing the number of things in the core, such that
> > >> > > build times and stability improve.
> > >> > >
> > >> > > To counter that we created Flink packages [1], which includes everything
> > >> > > new that we deem to not be essential. I'd propose to incorporate a Flink
> > >> > > datasketch package there.
> > >> > > If it seems like it's becoming essential, we can
> > >> > > still move it to core at a later point.
> > >> > >
> > >> > > As I have seen on the page, there are already plenty of adoptions. That
> > >> > > leaves a few questions to me.
> > >> > >
> > >> > >    1. I'm curious on how you would estimate the effort to port datasketches
> > >> > >    to Flink? It already has a Java API, but how difficult would it be to
> > >> > >    subdivide the tasks into parallel chunks of work? Since it's already
> > >> > >    ported on Pig, I think we could use this port as a baseline.
> > >> > >    2. Do you have any idea who is usually driving the adoptions?
> > >> > >
> > >> > >
> > >> > > [1] https://flink-packages.org/
> > >> > >
> > >> > > On Sun, Apr 26, 2020 at 8:07 AM leerho <[email protected]> wrote:
> > >> > >
> > >> > > > Hello All,
> > >> > > >
> > >> > > > I am a committer on DataSketches.apache.org
> > >> > > > <http://datasketches.apache.org/> and just learning about Flink. Since
> > >> > > > Flink is designed for stateful stream processing I would think it would
> > >> > > > make sense to have the DataSketches library integrated into its core so all
> > >> > > > users of Flink could take advantage of these advanced streaming
> > >> > > > algorithms. If there is interest in the Flink community for this
> > >> > > > capability, please contact us at [email protected] or on our
> > >> > > > datasketches-dev Slack channel.
> > >> > > > Cheers,
> > >> > > > Lee.
