[this announcement is available online at https://s.apache.org/jhvqu ]

Open Source high-performance Big Data streaming algorithm library in use at 
Nielsen Identity, Permutive, Splice Machine, and Verizon Media, among others.

Wilmington, DE —3 February 2021— The Apache Software Foundation (ASF), the 
all-volunteer developers, stewards, and incubators of more than 350 Open Source 
projects and initiatives, announced today Apache® DataSketches™ as a Top-Level 
Project (TLP).

Apache DataSketches is a highly performant Big Data analysis library for 
scalable approximate algorithms. The project originated at Yahoo in 2012, was 
open-sourced in 2015, and entered the Apache Incubator in March 2019.

"We are excited to be part of the ASF," said Lee Rhodes, Vice President of 
Apache DataSketches. "We have learned a great deal from the incubation process 
and look forward to working with new users of our library that want to take 
advantage of sketching technology."

Apache DataSketches’s library of specialized streaming algorithms —known as 
sketches— comprises small data structures that process data at massive scale. 
Sketches are ideal for queries that cannot afford the time or huge compute 
resources needed to generate exact results. Where approximate results are 
acceptable, sketches are the only viable alternative for interactive queries 
with real-time analysis. Apache DataSketches is:

 - Fast —produces approximate results orders of magnitude faster than 
traditional methods, with a user-configurable size vs. accuracy tradeoff;
 - Efficient —sketch algorithms process data in a single pass for both 
real-time and batch;
 - Mergeable —allows for parallelization;
 - Optimized for large-scale computing environments that process Big Data —such 
as Apache Hadoop, Apache Spark, Apache Druid, Apache Hive, Apache Pig, 
PostgreSQL;
 - Binary compatible across multiple languages and platforms —available in 
Java, C++, and Python;
 - Expanded Analysis —including count distinct with set operations, quantiles, 
most frequent items (heavy hitters), matrix computations, and more; and
 - Mathematically defined and proven error properties —provides a priori and a 
posteriori error estimation and upper and lower bounds with statistically 
derived confidence intervals.
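
As a rough illustration of the single-pass, mergeable, and size vs. accuracy 
properties listed above, the following is a minimal Python sketch of a 
K-Minimum-Values (KMV) distinct counter. It is a toy written for this 
explanation and is not the Apache DataSketches implementation or API: the 
class name KMVSketch, its methods, and the SHA-256 based hash are all 
assumptions chosen for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch for approximate distinct counting.

    Illustrative only: not the Apache DataSketches implementation or API.
    """

    def __init__(self, k=1024):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k              # larger k: more memory, lower error
        self._heap = []         # max-heap (negated) of the k smallest hashes
        self._members = set()   # the same hashes, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a roughly uniform float in [0, 1].
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _insert(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        # Single pass: each item is hashed once and then forgotten.
        self._insert(self._hash(item))

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes retained: the count is exact.
            return float(len(self._heap))
        theta = -self._heap[0]  # k-th smallest hash value seen
        return (self.k - 1) / theta

    def merge(self, other):
        # The k smallest hashes of the union are among both retained sets.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._insert(h)
        return merged
```

In this toy, two sketches built independently over different partitions of a 
stream can be combined with merge(), and the result matches a sketch built 
over the whole stream with the same k. The relative error shrinks roughly in 
proportion to 1/sqrt(k), which is the size vs. accuracy tradeoff in miniature.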

Apache DataSketches is in use at organizations such as Nielsen Identity, 
Permutive, Splice Machine, and Verizon Media, among others, as well as in 
Apache Druid and Apache Pinot (incubating).

"The Apache DataSketches project takes powerful algorithms for data 
summarization and analysis, and makes them available to everyone," said 
Professor Graham Cormode of the University of Warwick. "While these methods 
are tremendously useful in practice, their descriptions were previously only in 
highly technical scientific papers. This project has made robust, dependable 
and well-documented implementations available to all. Already the library has 
been used for a wide range of applications, including service quality, 
monitoring, ad analytics and the sciences."

"Using Apache DataSketches has enabled Apache Druid users to perform common 
tasks such as quantiles and unique counting in a highly performant and 
efficient manner," said Gian Merlino, Vice President of Apache Druid. "We have 
worked closely together over the years to make the power of DataSketches 
accessible to Apache Druid users, helping us provide real-time analytics at 
scale."

"Sketches are fundamental to calculating many of our key company metrics," said 
Tom Miller, Director of Software Development Engineering at Verizon Media. "It 
allows us to greatly simplify our data processing and reduce storage costs by 
allowing us to calculate non-additive metrics across user specified dimension 
combinations at report time instead of having to either retain raw data or 
pre-calculate for each set of dimensions."

"Combining Apache Druid and DataSketches allows us to provide our customers 
real-time insights into their target audiences and advertising campaigns," said 
Yakir Buskilla, Senior Vice President of Research and Development and General 
Manager Israel at Nielsen Identity. "The ability to evaluate set expressions 
makes the Theta Sketch especially powerful for multi-set cardinality estimation 
as well as funnel analysis."

"Apache DataSketches has provided us with a solid theoretical foundation upon 
which we are able to store and process data at scale - in a simple, fast and 
cost-efficient manner," said David Cromberge, Senior Software Engineer at 
Permutive. "It has been a pleasure to engage with their creators and community 
who have been helpful at every step of the way."

"We use DataSketches's Theta-Sketches for distinct-count aggregations that are 
used to solve large multi-set cardinality approximation," said Mayank 
Shrivastava, Committer and member of the Apache Pinot (incubating) Podling 
Project Management Committee. "The ability to evaluate set expressions makes the 
Theta Sketch especially powerful for multi-set cardinality estimation as well 
as funnel analysis."

"We welcome those interested in streaming algorithms to visit us, learn about 
this exciting technology, and contribute to Apache DataSketches to make our 
project even better," added Rhodes.

Availability and Oversight
Apache DataSketches software is released under the Apache License v2.0 and is 
overseen by a self-selected team of active contributors to the project. A 
Project Management Committee (PMC) guides the Project's day-to-day operations, 
including community development and product releases. For downloads, 
documentation, and ways to become involved with Apache DataSketches, visit 
https://datasketches.apache.org .

About the Apache Incubator
The Apache Incubator is the primary entry path for projects and codebases 
wishing to become part of the efforts at The Apache Software Foundation. All 
code donations from external organizations and existing external projects enter 
the ASF through the Incubator to: 1) ensure all donations are in accordance 
with the ASF legal standards; and 2) develop new communities that adhere to our 
guiding principles. Incubation is required of all newly accepted projects until 
a further review indicates that the infrastructure, communications, and 
decision making process have stabilized in a manner consistent with other 
successful ASF projects. While incubation status is not necessarily a 
reflection of the completeness or stability of the code, it does indicate that 
the project has yet to be fully endorsed by the ASF. For more information, 
visit http://incubator.apache.org/ .

About The Apache Software Foundation (ASF)
Established in 1999, The Apache Software Foundation is the world’s largest Open 
Source foundation, stewarding 227M+ lines of code and providing more than $20B 
worth of software to the public at 100% no cost. The ASF’s all-volunteer 
community grew from 21 original founders overseeing the Apache HTTP Server to 
813 individual Members and 206 Project Management Committees who successfully 
lead 350+ Apache projects and initiatives in collaboration with nearly 8,000 
Committers through the ASF’s meritocratic process known as "The Apache Way". 
Apache software is integral to nearly every end user computing device, from 
laptops to tablets to mobile devices across enterprises and mission-critical 
applications. Apache projects power most of the Internet, manage exabytes of 
data, execute teraflops of operations, and store billions of objects in 
virtually every industry. The commercially-friendly and permissive Apache 
License v2 is an Open Source industry standard, helping launch billion dollar 
corporations and benefiting countless users worldwide. The ASF is a US 
501(c)(3) not-for-profit charitable organization funded by individual donations 
and corporate sponsors including Aetna, Alibaba Cloud Computing, Amazon Web 
Services, Anonymous, Baidu, Bloomberg, Budget Direct, Capital One, Cloudera, 
Comcast, Didi Chuxing, Facebook, Google, Handshake, Huawei, IBM, Microsoft, 
Pineapple Fund, Red Hat, Reprise Software, Target, Tencent, Union Investment, 
Verizon Media, and Workday. For more information, visit http://apache.org/ and 
https://twitter.com/TheASF .

© The Apache Software Foundation. "Apache", "DataSketches", "Apache 
DataSketches", "Druid", "Apache Druid", "Hadoop", "Apache Hadoop", "Hive", 
"Apache Hive", "Pig", "Apache Pig", "Pinot (incubating)", "Apache Pinot 
(incubating)", "Spark", "Apache Spark", and "ApacheCon" are registered 
trademarks or trademarks of the Apache Software Foundation in the United States 
and/or other countries. All other brands and trademarks are the property of 
their respective owners.

# # #

