This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository
https://gitbox.apache.org/repos/asf/incubator-datasketches-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 75e9e71 Automatic Site Publish by Buildbot
75e9e71 is described below
commit 75e9e71d58fd68566c7e33c1fa3e107516edeab8
Author: buildbot <[email protected]>
AuthorDate: Tue May 26 23:50:36 2020 +0000
Automatic Site Publish by Buildbot
---
output/docs/Community/KDD_Tutorial_Summary.html | 166 ++++++++++++++++++++++++
1 file changed, 166 insertions(+)
diff --git a/output/docs/Community/KDD_Tutorial_Summary.html
b/output/docs/Community/KDD_Tutorial_Summary.html
index a22dc5c..d70bc70 100644
--- a/output/docs/Community/KDD_Tutorial_Summary.html
+++ b/output/docs/Community/KDD_Tutorial_Summary.html
@@ -61,18 +61,184 @@
<h2 id="abstract">Abstract</h2>
+<p>Speed, cost, and scale. These are 3 of the biggest challenges in analyzing
big data. While modern data systems continue to push the boundaries of scale,
the problems of speed and cost are fundamentally tied to the size of data being
scanned or processed. Processing thousands of queries that each access
terabytes of data with sub-second latency remains infeasible. Data sketching
techniques provide means to drastically reduce this size, allowing for
real-time or interactive data analysi [...]
+
+<p>This tutorial covers a number of useful data sketching and sampling methods
and demonstrate their use using the Apache DataSketches project. We focus
particularly on common analytic problems such as counting distinct items,
quantiles, histograms, heavy hitters, and aggregations with large group-bys.
For these, we covers algorithms, techniques, and theory that can aid both
practitioners and theorists in constructing sketches and designing systems that
achieve desired error guarantees. [...]
+
<h2 id="audience">Audience</h2>
+<p>This tutorial targets researchers, data systems and infrastructure
engineers, and data scientists interested in greatly speeding up or reducing
the cost of processing big data sets in practice. It is also useful to theory
oriented CS researchers who are interested in statistical techniques that are
typically outside CS curricula.</p>
+
+<p>The audience is expected to have a familiarity of probability and
statistics that is typical for an undergraduate mathematical statistics or
introductory graduate machine learning course.</p>
+
<h2 id="outline">Outline</h2>
+<p>The tutorial will consist of two parts. The first focuses on methods and
theory for data sketching and sampling. The second focuses on application and
includes code examples using the Apache DataSketches project.</p>
+
+<p>The audience should learn about</p>
+
+<ul>
+ <li>techniques used to construct sketches such as sampling, quantization,
and random projections</li>
+ <li>statistical techniques for extracting information from and theoretically
analyzing sketches</li>
+ <li>problems that sketches are useful for</li>
+ <li>the current state-of-the-art sketch for the problem</li>
+ <li>inherent trade-offs when using sketches</li>
+ <li>example applications of data sketches</li>
+ <li>how to design systems to use sketches</li>
+ <li>how to easily incorporate sketches using Apache DataSketches</li>
+ <li>how to deal with error in practice</li>
+</ul>
+
+<h3 id="details">Details</h3>
+
+<ol>
+ <li>Introduction to data sketches (20 minutes)
+ <ol>
+ <li>Definition</li>
+ <li>Applications and motivation</li>
+ <li>Examples</li>
+ <li>Sketch components
+ <ol>
+ <li>Construction</li>
+ <li>Representation</li>
+ <li>Estimation</li>
+ </ol>
+ </li>
+ <li>Optimality</li>
+ <li>Mergeability and distributed processing</li>
+ <li>Space vs accuracy measures</li>
+ <li>Flexibility versus space</li>
+ </ol>
+ </li>
+ <li>Data sketches (90 minutes)
+ <ol>
+ <li>Sums + group by
+ <ol>
+ <li>Count-Min and AGMS</li>
+ <li>More accurate estimates using background distributions</li>
+ <li>Linear sketches and inner products</li>
+ </ol>
+ </li>
+ <li>Frequent item
+ <ol>
+ <li>Misra-Gries</li>
+ <li>Extensions</li>
+ </ol>
+ </li>
+ <li>Subset sums
+ <ol>
+ <li>Priority sampling and VarOpt</li>
+ <li>Unbiased space saving</li>
+ </ol>
+ </li>
+ <li>Distinct Counting
+ <ol>
+ <li>HyperLogLog and Theta sketch</li>
+ <li>Streaming vs distributed</li>
+ <li>Intersections and Unions</li>
+ <li>Many counters</li>
+ </ol>
+ </li>
+ <li>Quantiles
+ <ol>
+ <li>Manku-Rajagopalan-Lindsay and Karnin-Liberty-Lang</li>
+ <li>IID streams</li>
+ </ol>
+ </li>
+ <li>Approximate set membership
+ <ol>
+ <li>Bloom filters</li>
+ <li>Cuckoo filters</li>
+ <li>XOR filters</li>
+ </ol>
+ </li>
+ <li>Matrix sketches</li>
+ </ol>
+ </li>
+ <li>Applications (60 minutes)
+ <ol>
+ <li>Sketch-based architectures</li>
+ <li>Examples
+ <ol>
+ <li>Case Studies</li>
+ <li>Sketch-enabled Packages</li>
+ </ol>
+ </li>
+ <li>Practical Usage
+ <ol>
+ <li>Implementation subtlety and challenges</li>
+ <li>Accepting approximation</li>
+ <li>Understanding error</li>
+ <li>System planning</li>
+ </ol>
+ </li>
+ <li>Demonstration in python
+ <ol>
+ <li>Examples of several sketches</li>
+ <li>Deeper dive with sampling</li>
+ </ol>
+ </li>
+ </ol>
+ </li>
+ <li>Extra topics (Time permitting) (10 minutes)
+ <ol>
+ <li>Privacy using sketches % worth making time w/ GDPR/CCPA/etc</li>
+ <li>Adversarial attacks</li>
+ <li>Active research areas % not sure if this is interesting?</li>
+ </ol>
+ </li>
+</ol>
+
<h2 id="presentor-bios">Presentor Bios</h2>
+<p>Daniel Ting is a researcher in Tableau working primarily on data sketching
with sketching work published in KDD, SIGMOD, and NeurIPS. His work in the area
was initially inspired by problems he encountered while on Facebook’s core data
science team where he built systems for large scale online experimentation. He
obtained his PhD in Statistics from U.C. Berkeley.</p>
+
+<p>Jon Malkin is a senior principal research engineer at Verizon Media and a
contributor to the Apache DataSketches project. He has experience with
large-scale data processing, both brute-force and with sketches, from roles in
computational advertising and website traffic analytics. He obtained his PhD in
Electrical Engineering from the University of Washington.</p>
+
<h2 id="contributor-bio">Contributor Bio</h2>
+<p>Lee Rhodes is a Distinguished Architect at Yahoo (now Verizon Media). He
created the DataSketches project in 2012 to address analysis problems in
Yahoo’s large data processing pipelines. DataSketches was Open Sourced in 2015
and is now in incubation at Apache Software Foundation. He was an author or
coauthor on sketching work published in ICDT, IMC, and JCGS. He obtained his
Master’s Degree in Electrical Engineering from Stanford University.</p>
+
<h2 id="societal-impact">Societal Impact</h2>
+<p>Our society is impacted by the ability to do fast data analysis at scale.
This tutorial aims to advance this ability by demonstrating how sketching can
help and by making sketching more accessible to practitioners. It is also
intended to influence future research in data sketches by putting more emphasis
on practical algorithms and the importance of constant factors for sketching
and by introducing statistical techniques that help solve these problems.</p>
+
<h2 id="references">References</h2>
+<ol>
+ <li>P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi.
Mergeable Summaries. ACM Transactions on Database Systems, 38(4):26, 2013.</li>
+ <li>N. Alon, Y. Matias, and M. Szegedy. The space complexity of
approximating the frequency moments. Journal of Computer and System Sciences,
58(1):137–147, 1999.</li>
+ <li>Apache. Datasketches (incubating). https://datasketches.apache.org,
2020.</li>
+ <li>Apache. Druid. https://druid.apache.org, 2020.</li>
+ <li>A. Broder and M. Mitzenmacher. Network applications of bloom filters: A
survey. Internet mathematics, 1(4):485–509, 2004.</li>
+ <li>E. Cohen. All-distances sketches, revisited: Hip estimators for massive
graphsanalysis.IEEE Transactions on Knowledge and Data Engineering,
27(9):2320–2334, 2015.</li>
+ <li>E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM,
2015.</li>
+ <li>E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Stream
sampling for variance-optimal estimation of subset sums. In Proceedings of the
twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 1255–1264.
SIAM, 2009.</li>
+ <li>G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for
massive data: Samples, histograms, wavelets, sketches. Foundations and Trends
in Databases, 4(1–3):1–294, 2012.
+Data Sketching for Real Time Analytics: Theory and Practice. KDD’20, August
2020, San Diego, CA.</li>
+ <li>G. Cormode and S. Muthukrishnan. An improved data stream summary: the
count-min sketch and its applications. Journal of Algorithms, 55(1):58–75,
2005.</li>
+ <li>G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward decay: A
practicaltime decay model for streaming systems. In ICDE, pages 138–149. IEEE,
2009.</li>
+ <li>G. Cormode and P. Veselý. Tight lower bound for comparison-based
quantile summaries. 2019.</li>
+ <li>A. Dasgupta, K. J. Lang, L. Rhodes, and J. Thaler. A framework for
estimating stream expression cardinalities. In ICDT, 2016.</li>
+ <li>N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of
arbitrarysubset sums. Journal of the ACM (JACM), 54(6):32, 2007.</li>
+ <li>O. Ertl. New cardinality estimation algorithms for hyperloglog sketches.
arXiv preprintm arXiv:1702.01284, 2017.</li>
+ <li>B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo
filter:Practically better than bloom. In Proceedings of the 10th ACM
International on Conference on emerging Networking Experiments and
Technologies, pages 75–88, 2014.</li>
+ <li>M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent
directions algorithm for sparse matrices. KDD, 2016.</li>
+ <li>M. Greenwald and S. Khanna. Space-efficient online computation of
quantile summaries.SIGMOD, 2001.</li>
+ <li>Z. Karnin, K. Lang, and E. Liberty. Optimal quantile approximation in
streams. In FOCS. IEEE, 2016.</li>
+ <li>E. Liberty. Simple and deterministic matrix sketching. In KDD. ACM,
2013.</li>
+ <li>G. S. Manku, S. Rajagopalan, and B. Lindsay. Random sampling techniques
for space efficient online computation of order statistics of large datasets.
SIGMOD, 1999.</li>
+ <li>J. Misra and D. Gries. Finding repeated elements. Science of computer
programming, 2(2):143–152, 1982.</li>
+ <li>A. Shrivastava, A. C. Konig, and M. Bilenko. Time adaptive sketches
(ada-sketches) for summarizing data streams. In Proceedings of the 2016
International Conference on Management of Data, pages 1417–1432, 2016.</li>
+ <li>M. Szegedy. The DLT priority sampling is essentially optimal. In STOC,
pp. 150–158. ACM, 2006.</li>
+ <li>D. Ting. Streamed approximate counting of distinct elements: beating
optimal batch methods. In KDD, 2014.</li>
+ <li>D. Ting. Towards optimal cardinality estimation of unions and
intersections withsketches. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages
1195–1204. ACM, 2016.</li>
+ <li>D. Ting. Count-min: Optimal estimation and tight error bounds using
empirical error distributions. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages
2319–2328. ACM, 2018.</li>
+ <li>D. Ting. Data sketches for disaggregated subset sum and frequent item
estimation. In Proceedings of the 2018 International Conference on Management
of Data, pp. 1129–1140. ACM, 2018.</li>
+ <li>D. Ting and E. Brochu. Optimal subsampling with influence functions. In
Advances in Neural Information Processing Systems, pp. 3650–3659, 2018.</li>
+ <li>L. Wang, G. Luo, K. Yi, and G. Cormode. Quantiles over data streams: an
experimental study. In SIGMOD, 2013.</li>
+</ol>
+
</div>
</div>
</div>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]