[incubator-datasketches-website] branch asf-site updated: Automatic Site Publish by Buildbot

git-site-role Tue, 26 May 2020 16:51:10 -0700

This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository 
https://gitbox.apache.org/repos/asf/incubator-datasketches-website.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 75e9e71  Automatic Site Publish by Buildbot
75e9e71 is described below

commit 75e9e71d58fd68566c7e33c1fa3e107516edeab8
Author: buildbot <[email protected]>
AuthorDate: Tue May 26 23:50:36 2020 +0000

    Automatic Site Publish by Buildbot
---
 output/docs/Community/KDD_Tutorial_Summary.html | 166 ++++++++++++++++++++++++
 1 file changed, 166 insertions(+)

diff --git a/output/docs/Community/KDD_Tutorial_Summary.html 
b/output/docs/Community/KDD_Tutorial_Summary.html
index a22dc5c..d70bc70 100644
--- a/output/docs/Community/KDD_Tutorial_Summary.html
+++ b/output/docs/Community/KDD_Tutorial_Summary.html
@@ -61,18 +61,184 @@
 
 <h2 id="abstract">Abstract</h2>
 
+<p>Speed, cost, and scale. These are 3 of the biggest challenges in analyzing 
big data. While modern data systems continue to push the boundaries of scale, 
the problems of speed and cost are fundamentally tied to the size of data being 
scanned or processed. Processing thousands of queries that each access 
terabytes of data with sub-second latency remains infeasible. Data sketching 
techniques provide means to drastically reduce this size, allowing for 
real-time or interactive data analysi [...]
+
+<p>This tutorial covers a number of useful data sketching and sampling methods 
and demonstrate their use using the Apache DataSketches project. We focus 
particularly on common analytic problems such as counting distinct items, 
quantiles, histograms, heavy hitters, and aggregations with large group-bys. 
For these, we covers algorithms, techniques, and theory that can aid both 
practitioners and theorists in constructing sketches and designing systems that 
achieve desired error guarantees.  [...]
+
 <h2 id="audience">Audience</h2>
 
+<p>This tutorial targets researchers, data systems and infrastructure 
engineers, and data scientists interested in greatly speeding up or reducing 
the cost of processing big data sets in practice. It is also useful to theory 
oriented CS researchers who are interested in statistical techniques that are 
typically outside CS curricula.</p>
+
+<p>The audience is expected to have a familiarity of probability and 
statistics that is typical for an undergraduate mathematical statistics or 
introductory graduate machine learning course.</p>
+
 <h2 id="outline">Outline</h2>
 
+<p>The tutorial will consist of two parts. The first focuses on methods and 
theory for data sketching and sampling. The second focuses on application and 
includes code examples using the Apache DataSketches project.</p>
+
+<p>The audience should learn about</p>
+
+<ul>
+  <li>techniques used to construct sketches such as sampling, quantization, 
and random projections</li>
+  <li>statistical techniques for extracting information from and theoretically 
analyzing sketches</li>
+  <li>problems that sketches are useful for</li>
+  <li>the current state-of-the-art sketch for the problem</li>
+  <li>inherent trade-offs when using sketches</li>
+  <li>example applications of data sketches</li>
+  <li>how to design systems to use sketches</li>
+  <li>how to easily incorporate sketches using Apache DataSketches</li>
+  <li>how to deal with error in practice</li>
+</ul>
+
+<h3 id="details">Details</h3>
+
+<ol>
+  <li>Introduction to data sketches (20 minutes)
+    <ol>
+      <li>Definition</li>
+      <li>Applications and motivation</li>
+      <li>Examples</li>
+      <li>Sketch components
+        <ol>
+          <li>Construction</li>
+          <li>Representation</li>
+          <li>Estimation</li>
+        </ol>
+      </li>
+      <li>Optimality</li>
+      <li>Mergeability and distributed processing</li>
+      <li>Space vs accuracy measures</li>
+      <li>Flexibility versus space</li>
+    </ol>
+  </li>
+  <li>Data sketches (90 minutes)
+    <ol>
+      <li>Sums + group by
+        <ol>
+          <li>Count-Min and AGMS</li>
+          <li>More accurate estimates using background distributions</li>
+          <li>Linear sketches and inner products</li>
+        </ol>
+      </li>
+      <li>Frequent item
+        <ol>
+          <li>Misra-Gries</li>
+          <li>Extensions</li>
+        </ol>
+      </li>
+      <li>Subset sums
+        <ol>
+          <li>Priority sampling and VarOpt</li>
+          <li>Unbiased space saving</li>
+        </ol>
+      </li>
+      <li>Distinct Counting
+        <ol>
+          <li>HyperLogLog and Theta sketch</li>
+          <li>Streaming vs distributed</li>
+          <li>Intersections and Unions</li>
+          <li>Many counters</li>
+        </ol>
+      </li>
+      <li>Quantiles
+        <ol>
+          <li>Manku-Rajagopalan-Lindsay and Karnin-Liberty-Lang</li>
+          <li>IID streams</li>
+        </ol>
+      </li>
+      <li>Approximate set membership
+        <ol>
+          <li>Bloom filters</li>
+          <li>Cuckoo filters</li>
+          <li>XOR filters</li>
+        </ol>
+      </li>
+      <li>Matrix sketches</li>
+    </ol>
+  </li>
+  <li>Applications (60 minutes)
+    <ol>
+      <li>Sketch-based architectures</li>
+      <li>Examples
+        <ol>
+          <li>Case Studies</li>
+          <li>Sketch-enabled Packages</li>
+        </ol>
+      </li>
+      <li>Practical Usage
+        <ol>
+          <li>Implementation subtlety and challenges</li>
+          <li>Accepting approximation</li>
+          <li>Understanding error</li>
+          <li>System planning</li>
+        </ol>
+      </li>
+      <li>Demonstration in python
+        <ol>
+          <li>Examples of several sketches</li>
+          <li>Deeper dive with sampling</li>
+        </ol>
+      </li>
+    </ol>
+  </li>
+  <li>Extra topics (Time permitting) (10 minutes)
+    <ol>
+      <li>Privacy using sketches % worth making time w/ GDPR/CCPA/etc</li>
+      <li>Adversarial attacks</li>
+      <li>Active research areas % not sure if  this is interesting?</li>
+    </ol>
+  </li>
+</ol>
+
 <h2 id="presentor-bios">Presentor Bios</h2>
 
+<p>Daniel Ting is a researcher in Tableau working primarily on data sketching 
with sketching work published in KDD, SIGMOD, and NeurIPS. His work in the area 
was initially inspired by problems he encountered while on Facebook’s core data 
science team where he built systems for large scale online experimentation. He 
obtained his PhD in Statistics from U.C. Berkeley.</p>
+
+<p>Jon Malkin is a senior principal research engineer at Verizon Media and a 
contributor to the Apache DataSketches project. He has experience with 
large-scale data processing, both brute-force and with sketches, from roles in 
computational advertising and website traffic analytics. He obtained his PhD in 
Electrical Engineering from the University of Washington.</p>
+
 <h2 id="contributor-bio">Contributor Bio</h2>
 
+<p>Lee Rhodes is a Distinguished Architect at Yahoo (now Verizon Media). He 
created the DataSketches project in 2012 to address analysis problems in 
Yahoo’s large data processing pipelines. DataSketches was Open Sourced in 2015 
and is now in incubation at Apache Software Foundation. He was an author or 
coauthor on sketching work published in ICDT, IMC, and JCGS. He obtained his 
Master’s Degree in Electrical Engineering from Stanford University.</p>
+
 <h2 id="societal-impact">Societal Impact</h2>
 
+<p>Our society is impacted by the ability to do fast data analysis at scale. 
This tutorial aims to advance this ability by demonstrating how sketching can 
help and by making sketching more accessible to practitioners. It is also 
intended to influence future research in data sketches by putting more emphasis 
on practical algorithms and the importance of constant factors for sketching 
and by introducing statistical techniques that help solve these problems.</p>
+
 <h2 id="references">References</h2>
 
+<ol>
+  <li>P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. 
Mergeable Summaries. ACM Transactions on Database Systems, 38(4):26, 2013.</li>
+  <li>N. Alon, Y. Matias, and M. Szegedy. The space complexity of 
approximating the frequency moments. Journal of Computer and System Sciences, 
58(1):137–147, 1999.</li>
+  <li>Apache. Datasketches (incubating). https://datasketches.apache.org, 
2020.</li>
+  <li>Apache. Druid. https://druid.apache.org, 2020.</li>
+  <li>A. Broder and M. Mitzenmacher. Network applications of bloom filters: A 
survey. Internet mathematics, 1(4):485–509, 2004.</li>
+  <li>E. Cohen. All-distances sketches, revisited: Hip estimators for massive 
graphsanalysis.IEEE Transactions on Knowledge and Data Engineering, 
27(9):2320–2334, 2015.</li>
+  <li>E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 
2015.</li>
+  <li>E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Stream 
sampling for variance-optimal estimation of subset sums. In Proceedings of the 
twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 1255–1264. 
SIAM, 2009.</li>
+  <li>G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for 
massive data: Samples, histograms, wavelets, sketches. Foundations and Trends 
in Databases, 4(1–3):1–294, 2012.
+Data Sketching for Real Time Analytics: Theory and Practice. KDD’20, August 
2020, San Diego, CA.</li>
+  <li>G. Cormode and S. Muthukrishnan.  An improved data stream summary: the 
count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 
2005.</li>
+  <li>G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward decay: A 
practicaltime decay model for streaming systems. In ICDE, pages 138–149. IEEE, 
2009.</li>
+  <li>G. Cormode and P. Veselý.  Tight lower bound for comparison-based 
quantile summaries. 2019.</li>
+  <li>A. Dasgupta, K. J. Lang, L. Rhodes, and J. Thaler.  A framework for 
estimating stream expression cardinalities. In ICDT, 2016.</li>
+  <li>N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of 
arbitrarysubset sums. Journal of the ACM (JACM), 54(6):32, 2007.</li>
+  <li>O. Ertl. New cardinality estimation algorithms for hyperloglog sketches. 
arXiv preprintm arXiv:1702.01284, 2017.</li>
+  <li>B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo 
filter:Practically better than bloom. In Proceedings of the 10th ACM 
International on Conference on emerging Networking Experiments and 
Technologies, pages 75–88, 2014.</li>
+  <li>M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent 
directions algorithm for sparse matrices. KDD, 2016.</li>
+  <li>M. Greenwald and S. Khanna.  Space-efficient online computation of 
quantile summaries.SIGMOD, 2001.</li>
+  <li>Z. Karnin, K. Lang, and E. Liberty. Optimal quantile approximation in 
streams. In FOCS. IEEE, 2016.</li>
+  <li>E. Liberty. Simple and deterministic matrix sketching. In KDD. ACM, 
2013.</li>
+  <li>G. S. Manku, S. Rajagopalan, and B. Lindsay. Random sampling techniques 
for space efficient online computation of order statistics of large datasets. 
SIGMOD, 1999.</li>
+  <li>J. Misra and D. Gries. Finding repeated elements. Science of computer 
programming, 2(2):143–152, 1982.</li>
+  <li>A. Shrivastava, A. C. Konig, and M. Bilenko. Time adaptive sketches 
(ada-sketches) for summarizing data streams. In Proceedings of the 2016 
International Conference on Management of Data, pages 1417–1432, 2016.</li>
+  <li>M. Szegedy.  The DLT priority sampling is essentially optimal. In STOC, 
pp. 150–158. ACM, 2006.</li>
+  <li>D. Ting. Streamed approximate counting of distinct elements: beating 
optimal batch methods. In KDD, 2014.</li>
+  <li>D. Ting. Towards optimal cardinality estimation of unions and 
intersections withsketches.  In Proceedings of the 22nd ACM SIGKDD 
International Conference on Knowledge Discovery and Data Mining, pages 
1195–1204. ACM, 2016.</li>
+  <li>D. Ting. Count-min: Optimal estimation and tight error bounds using 
empirical error distributions. In Proceedings of the 24th ACM SIGKDD 
International Conference on Knowledge Discovery &amp; Data Mining, pages 
2319–2328. ACM, 2018.</li>
+  <li>D. Ting. Data sketches for disaggregated subset sum and frequent item 
estimation. In Proceedings of the 2018 International Conference on Management 
of Data, pp. 1129–1140. ACM, 2018.</li>
+  <li>D. Ting and E. Brochu. Optimal subsampling with influence functions. In 
Advances in Neural Information Processing Systems, pp. 3650–3659, 2018.</li>
+  <li>L. Wang, G. Luo, K. Yi, and G. Cormode. Quantiles over data streams: an 
experimental study. In SIGMOD, 2013.</li>
+</ol>
+
     </div>
   </div>
 </div>


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[incubator-datasketches-website] branch asf-site updated: Automatic Site Publish by Buildbot

Reply via email to