This is an automated email from the ASF dual-hosted git repository.

leerho pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git

commit 0f055bdf65708bf45ba5576f576156daba98a5c7
Author: Lee Rhodes <[email protected]>
AuthorDate: Fri Jan 23 22:07:12 2026 -0800

    Add EB-PPS Sampling Sketches
---
 docs/Sampling/EB-PPS_SamplingSketches.md | 69 ++++++++++++++++++++++++++++++++
 src/main/resources/docgen/toc.json       |  1 +
 2 files changed, 70 insertions(+)

diff --git a/docs/Sampling/EB-PPS_SamplingSketches.md 
b/docs/Sampling/EB-PPS_SamplingSketches.md
new file mode 100644
index 00000000..0eee9068
--- /dev/null
+++ b/docs/Sampling/EB-PPS_SamplingSketches.md
@@ -0,0 +1,69 @@
+---
+layout: doc_page
+---
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+## Contents
+<!-- TOC -->
+* [EB-PPS Sampling Sketches](#eb-pps-sampling-sketch)
+<!-- TOC -->
+
+<a id="eb-pps-sampling-sketch"></a>
+## Exact Bound Probability-Proportional-to-Size (EB-PPS) Sampling Sketches
+
+An implementation inspired by the paper:"Exact PPS Sampling with Bounded 
Sample Size",
+B. Hentschel, P. J. Haas, Y. Tian. Information Processing Letters, 2023.
+
+The EB-PPS Sampling Sketch is a one-pass streaming algorithm that ensures 
Probability-Proportional-to-Size (PPS) sampling while strictly bounding the 
sample size to a target *k*. It maintains the exact PPS property, handles 
imbalanced data efficiently, and is computationally superior for large-scale 
applications compared to other reservoir-type sketches.
+
+Key features and benefits of EB-PPA include:
+
+* **Sample Size Control:** EB-PPS ensures the sample size never exceeds the 
target *k*, making it ideal for scenarios requiring strict storage or time 
constraints.
+* **Optimal Statistical Properties:** When possible, the sample size is 
exactly *k*, while in other cases, it maintains maximal expected value and 
minimal variance.
+* **Efficiency:** As a one-pass streaming algorithm, it is highly efficient, 
which is crucial for processing large, real-time, or streaming data.
+* **Application in Machine Learning:** This method is particularly useful for 
training classifiers on imbalanced datasets, where it helps in overcoming 
issues with biased results.
+* **Comparison with Other Methods:** In the above referenced paper, EB-PPS is 
compared to other methods such as Conditional Poisson Sampling (CPS) and Pareto 
PPS sampling, which also aim to have inclusion probabilities proportional to 
given size measures.
+
+EB-PPS is advantageous for its simplicity, efficiency, and ability to balance 
the need for exact PPS with the constraint of a fixed sample size.
+
+EB-PPS is a modern sampling scheme designed to resolve a fundamental conflict 
in statistical sampling: the inability to simultaneously guarantee exact 
**Probability Proportional to Size (PPS)** and a fixed sample size for all 
datasets. 
+
+### Key Concepts of EB-PPS
+Traditional PPS schemes often prioritize keeping a fixed sample size *k*, 
which sometimes forces them to violate the PPS property (where each item's 
selection probability must be exactly proportional to its weight). EB-PPS flips 
this trade-off: 
+
+* **Strict PPS Property:** It enforces the PPS property at all times, ensuring 
item-inclusion probabilities are exactly proportional to their weights.
+* **Bounded Sample Size:** It uses the target *k* as a strict upper bound 
rather than a fixed requirement.
+* **Optimality:** If the sample size cannot be exactly *k* while maintaining 
strict PPS, the algorithm produces a sample with the maximal possible expected 
size and minimal variance.
+
+### Computational Advantages
+
+EB-PPS is particularly suited for high-volume data environments and 2026+ era 
big data analytics: 
+
+* **One-Pass Streaming:** It processes items sequentially and does not need to 
know the total population size in advance.
+
+* **Efficiency:** It features an amortized processing time of *O(1)* **per 
item**, making it computationally superior to older state-of-the-art algorithms 
like VarOpt, which has *O(\log n)* complexity.
+
+* **Memory Management:** By bounding the sample size, it prevents storage 
overflows during real-time data ingestion.
+
+### Applications in Machine Learning
+
+A primary 2026 application for EB-PPS is training classifiers on imbalanced 
datasets: 
+
+* **Bayes-Optimality:** By maintaining exact PPS, it allows standard training 
techniques (designed for 0-1 loss) to produce Bayes-optimal classifiers even 
when the true loss function is imbalanced.
+* **Superior Performance:** Research indicates that smaller, well-curated 
EB-PPS samples can outperform larger samples from traditional schemes (like 
VarOpt) because they avoid the bias introduced by violating PPS.
diff --git a/src/main/resources/docgen/toc.json 
b/src/main/resources/docgen/toc.json
index 3e9a5e05..a11ba460 100644
--- a/src/main/resources/docgen/toc.json
+++ b/src/main/resources/docgen/toc.json
@@ -63,6 +63,7 @@
 
         { "class":"Dropdown", "desc" : "Sampling", "array":
           [
+            {"class":"Doc",  "desc" : "EB-PPS Sampling Sketches",              
    "dir" : "Sampling", "file": "EB-PPS_SamplingSketches" },
             {"class":"Doc",  "desc" : "Reservoir Sampling Sketches",           
    "dir" : "Sampling", "file": "ReservoirSamplingSketches" },
             {"class":"Doc",  "desc" : "VarOpt Sampling Sketches",              
    "dir" : "Sampling", "file": "VarOptSamplingSketches" },
           ]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to