Kontinuation opened a new pull request, #442:
URL: https://github.com/apache/sedona-db/pull/442

   This implements part of #436 . Spatial partitioners is the core component of 
the spatial partitioned spatial join. We need to collect samples of geospatial 
objects to build the spatial partitioning grid.
   
   The goal is that we collect enough number of samples to create a high 
quality spatial partition even for small datasets, while not collecting too 
many samples for large datasets to avoid running out of memory. The sampling 
should be uniform, so that the collected samples could faithfully represent the 
distribution of the entire dataset. The sampler should only go through the 
sampled stream in one single pass, since evaluating the sampled stream multiple 
times may trigger repeated computations of upstream physical operators.
   
   The sampling algorithm we adopted is a combination of reservoir sampling and 
Bernoulli sampling: it collects at least $N_\text{min}$ , at most 
$N_\text{max}$ samples per partition, and make sure that the sampling rate 
won’t go below $R$ before hitting $N_\text{max}$.
   
   The algorithm maintains a set of sampled envelopes $S$, and will go through 
4 stages as the number of rows seen $k$ proceeds:
   
   - **Stage 1 - Filling the small reservoir**: When $k < N_\text{min}$, simply 
add the envelope of the geometry to $S$
   - **Stage 2 - Small reservoir sampling**: when $N_\text{min} \leq k < 
\dfrac{N_\text{min}}{R}$, use [[reservoir 
sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)](https://en.wikipedia.org/wiki/Reservoir_sampling)
 method to maintain a fixed number of samples ($N_\text{min}$) in $S$
   - **Stage 3 - Bernoulli sampling**: when $k \geq \dfrac{N_\text{min}}{R} 
\land ||S|| < N_\text{max}$, use Bernoulli sampling to determine if we accept 
the next sample or not. $S$ starts to grow in this stage.
   - **Stage 4 - Large reservoir sampling**: when $||S|| = N_\text{max}$, use 
reservoir sampling method to maintain a fixed number of samples 
($N_\text{max}$) in $S$
   
   This algorithm guarantees that:
   
   1. **Collect enough samples even for small partitions**: If number of rows 
in a partition is not less than $N_\text{min}$, at least $N_\text{min}$ samples 
will be collected. If number of rows in a partition is less than 
$N_\text{min}$, all rows will be collected as samples.
   2. **Won’t collect too many samples for large partitions**: $||S||$ will 
never exceed $N_\text{max}$, no matter how large the partition is.
   3. **Uniform sampling**: The samples are uniformly sampled even though the 
algorithm is composed by 4 distinct stages. This is trivial to prove.
   
   Here is a figure illustrating the 4 stages of the sampling algorithm, it 
shows which stage is used to sample each portion of the row stream. We take 
$N_\text{min} = 1000$, $N_\text{max} = 10000$, $R = 0.01$ as an example.
   
   ![SamplingStages 
drawio](https://github.com/user-attachments/assets/4bafa2d8-6dcb-41fa-b650-41f3ffa98cd7)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to