Kontinuation opened a new pull request, #442:
URL: https://github.com/apache/sedona-db/pull/442
This implements part of #436 . Spatial partitioners is the core component of
the spatial partitioned spatial join. We need to collect samples of geospatial
objects to build the spatial partitioning grid.
The goal is that we collect enough number of samples to create a high
quality spatial partition even for small datasets, while not collecting too
many samples for large datasets to avoid running out of memory. The sampling
should be uniform, so that the collected samples could faithfully represent the
distribution of the entire dataset. The sampler should only go through the
sampled stream in one single pass, since evaluating the sampled stream multiple
times may trigger repeated computations of upstream physical operators.
The sampling algorithm we adopted is a combination of reservoir sampling and
Bernoulli sampling: it collects at least $N_\text{min}$ , at most
$N_\text{max}$ samples per partition, and make sure that the sampling rate
won’t go below $R$ before hitting $N_\text{max}$.
The algorithm maintains a set of sampled envelopes $S$, and will go through
4 stages as the number of rows seen $k$ proceeds:
- **Stage 1 - Filling the small reservoir**: When $k < N_\text{min}$, simply
add the envelope of the geometry to $S$
- **Stage 2 - Small reservoir sampling**: when $N_\text{min} \leq k <
\dfrac{N_\text{min}}{R}$, use [[reservoir
sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)](https://en.wikipedia.org/wiki/Reservoir_sampling)
method to maintain a fixed number of samples ($N_\text{min}$) in $S$
- **Stage 3 - Bernoulli sampling**: when $k \geq \dfrac{N_\text{min}}{R}
\land ||S|| < N_\text{max}$, use Bernoulli sampling to determine if we accept
the next sample or not. $S$ starts to grow in this stage.
- **Stage 4 - Large reservoir sampling**: when $||S|| = N_\text{max}$, use
reservoir sampling method to maintain a fixed number of samples
($N_\text{max}$) in $S$
This algorithm guarantees that:
1. **Collect enough samples even for small partitions**: If number of rows
in a partition is not less than $N_\text{min}$, at least $N_\text{min}$ samples
will be collected. If number of rows in a partition is less than
$N_\text{min}$, all rows will be collected as samples.
2. **Won’t collect too many samples for large partitions**: $||S||$ will
never exceed $N_\text{max}$, no matter how large the partition is.
3. **Uniform sampling**: The samples are uniformly sampled even though the
algorithm is composed by 4 distinct stages. This is trivial to prove.
Here is a figure illustrating the 4 stages of the sampling algorithm, it
shows which stage is used to sample each portion of the row stream. We take
$N_\text{min} = 1000$, $N_\text{max} = 10000$, $R = 0.01$ as an example.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]