Kontinuation commented on issue #854:
URL: https://github.com/apache/sedona/issues/854#issuecomment-1585667724
I propose a different (more complicated) approach:

1. Make the `analyze()` method of `SpatialRDD` take a sample of the RDD. We can integrate the logic of the Poisson sampler into the `StatCalculator` and calculate the boundary, count, and samples in one pass.
2. When running the spatial join physical plan, we call `analyze()` on both sides. Now that we know the boundary and count of both sides, we can apply some heuristics to determine which is the partitioning side (for example, take the one with more records as the partitioning side).
3. Build a spatial partitioning grid using the samples we collected in `analyze()`. Since we also have samples of the other side, we can estimate how many geometries of both sides will fall into each grid cell.
4. When running the `DynamicIndexLookupJudgement`, we can determine which side to build and which side to stream on a per-cell basis.
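To make the four steps concrete, here is a rough, self-contained Python sketch of the idea. It is not Sedona code: every name in it (`Stats`, `analyze`, `choose_partitioning_side`, `cell_of`, `estimate_per_cell`, `plan_build_sides`) is hypothetical. It also assumes that geometries are plain `(x, y)` points, that the grid is a uniform `n x n` grid rather than one of Sedona's real partitioners, and that the per-cell rule is "build the index on the side with fewer estimated geometries", which the proposal does not specify.

```python
import math
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Stats:
    """Result of a one-pass analyze(): count, boundary, and samples."""
    count: int = 0
    min_x: float = math.inf
    min_y: float = math.inf
    max_x: float = -math.inf
    max_y: float = -math.inf
    samples: list = field(default_factory=list)


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def analyze(points, fraction, rng):
    """Step 1: compute count, boundary, and a Poisson sample in one pass."""
    stats = Stats()
    for x, y in points:
        stats.count += 1
        stats.min_x, stats.max_x = min(stats.min_x, x), max(stats.max_x, x)
        stats.min_y, stats.max_y = min(stats.min_y, y), max(stats.max_y, y)
        # Each record is emitted Poisson(fraction) times (with replacement).
        stats.samples.extend([(x, y)] * poisson_draw(fraction, rng))
    return stats


def choose_partitioning_side(left, right):
    """Step 2: example heuristic, partition by the side with more records."""
    return "left" if left.count >= right.count else "right"


def cell_of(point, bounds, n):
    """Map a point to a cell id in a uniform n x n grid (assumed grid type)."""
    min_x, min_y, max_x, max_y = bounds
    width = (max_x - min_x) or 1.0   # avoid division by zero
    height = (max_y - min_y) or 1.0
    col = min(n - 1, max(0, int((point[0] - min_x) / width * n)))
    row = min(n - 1, max(0, int((point[1] - min_y) / height * n)))
    return row * n + col


def estimate_per_cell(stats, bounds, n, fraction):
    """Step 3: scale sample hits per cell up to an estimated record count."""
    hits = Counter(cell_of(p, bounds, n) for p in stats.samples)
    return {cell: c / fraction for cell, c in hits.items()}


def plan_build_sides(left_est, right_est):
    """Step 4: per cell, build on the side with fewer estimated records
    (assumed rule); the other side is streamed."""
    plan = {}
    for cell in left_est.keys() | right_est.keys():
        l, r = left_est.get(cell, 0.0), right_est.get(cell, 0.0)
        plan[cell] = "left" if l <= r else "right"
    return plan
```

In use, one would call `analyze()` on both inputs with the same sampling fraction, pick the partitioning side, compute `estimate_per_cell()` for each side over a shared boundary, and pass the two estimates to `plan_build_sides()`. The estimates are only as good as the sample, so cells with very few sample hits may deserve a fallback rule.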
