Hi all,
I'm working on a project which aims to improve performance of spatial join
query in AsterixDB.
The problem can be described as follows:
Assume that we have 2 dataset ParkSet(id, geom) and LakeSet(id, geom) with
geom is a spatial data type (point, rectangle, polygon). This is a join
query to list all intersected pairs of these two datasets:
```
*SELECT COUNT(*) FROM ParkSet AS ps, LakeSet AS lsWHERE
spatial_intersect(ps.geom, ls.geom);*
*```*
I rewrote this query in a more complicated SQL++ query(*) but has a better
performance.
I took a look at FuzzyJoinRule
<
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
implementation
<
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
(a
rule for AQL+) in AsterixDB and I'm trying to make a similar implementation
called SpatalJoinRule
<
https://github.com/tinvukhac/asterixdb/blob/cartilage/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/SpatialJoinRule.java
(a
rule for SQL++). Now I'm stuck with some questions and I would
really appreciate if you can help me to answer them:
1. What is the syntax to make placeholders for input operators and
variables in a SQL++ template? In the FuzzyJoinRule
<
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
implementation
<
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
,
I saw that it is using ##LEFT_0 for left input operator and $$LEFT_0 for
left input variable. Is it similar in SQL++?
2. Does the query template for substitution must be a (full) compilable
query? Or it can be just a part of a full query?
3. Can the substitute query contain UDF functions?
Thanks,
Tin
(*) Rewitten query: in short, it partitioned the datasets by a grid (red
statement) then we compute the intersection pairs for each cell of the
grid.
- 'mytileids' UDF function returns the corresponding cell ID of a spatial
object.
- 'referencepointtileid' UDF function is used to ignore the duplications in
the final result.
- (-180.0,83.0,180.0,90.0) is space MBR.
- (100,100) is grid size (rows x columns).
```
*SELECT COUNT(*) FROM (SELECT parksPartitioned.tile as tile1,
lakesPartitioned.tile as tile2,
test#referencepointtileid(parksPartitioned.geom,
lakesPartitioned.geom,-180.0,83.0,180.0,90.0,100,100) as
ref_tileFROM(SELECT tile, p.ps.id <http://p.ps.id/>, p.ps.geom FROM
(SELECT
test#mytileids(ps.geom,-180.0,-83.0,180.0,90.0,100,100) AS tileids, ps FROM
ParkSet as ps) AS p UNNEST p.tileids AS tile) AS parksPartitioned,(SELECT
tile, p.ps.id <http://p.ps.id/>, p.ps.geom FROM (SELECT
test#mytileids(ps.geom,-180.0,-83.0,180.0,90.0,100,100) AS tileids, ps FROM
LakeSet as ps) AS p UNNEST p.tileids AS tile) AS lakesPartitionedWHERE
parksPartitioned.tile = lakesPartitioned.tileAND
spatial_intersect(parksPartitioned.geom, lakesPartitioned.geom)) AS result
WHERE result.tile1 = result.ref_tile;*
*```*