Hi all, I'm working on a project which aims to improve performance of spatial join query in AsterixDB.
The problem can be described as follows: Assume that we have 2 dataset ParkSet(id, geom) and LakeSet(id, geom) with geom is a spatial data type (point, rectangle, polygon). This is a join query to list all intersected pairs of these two datasets: ``` *SELECT COUNT(*) FROM ParkSet AS ps, LakeSet AS lsWHERE spatial_intersect(ps.geom, ls.geom);* *```* I rewrote this query in a more complicated SQL++ query(*) but has a better performance. I took a look at FuzzyJoinRule <https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java> implementation <https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java> (a rule for AQL+) in AsterixDB and I'm trying to make a similar implementation called SpatalJoinRule <https://github.com/tinvukhac/asterixdb/blob/cartilage/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/SpatialJoinRule.java> (a rule for SQL++). Now I'm stuck with some questions and I would really appreciate if you can help me to answer them: 1. What is the syntax to make placeholders for input operators and variables in a SQL++ template? In the FuzzyJoinRule <https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java> implementation <https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java>, I saw that it is using ##LEFT_0 for left input operator and $$LEFT_0 for left input variable. Is it similar in SQL++? 2. Does the query template for substitution must be a (full) compilable query? Or it can be just a part of a full query? 3. Can the substitute query contain UDF functions? Thanks, Tin (*) Rewitten query: in short, it partitioned the datasets by a grid (red statement) then we compute the intersection pairs for each cell of the grid. - 'mytileids' UDF function returns the corresponding cell ID of a spatial object. - 'referencepointtileid' UDF function is used to ignore the duplications in the final result. - (-180.0,83.0,180.0,90.0) is space MBR. - (100,100) is grid size (rows x columns). ``` *SELECT COUNT(*) FROM (SELECT parksPartitioned.tile as tile1, lakesPartitioned.tile as tile2, test#referencepointtileid(parksPartitioned.geom, lakesPartitioned.geom,-180.0,83.0,180.0,90.0,100,100) as ref_tileFROM(SELECT tile, p.ps.id <http://p.ps.id/>, p.ps.geom FROM (SELECT test#mytileids(ps.geom,-180.0,-83.0,180.0,90.0,100,100) AS tileids, ps FROM ParkSet as ps) AS p UNNEST p.tileids AS tile) AS parksPartitioned,(SELECT tile, p.ps.id <http://p.ps.id/>, p.ps.geom FROM (SELECT test#mytileids(ps.geom,-180.0,-83.0,180.0,90.0,100,100) AS tileids, ps FROM LakeSet as ps) AS p UNNEST p.tileids AS tile) AS lakesPartitionedWHERE parksPartitioned.tile = lakesPartitioned.tileAND spatial_intersect(parksPartitioned.geom, lakesPartitioned.geom)) AS result WHERE result.tile1 = result.ref_tile;* *```*
