Kristin Cowalcijk created SEDONA-180:
----------------------------------------

             Summary: Make better use of catalyst analyzer/optimizer - expression properties and traits
                 Key: SEDONA-180
                 URL: https://issues.apache.org/jira/browse/SEDONA-180
             Project: Apache Sedona
          Issue Type: Improvement
            Reporter: Kristin Cowalcijk


There are several improvements we can make to catalyst expressions defined by 
Apache Sedona. This issue describes three trivial ones.

h3. Implement {{foldable}} for constant folding optimization

There are many range queries whose query window can be folded into a constant geometry, such as the following:

{code:sql}
SELECT * FROM some_table WHERE ST_Within(geom, ST_GeomFromText('POLYGON (...)'))
{code}

The physical plan is:
{code}
== Physical Plan ==
Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS geom#18]
+- Filter st_within(st_point(_2#10, _1#9), st_geomfromtext(POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))))
   +- <table to be scanned>
{code}

The query window {{ST_GeomFromText('POLYGON (...)')}} can be folded into a constant, since the argument passed to {{ST_GeomFromText}} is a constant and {{ST_GeomFromText}} is deterministic. We can implement the {{foldable}} method of such expressions to allow the constant-folding optimizer to fold them:

{code:scala}
override def foldable: Boolean = inputExpressions.forall(_.foldable)
{code}

Once the {{foldable}} method is properly implemented, the physical plan becomes:

{code}
== Physical Plan ==
Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS geom#18]
+- Filter st_within(st_point(_2#10, _1#9), [1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
   +- <table to be scanned>
{code}

The second argument passed to {{st_within}} has become a constant value, which is the internal representation of the {{GeometryUDT}} value.

Folding constant query window expressions is an important step toward implementing spatial predicate pushdown, and it also benefits query performance by eliminating unnecessary repeated evaluations.

h3. Declaration of Input Types

A catalyst expression can mix in the {{ExpectsInputTypes}} trait and implement the {{inputTypes}} method to make the catalyst analyzer type-check the inputs of the expression. For example, the following query

{code:sql}
SELECT ST_Contains('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))', ST_GeomFromText('POINT (10.0 20.0)'))
{code}

will raise an exception at runtime, since the first argument of {{ST_Contains}} 
should be a geometry:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (kontinuation executor driver): java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
        at org.apache.spark.sql.sedona_sql.expressions.ST_Contains.eval(Predicates.scala:49)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        ...
{code}

We can implement the {{inputTypes}} method for {{ST_Contains}} to declare that the inputs of this expression should be two geometries:

{code:scala}
override def inputTypes: Seq[AbstractDataType] = Seq(GeometryUDT, GeometryUDT)
{code}

Now Spark will emit a more informative error message when analyzing this query:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve 'st_contains('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))', st_geomfromtext('POINT (10.0 20.0)'))' due to data type mismatch: argument 1 requires geometry type, however, ''POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))'' is of string type.; line 1 pos 7;
'Project [unresolvedalias(org.apache.spark.sql.sedona_sql.expressions.ST_Contains$, None)]
+- OneRowRelation
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$7(CheckAnalysis.scala:212)
...
{code}
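For concreteness, here is a sketch of how {{ST_Contains}} could mix in the trait. Everything except {{inputTypes}} is abbreviated, the import path of {{GeometryUDT}} is an assumption, and the exact abstract members of {{Expression}} depend on the Spark version:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression}
import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT // assumed import path
import org.apache.spark.sql.types.{AbstractDataType, BooleanType, DataType}

case class ST_Contains(inputExpressions: Seq[Expression])
  extends Expression with ExpectsInputTypes {

  // Both arguments must be geometries; a raw string literal is now
  // rejected at analysis time instead of causing a ClassCastException
  // at runtime.
  override def inputTypes: Seq[AbstractDataType] = Seq(GeometryUDT, GeometryUDT)

  override def children: Seq[Expression] = inputExpressions
  override def dataType: DataType = BooleanType
  override def nullable: Boolean = false
  override def eval(input: InternalRow): Any = ??? // spatial predicate evaluation elided
}
{code}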


h3. Implicit Input Type Casting

Sometimes we want the arguments passed to our expressions to be implicitly cast to the declared types. For example, {{ST_PolygonFromEnvelope}} takes 4 arguments, which are the coordinates of the envelope. Currently it only accepts double or decimal values as arguments. If integers are passed to {{ST_PolygonFromEnvelope}}, it raises a {{scala.MatchError}} because we have not handled such cases:

{code:sql}
SELECT ST_PolygonFromEnvelope(1 + 1, 2 + 2, 3 + 3, 4 + 4)
{code}

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (kontinuation executor driver): scala.MatchError: 2 (of class java.lang.Integer)
        at org.apache.spark.sql.sedona_sql.expressions.ST_PolygonFromEnvelope.eval(Constructors.scala:363)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
{code}

We can make {{ST_PolygonFromEnvelope}} mix in the {{ImplicitCastInputTypes}} trait, which tells the analyzer to implicitly cast input values when possible. We can confirm that the input values were cast by inspecting the query plan:

{code}
== Physical Plan ==
Project [st_polygonfromenvelope(2.0, 4.0, 6.0, 8.0) AS st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
+- *(1) Scan OneRowRelation[]
{code}

Additionally, with the {{foldable}} method properly implemented, the entire expression can be folded into a constant:

{code}
== Physical Plan ==
*(1) Project [[1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64] AS st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
+- *(1) Scan OneRowRelation[]
{code}
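Combining the two mixins, {{ST_PolygonFromEnvelope}} could be sketched as follows. The evaluation logic is elided, {{GeometryUDT}} is assumed to be Sedona's geometry UDT in scope, and the exact member set of {{Expression}} depends on the Spark version:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT // assumed import path
import org.apache.spark.sql.types.{AbstractDataType, DataType, DoubleType}

case class ST_PolygonFromEnvelope(inputExpressions: Seq[Expression])
  extends Expression with ImplicitCastInputTypes {

  // Declaring DoubleType for all four coordinates lets the analyzer wrap
  // integer arguments in implicit casts, so ST_PolygonFromEnvelope(1 + 1, ...)
  // becomes st_polygonfromenvelope(2.0, ...) in the resolved plan.
  override def inputTypes: Seq[AbstractDataType] =
    Seq(DoubleType, DoubleType, DoubleType, DoubleType)

  // Foldable when every input is foldable, so a constant envelope can be
  // folded into a literal by the optimizer.
  override def foldable: Boolean = inputExpressions.forall(_.foldable)

  override def children: Seq[Expression] = inputExpressions
  override def dataType: DataType = GeometryUDT
  override def nullable: Boolean = false
  override def eval(input: InternalRow): Any = ??? // polygon construction elided
}
{code}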



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
