[
https://issues.apache.org/jira/browse/SEDONA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622395#comment-17622395
]
Jia Yu commented on SEDONA-180:
-------------------------------
[~Kontinuation] This sounds fantastic! Looking forward to the patch!
> Make better use of catalyst analyzer/optimizer - expression properties and
> traits
> ---------------------------------------------------------------------------------
>
> Key: SEDONA-180
> URL: https://issues.apache.org/jira/browse/SEDONA-180
> Project: Apache Sedona
> Issue Type: Improvement
> Reporter: Kristin Cowalcijk
> Priority: Major
>
> There are several improvements we can make to Catalyst expressions defined by
> Apache Sedona. This issue describes three simple ones.
> h3. Implement {{foldable}} for constant folding optimization
> There are many range queries whose query window can be folded into a
> constant geometry, such as the following query:
> {code:sql}
> SELECT * FROM some_table WHERE ST_Within(geom, ST_GeomFromText('POLYGON
> (...)'))
> {code}
> The physical plan is:
> {code:java}
> == Physical Plan ==
> Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS
> geom#18]
> +- Filter st_within(st_point(_2#10, _1#9), st_geomfromtext(POLYGON ((0 0, 100
> 0, 100 100, 0 100, 0 0))))
> +- <table to be scanned>
> {code}
> The query window {{ST_GeomFromText('POLYGON (...)')}} can be folded into a
> constant, since the argument passed to {{ST_GeomFromText}} is a constant and
> {{ST_GeomFromText}} is deterministic. We can override the {{foldable}}
> method for such expressions so that the constant folding optimization pass
> can fold them:
> {code:scala}
> override def foldable: Boolean = inputExpressions.forall(_.foldable)
> {code}
> Once the {{foldable}} method is properly implemented, the physical plan
> becomes:
> {code:java}
> == Physical Plan ==
> Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS
> geom#18]
> +- Filter st_within(st_point(_2#10, _1#9),
> [1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
> +- <table to be scanned>
> {code}
> The second argument passed to {{st_within}} becomes a constant value, which is
> the internal representation of the {{GeometryUDT}} value.
> Folding constant query window expressions is an important step toward
> implementing spatial predicate pushdown, and it also benefits query
> performance by eliminating unnecessary repeated evaluation.
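> To illustrate the mechanism outside of Spark, here is a minimal, self-contained
> sketch of foldable-driven constant folding over a toy expression tree; the
> class names are hypothetical stand-ins, not Sedona's or Catalyst's actual
> classes:
> {code:scala}
> // Toy model of Catalyst's constant-folding pass; all names are hypothetical.
> sealed trait Expr {
>   def foldable: Boolean
>   def eval(): Any
> }
>
> // A literal is always foldable and evaluates to itself.
> case class Literal(value: Any) extends Expr {
>   val foldable: Boolean = true
>   def eval(): Any = value
> }
>
> // A column reference depends on row data, so it is never foldable.
> case class ColumnRef(name: String) extends Expr {
>   val foldable: Boolean = false
>   def eval(): Any = sys.error(s"cannot evaluate $name without a row")
> }
>
> // Stand-in for ST_GeomFromText: foldable iff all of its inputs are foldable,
> // mirroring `inputExpressions.forall(_.foldable)` above.
> case class GeomFromText(wkt: Expr) extends Expr {
>   def foldable: Boolean = wkt.foldable
>   def eval(): Any = s"GEOM(${wkt.eval()})" // stand-in for real WKT parsing
> }
>
> // The optimizer pass: replace any foldable subtree with a Literal of its value.
> def constantFold(e: Expr): Expr = e match {
>   case lit: Literal    => lit
>   case _ if e.foldable => Literal(e.eval())
>   case other           => other
> }
> {code}
> With this, {{constantFold(GeomFromText(Literal("POLYGON (...)")))}} evaluates
> the geometry once and yields a {{Literal}}, while
> {{GeomFromText(ColumnRef("wkt"))}} is left untouched.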
> h3. Declaration of Input Types
> A Catalyst expression can mix in the trait {{ExpectsInputTypes}} and implement
> the {{inputTypes}} method so that the Catalyst analyzer type checks the inputs
> of the expression. For example, the following query
> {code:sql}
> SELECT ST_Contains('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))',
> ST_GeomFromText('POINT (10.0 20.0)'))
> {code}
> will raise an exception at runtime, since the first argument of
> {{ST_Contains}} should be a geometry:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0
> (TID 1) (kontinuation executor driver): java.lang.ClassCastException:
> org.apache.spark.unsafe.types.UTF8String cannot be cast to
> org.apache.spark.sql.catalyst.util.ArrayData
> at
> org.apache.spark.sql.sedona_sql.expressions.ST_Contains.eval(Predicates.scala:49)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
> ...
> {code}
> We can implement the {{inputTypes}} method for {{ST_Contains}} to declare
> that the inputs of this expression should be two geometries:
> {code:scala}
> override def inputTypes: Seq[AbstractDataType] = Seq(GeometryUDT, GeometryUDT)
> {code}
> Now Spark will emit a more informative error message when analyzing this
> query:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'st_contains('POLYGON
> ((0 0, 1 0, 1 1, 0 1, 0 0))', st_geomfromtext('POINT (10.0 20.0)'))' due to
> data type mismatch: argument 1 requires geometry type, however, ''POLYGON ((0
> 0, 1 0, 1 1, 0 1, 0 0))'' is of string type.; line 1 pos 7;
> 'Project [unresolvedalias(org.apache.spark.sql.sedona_sql.expressions.ST_Contains$, None)]
> +- OneRowRelation
> at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$7(CheckAnalysis.scala:212)
> ...
> {code}
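> The analysis-time check can be sketched with a minimal, self-contained model
> (hypothetical names, not Catalyst's actual classes): the expression declares
> its expected input types, and the analyzer compares them against the
> arguments' resolved types before any row is evaluated:
> {code:scala}
> // Toy model of ExpectsInputTypes-style checking; all names are hypothetical.
> sealed trait DataType
> case object StringType extends DataType
> case object GeometryType extends DataType
>
> // An already-resolved argument: its SQL text and its data type.
> case class Arg(sql: String, dataType: DataType)
>
> case class StContains(args: Seq[Arg]) {
>   // Declared expectation, mirroring the `inputTypes` method above.
>   val inputTypes: Seq[DataType] = Seq(GeometryType, GeometryType)
>
>   // Analysis-time check: report the first mismatch instead of letting the
>   // expression fail with a ClassCastException at runtime.
>   def checkInputDataTypes(): Either[String, Unit] =
>     args.zip(inputTypes).zipWithIndex.collectFirst {
>       case ((arg, expected), i) if arg.dataType != expected =>
>         s"argument ${i + 1} requires $expected, however '${arg.sql}' is of ${arg.dataType}"
>     }.toLeft(())
> }
> {code}
> Checking an {{StContains}} whose first argument has {{StringType}} then yields
> a {{Left}} describing the mismatch for argument 1, analogous to the
> {{AnalysisException}} above.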
> h3. Implicit Input Type Casting
> Sometimes we want the arguments passed to our expressions to be implicitly
> cast to the declared types. For example, {{ST_PolygonFromEnvelope}} takes 4
> arguments, which are the coordinates of the envelope. Currently it only
> accepts double or decimal values as arguments. If integers are passed to
> {{ST_PolygonFromEnvelope}}, it raises a {{scala.MatchError}} exception
> because such cases are not handled:
> {code:sql}
> SELECT ST_PolygonFromEnvelope(1 + 1, 2 + 2, 3 + 3, 4 + 4)
> {code}
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0
> (TID 1) (kontinuation executor driver): scala.MatchError: 2 (of class
> java.lang.Integer)
> at
> org.apache.spark.sql.sedona_sql.expressions.ST_PolygonFromEnvelope.eval(Constructors.scala:363)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
> {code}
> We can make {{ST_PolygonFromEnvelope}} mix in the trait
> {{ImplicitCastInputTypes}}, which tells the analyzer to implicitly cast
> input values when possible. We can confirm that the input values were cast
> by inspecting the query plan:
> {code:java}
> == Physical Plan ==
> Project [st_polygonfromenvelope(2.0, 4.0, 6.0, 8.0) AS
> st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
> +- *(1) Scan OneRowRelation[]
> {code}
> Additionally, with the {{foldable}} method properly implemented, the entire
> expression can be folded into a constant:
> {code:java}
> == Physical Plan ==
> *(1) Project
> [[1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64]
> AS st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
> +- *(1) Scan OneRowRelation[]
> {code}
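> The coercion itself can be sketched with a minimal, self-contained model
> (hypothetical names; Catalyst's real rule wraps the argument in a {{Cast}}
> expression during analysis rather than converting values directly):
> {code:scala}
> // Toy model of ImplicitCastInputTypes-style coercion; names are hypothetical.
> sealed trait DType
> case object IntType extends DType
> case object DoubleType extends DType
>
> case class TypedValue(value: Any, dataType: DType)
>
> // If the types already match, keep the argument; if a safe widening cast
> // exists (Int -> Double here), apply it; otherwise report failure with None.
> def implicitCast(arg: TypedValue, expected: DType): Option[TypedValue] =
>   (arg.dataType, expected) match {
>     case (t, e) if t == e      => Some(arg)
>     case (IntType, DoubleType) =>
>       Some(TypedValue(arg.value.asInstanceOf[Int].toDouble, DoubleType))
>     case _                     => None
>   }
> {code}
> This is why the plan above shows {{st_polygonfromenvelope(2.0, 4.0, 6.0, 8.0)}}:
> the integer results of {{1 + 1}}, {{2 + 2}}, etc. were widened to doubles
> before the expression was evaluated.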
--
This message was sent by Atlassian Jira
(v8.20.10#820010)