[ 
https://issues.apache.org/jira/browse/SEDONA-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622231#comment-17622231
 ] 

Kristin Cowalcijk commented on SEDONA-180:
------------------------------------------

I'm working on a patch to apply these improvements to all the UDFs defined by 
Sedona. I'll submit a pull request once I'm done.

> Make better use of catalyst analyzer/optimizer - expression properties and 
> traits
> ---------------------------------------------------------------------------------
>
>                 Key: SEDONA-180
>                 URL: https://issues.apache.org/jira/browse/SEDONA-180
>             Project: Apache Sedona
>          Issue Type: Improvement
>            Reporter: Kristin Cowalcijk
>            Priority: Major
>
> There are several improvements we can make to the Catalyst expressions defined by 
> Apache Sedona. This issue describes three trivial ones.
> h3. Implement {{foldable}} for constant folding optimization
> There are many range queries where the query window can be folded into a 
> constant geometry, such as the following query:
> {code:sql}
> SELECT * FROM some_table WHERE ST_Within(geom, ST_GeomFromText('POLYGON 
> (...)'))
> {code}
> The physical plan is:
> {code:java}
> == Physical Plan ==
> Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS 
> geom#18]
> +- Filter st_within(st_point(_2#10, _1#9), st_geomfromtext(POLYGON ((0 0, 100 
> 0, 100 100, 0 100, 0 0))))
>    +- <table to be scanned>
> {code}
> The query window {{ST_GeomFromText('POLYGON (...)')}} can be folded into a 
> constant, since the argument passed to {{ST_GeomFromText}} is a constant and 
> {{ST_GeomFromText}} is deterministic. We can implement the {{foldable}} 
> method for such expressions so that the constant folding optimization pass 
> can fold them:
> {code:scala}
> override def foldable: Boolean = inputExpressions.forall(_.foldable)
> {code}
> Once the {{foldable}} method is properly implemented, the physical plan 
> becomes:
> {code:java}
> == Physical Plan ==
> Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS 
> geom#18]
> +- Filter st_within(st_point(_2#10, _1#9), 
> [1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
>    +- <table to be scanned>
> {code}
> The second argument passed to {{st_within}} has become a constant value, which 
> is the internal representation of the {{GeometryUDT}} value.
> Folding constant query window expressions is an important step toward 
> implementing spatial predicate pushdown, and it also benefits query 
> performance by eliminating unnecessary repeated evaluations.
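> As a sketch, a Sedona-style expression with this property could look like the 
> following (the class name is illustrative, not Sedona's actual code; {{eval}}, 
> {{dataType}}, and the other required members are elided):
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.Expression
> 
> case class ST_ExampleExpr(inputExpressions: Seq[Expression]) extends Expression {
>   // Foldable iff every child is foldable. Together with determinism (true by
>   // default when all children are deterministic), this lets the ConstantFolding
>   // rule replace the whole expression with a Literal during optimization.
>   override def foldable: Boolean = inputExpressions.forall(_.foldable)
> 
>   override def children: Seq[Expression] = inputExpressions
>   // eval, dataType, nullable, etc. elided
> }
> {code}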
> h3. Declaration of Input Types
> Catalyst expressions can mix in the {{ExpectsInputTypes}} trait and implement 
> the {{inputTypes}} method so that the Catalyst analyzer type-checks the inputs 
> of your expressions. For example, the following query
> {code:sql}
> SELECT ST_Contains('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))', 
> ST_GeomFromText('POINT (10.0 20.0)'))
> {code}
> will raise an exception at runtime, since the first argument of 
> {{ST_Contains}} should be a geometry:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1) (kontinuation executor driver): java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>       at 
> org.apache.spark.sql.sedona_sql.expressions.ST_Contains.eval(Predicates.scala:49)
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>       at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>       ...
> {code}
> We can implement the {{inputTypes}} method for {{ST_Contains}} to declare 
> that this expression takes two geometries as input:
> {code:scala}
> override def inputTypes: Seq[AbstractDataType] = Seq(GeometryUDT, GeometryUDT)
> {code}
> Now Spark will emit a more informative error message when analyzing this 
> query:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'st_contains('POLYGON 
> ((0 0, 1 0, 1 1, 0 1, 0 0))', st_geomfromtext('POINT (10.0 20.0)'))' due to 
> data type mismatch: argument 1 requires geometry type, however, ''POLYGON ((0 
> 0, 1 0, 1 1, 0 1, 0 0))'' is of string type.; line 1 pos 7;
> 'Project [unresolvedalias(org.apache.spark.sql.sedona_sql.expressions.ST_Contains$, None)]
> +- OneRowRelation
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$7(CheckAnalysis.scala:212)
> ...
> {code}
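> A minimal sketch of such a declaration on a hypothetical predicate (the class 
> name is illustrative, and the {{GeometryUDT}} import path is assumed from 
> Sedona's source layout):
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression}
> import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT
> import org.apache.spark.sql.types.{AbstractDataType, BooleanType, DataType}
> 
> case class ST_ExamplePredicate(inputExpressions: Seq[Expression])
>     extends Expression with ExpectsInputTypes {
>   // The analyzer compares each child's dataType against this list and raises
>   // a "data type mismatch" AnalysisException before execution starts.
>   override def inputTypes: Seq[AbstractDataType] = Seq(GeometryUDT, GeometryUDT)
> 
>   override def dataType: DataType = BooleanType
>   // children, eval, etc. elided
> }
> {code}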
> h3. Implicit Input Type Casting
> Sometimes we want the arguments passed to our expressions to be implicitly 
> cast to the declared types. For example, {{ST_PolygonFromEnvelope}} takes 4 
> arguments, which are the coordinates of the envelope. Currently it only 
> accepts double or decimal values as arguments. If integers are passed to 
> {{ST_PolygonFromEnvelope}}, it will raise a {{scala.MatchError}} 
> exception because we have not handled such cases:
> {code:sql}
> SELECT ST_PolygonFromEnvelope(1 + 1, 2 + 2, 3 + 3, 4 + 4)
> {code}
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1) (kontinuation executor driver): scala.MatchError: 2 (of class 
> java.lang.Integer)
>       at 
> org.apache.spark.sql.sedona_sql.expressions.ST_PolygonFromEnvelope.eval(Constructors.scala:363)
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>       at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
> {code}
> We can make {{ST_PolygonFromEnvelope}} mix in the trait 
> {{ImplicitCastInputTypes}}, which tells the analyzer to implicitly cast 
> input values when possible. We can confirm that the input values were cast 
> by inspecting the query plan:
> {code:java}
> == Physical Plan ==
> Project [st_polygonfromenvelope(2.0, 4.0, 6.0, 8.0) AS 
> st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
> +- *(1) Scan OneRowRelation[]
> {code}
> Additionally, with the {{foldable}} method properly implemented, the entire 
> expression can be folded into a constant:
> {code:java}
> == Physical Plan ==
> *(1) Project 
> [[1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64]
>  AS st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
> +- *(1) Scan OneRowRelation[]
> {code}
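> A sketch of the combined declaration (illustrative class name; other required 
> members elided):
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes}
> import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
> 
> case class ST_ExampleFromEnvelope(inputExpressions: Seq[Expression])
>     extends Expression with ImplicitCastInputTypes {
>   // ImplicitCastInputTypes extends ExpectsInputTypes: besides type checking,
>   // the analyzer wraps children in casts where an implicit coercion exists,
>   // e.g. IntegerType arguments become Cast(child, DoubleType).
>   override def inputTypes: Seq[AbstractDataType] =
>     Seq(DoubleType, DoubleType, DoubleType, DoubleType)
> 
>   // foldable, children, eval, etc. elided
> }
> {code}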



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
