Kristin Cowalcijk created SEDONA-180:
----------------------------------------
Summary: Make better use of catalyst analyzer/optimizer -
expression properties and traits
Key: SEDONA-180
URL: https://issues.apache.org/jira/browse/SEDONA-180
Project: Apache Sedona
Issue Type: Improvement
Reporter: Kristin Cowalcijk
There are several improvements we can make to the Catalyst expressions defined by
Apache Sedona. This issue describes three simple ones.
h3. Implement {{foldable}} for constant folding optimization
There are many range queries whose query window can be folded into a
constant geometry, such as the following query:
{code:sql}
SELECT * FROM some_table WHERE ST_Within(geom, ST_GeomFromText('POLYGON (...)'))
{code}
The physical plan is:
{code}
== Physical Plan ==
Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS
geom#18]
+- Filter st_within(st_point(_2#10, _1#9), st_geomfromtext(POLYGON ((0 0, 100
0, 100 100, 0 100, 0 0))))
+- <table to be scanned>
{code}
The query window {{ST_GeomFromText('POLYGON (...)')}} can be folded into a
constant, since the argument passed to {{ST_GeomFromText}} is a constant and
{{ST_GeomFromText}} is deterministic. We can implement the {{foldable}} method
of such expressions to allow the constant folding optimizer to fold them:
{code:scala}
override def foldable: Boolean = inputExpressions.forall(_.foldable)
{code}
Once the {{foldable}} method is properly implemented, the physical plan
becomes:
{code}
== Physical Plan ==
Project [_1#9 AS latitude#14, _2#10 AS longitude#15, st_point(_2#10, _1#9) AS
geom#18]
+- Filter st_within(st_point(_2#10, _1#9),
[1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
+- <table to be scanned>
{code}
The second argument passed to {{st_within}} has become a constant value, which is
the serialized internal representation of the {{GeometryUDT}} value.
Folding constant query window expressions is an important step toward
implementing spatial predicate push-down, and it also benefits query
performance by eliminating unnecessary repeated evaluations.
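The folding mechanics can be sketched with a toy expression tree in plain Scala. All names below are hypothetical; Spark's real {{Expression}}, {{Literal}}, and {{ConstantFolding}} classes live in {{org.apache.spark.sql.catalyst}} and are considerably richer:

{code:scala}
// Toy model of Catalyst-style constant folding (illustrative only).
sealed trait Expr {
  def foldable: Boolean
  def eval(): Any
}

// A literal is always foldable and evaluates to itself.
case class Literal(value: Any) extends Expr {
  override def foldable: Boolean = true
  override def eval(): Any = value
}

// A deterministic function call is foldable when all of its inputs are
// foldable, mirroring `inputExpressions.forall(_.foldable)` from above.
case class Call(fn: Seq[Any] => Any, inputExpressions: Seq[Expr]) extends Expr {
  override def foldable: Boolean = inputExpressions.forall(_.foldable)
  override def eval(): Any = fn(inputExpressions.map(_.eval()))
}

// The optimizer replaces any foldable subtree with a pre-evaluated literal,
// so it is never re-evaluated per row.
def constantFold(e: Expr): Expr = e match {
  case c: Call if c.foldable => Literal(c.eval())
  case Call(fn, inputs)      => Call(fn, inputs.map(constantFold))
  case other                 => other
}
{code}

In this model, a call standing in for {{ST_GeomFromText}} over a literal WKT string collapses to a single {{Literal}}, while a call whose inputs include a column reference is left in place with only its foldable children folded.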
h3. Declaration of Input Types
A Catalyst expression can mix in the {{ExpectsInputTypes}} trait and implement
the {{inputTypes}} method to make the Catalyst analyzer type-check the inputs of
the expression. For example, the following query
{code:sql}
SELECT ST_Contains('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))',
ST_GeomFromText('POINT (10.0 20.0)'))
{code}
will raise an exception at runtime, since the first argument of {{ST_Contains}}
should be a geometry:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID
1) (kontinuation executor driver): java.lang.ClassCastException:
org.apache.spark.unsafe.types.UTF8String cannot be cast to
org.apache.spark.sql.catalyst.util.ArrayData
at
org.apache.spark.sql.sedona_sql.expressions.ST_Contains.eval(Predicates.scala:49)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
...
{code}
We can implement the {{inputTypes}} method for {{ST_Contains}} to declare that
the inputs of this expression should be two geometries:
{code:scala}
override def inputTypes: Seq[AbstractDataType] = Seq(GeometryUDT, GeometryUDT)
{code}
Now Spark emits a more informative error message when analyzing this query:
{code}
org.apache.spark.sql.AnalysisException: cannot resolve 'st_contains('POLYGON
((0 0, 1 0, 1 1, 0 1, 0 0))', st_geomfromtext('POINT (10.0 20.0)'))' due to
data type mismatch: argument 1 requires geometry type, however, ''POLYGON ((0
0, 1 0, 1 1, 0 1, 0 0))'' is of string type.; line 1 pos 7;
'Project [unresolvedalias(org.apache.spark.sql.sedona_sql.expressions.ST_Contains$, None)]
+- OneRowRelation
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$7(CheckAnalysis.scala:212)
...
{code}
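The analyzer-side check can be sketched with a toy model in plain Scala. The names below are hypothetical; the real trait is Catalyst's {{ExpectsInputTypes}}, and the real check produces the {{AnalysisException}} shown above:

{code:scala}
// Toy model of analyzer-side input type checking (illustrative only).
sealed trait SqlType
case object StringType extends SqlType
case object GeometryType extends SqlType

trait ExpectsInputTypes {
  def inputTypes: Seq[SqlType]
  // Returns one message per mismatched argument, analogous to the
  // "data type mismatch" errors the analyzer raises before execution.
  def checkInputTypes(actual: Seq[SqlType]): Seq[String] =
    inputTypes.zip(actual).zipWithIndex.collect {
      case ((expected, got), i) if expected != got =>
        s"argument ${i + 1} requires $expected, however, a value of $got was given"
    }
}

// A stand-in for ST_Contains declaring two geometry inputs.
object StContains extends ExpectsInputTypes {
  override def inputTypes: Seq[SqlType] = Seq(GeometryType, GeometryType)
}
{code}

Checking {{Seq(StringType, GeometryType)}} against this declaration flags argument 1 at analysis time, instead of failing with a {{ClassCastException}} mid-job.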
h3. Implicit Input Type Casting
Sometimes we want the arguments passed to our expressions to be implicitly
cast to the declared types. For example, {{ST_PolygonFromEnvelope}} takes 4
arguments, which are the coordinates of the envelope. Currently it only accepts
double or decimal values as arguments. If integers are passed to
{{ST_PolygonFromEnvelope}}, it raises a {{scala.MatchError}} exception
because such cases are not handled:
{code:sql}
SELECT ST_PolygonFromEnvelope(1 + 1, 2 + 2, 3 + 3, 4 + 4)
{code}
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID
1) (kontinuation executor driver): scala.MatchError: 2 (of class
java.lang.Integer)
at
org.apache.spark.sql.sedona_sql.expressions.ST_PolygonFromEnvelope.eval(Constructors.scala:363)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
{code}
We can make {{ST_PolygonFromEnvelope}} mix in the {{ImplicitCastInputTypes}}
trait, which tells the analyzer to implicitly cast input values when possible.
We can confirm that the input values were cast by inspecting the query plan:
{code}
== Physical Plan ==
Project [st_polygonfromenvelope(2.0, 4.0, 6.0, 8.0) AS
st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
+- *(1) Scan OneRowRelation[]
{code}
Additionally, with the {{foldable}} method properly implemented, the entire
expression can be folded into a constant:
{code}
== Physical Plan ==
*(1) Project
[[1,3,0,0,32,0,0,0,0,1,0,0,0,5,0,0,0,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,32,64,0,0,0,0,0,0,24,64,0,0,0,0,0,0,16,64,0,0,0,0,0,0,0,64,0,0,0,0,0,0,16,64]
AS st_polygonfromenvelope((1 + 1), (2 + 2), (3 + 3), (4 + 4))#13]
+- *(1) Scan OneRowRelation[]
{code}
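The cast the analyzer inserts can be sketched in plain Scala. The names below are hypothetical, and this toy only covers the integer-to-double widening needed by the example above; the real mixin is Catalyst's {{ImplicitCastInputTypes}}:

{code:scala}
// Toy model of implicit input casting (illustrative only).
sealed trait NumType
case object DoubleType extends NumType

// Mirrors the pre-fix eval: it only matches Double and raises
// scala.MatchError for anything else, e.g. a java.lang.Integer.
def evalCoordinate(v: Any): Double = v match {
  case d: Double => d
}

// The analyzer's implicit cast widens integral inputs to the declared
// type before evaluation, so eval only ever sees Doubles.
def implicitCast(v: Any, expected: NumType): Any = (v, expected) match {
  case (i: Int, DoubleType)  => i.toDouble
  case (l: Long, DoubleType) => l.toDouble
  case _                     => v
}
{code}

With the cast applied first, {{evalCoordinate(implicitCast(1 + 1, DoubleType))}} succeeds, whereas evaluating the raw integer reproduces the {{MatchError}} above.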
--
This message was sent by Atlassian Jira
(v8.20.10#820010)