[ 
https://issues.apache.org/jira/browse/SEDONA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655634#comment-17655634
 ] 

Adam Binford commented on SEDONA-231:
-------------------------------------

This is an interesting idea! Only comment I might add is instead of using a 
method that sets a flag, maybe just create an "evalDeserialized" method on the 
SerdeAware trait that gets called instead of "eval" if the expression matches 
the trait

> Redundant Serde Removal
> -----------------------
>
>                 Key: SEDONA-231
>                 URL: https://issues.apache.org/jira/browse/SEDONA-231
>             Project: Apache Sedona
>          Issue Type: Improvement
>            Reporter: Doug Dennis
>            Priority: Major
>
> Currently, Geometry objects are deserialized and reserialized during every 
> evaluation of a function on a row in Spark. This amounts to a great deal of 
> redundant serde during query execution. At times, objects are serialized just 
> to be immediately deserialized.
> To demonstrate this in action, I placed print statements in the 
> GeometrySerializer serialize and deserialize methods, the GeometryUDT 
> serialize and deserialize methods, and in the eval methods of several 
> functions. When the following is executed:
>  
> {noformat}
> val columns = Seq("input", "blade")
> val data = Seq(
>   ("GEOMETRYCOLLECTION ( LINESTRING (0 0, 1.5 1.5, 2 2), LINESTRING (3 3, 4.5 
> 4.5, 5 5))", "MULTIPOINT (0.5 0.5, 1 1, 3.5 3.5, 4 4)")
> )
> var df = spark.createDataFrame(data).toDF(columns:_*)     
> println(
>   df.selectExpr("ST_Normalize(ST_Split(ST_GeomFromWKT(input), 
> ST_GeomFromWKT(blade))) AS 
> result").collect()(0).get(0).asInstanceOf[Geometry].toText()
> ){noformat}
> I get the following output:
> {noformat}
>  **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize**
>  **org.apache.spark.sql.sedona_sql.expressions.ST_Split**
>  **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
> Inside GeometrySerializer.serialize
> Inside GeometrySerializer.serialize
> Inside GeometrySerializer.serialize
> Inside GeometrySerializer.deserialize
>  **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
> Inside GeometrySerializer.serialize
> Inside GeometrySerializer.deserialize
> Inside GeometrySerializer.serialize
> Inside GeometrySerializer.deserialize
> Inside GeometrySerializer.serialize
> Inside UDT deserialize.
> Inside GeometrySerializer.deserialize
> MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, 
> 3.5 3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5)){noformat}
> To explain what is happening:
>  # ST_Normalize.eval is called.
>  # ST_Normalize.eval calls ST_Split.eval.
>  # ST_Split.eval first calls the ST_GeomFromWKT that had the 
> GEOMETRYCOLLECTION wkt.
>  # ST_GeomFromWKT processes the wkt string and generates a Geometry object.
>  # The Geometry object is passed to GeometrySerializer.serialize. This is the 
> first call to serialize.
>  # This object is a GeometryCollection and the GeometrySerializer uses 
> recursion to handle them so you see two more additional calls to serialize.
>  # The GeometryCollection is then immediately deserialized and returned to 
> ST_Split.
>  # The second ST_GeomFromWKT is called (this one has a MULTIPOINT wkt).
>  # ST_GeomFromWKT processes the WKT and then serializes the geometry.
>  # That geometry is immediately deserialized and returned to ST_Split.
>  # ST_Split performs its operation and then serializes the geometry.
>  # That geometry is then immediately deserialized and returned to 
> ST_Normalize.
>  # ST_Normalize normalizes the geometry object and then serializes it for 
> good.
>  # Then the GeometryUDT.deserialize is called to handle the collect call 
> which of course calls GeometrySerializer.deserialize.
> There are multiple instances here where geometry objects are serialized and 
> then immediately deserialized to be further operated on. That is obviously 
> pretty wasteful.
>  
> I propose eliminating this redundancy through the following steps.
>  * Create a trait called SerdeAware which has a single method called 
> doNotSerializeOutput.
>  * This trait is then added to the InferredUnaryExpression and 
> InferredBinaryExpression abstract classes.
>  * When the doNotSerializeOutput is called on one of the expression classes, 
> a serializeOutput flag is set to false.
>  * That flag is read in the class's eval method.
>  * If the flag is false then the output will not be serialized and if the 
> flag is true then the output does get serialized.
>  * Finally, the buildExtractor method of the InferredTypes object is modified 
> to detect if the input expression is SerdeAware and if it is then the 
> doNotSerializeOutput method is called before calling the input expression's 
> eval method.
> In the test implementation I created I was able to get the following output:
> {noformat}
>  **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize**
>  **org.apache.spark.sql.sedona_sql.expressions.ST_Split**
>  **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
>  **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
> Inside GeometrySerializer.serialize
> Inside UDT deserialize.
> Inside GeometrySerializer.deserialize
> MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, 
> 3.5 3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5)){noformat}
> You can see that only a single serialization was called and only at the very 
> end of the computation.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to