douglasdennis opened a new pull request, #792: URL: https://github.com/apache/sedona/pull/792
## Did you read the Contributor Guide? - Yes, I have read [Contributor Rules](https://sedona.apache.org/latest-snapshot/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest-snapshot/community/develop/) ## Is this PR related to a JIRA ticket? - Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-231. ## What changes were proposed in this PR? To avoid redundant serde during Spark execution, I propose the following changes: - Create a `SerdeAware` trait that has a single abstract method called `evalWithoutSerialization`. This method acts the normal `eval` methods except that it will not attempt to serialize its results into a Spark internal type. - Apply this trait to the null safe expression classes as well as whichever other function classes that it will work with. Implement the `evalWithoutSerialization` for those classes such that they do not attempt to serialize any Geometry object they return (and returning the actual Geometry object instead). - When a sedona function goes to extract the values from its input, it first checks to see if the input expression is `SerdeAware`. If the input expression is SerdeAware then the `evalWithoutSerialization` method is called and an actual Geometry object is returned, skipping the serialization and deserialization that would normally happen. To facilitate this, some changes had to be made to a few functions. Specifically, any function that could return LinearRing objects had to be changed such that they guarantee not doing that. This allows for compliance with OGC to be maintained. I have three points of concern, which is why I marked this PR as draft: 1. I'm not sure if my approach is the most succinct or performant. If there are better ways to eliminate the redundant serde then I am open to comment. I can either implement the suggestions myself or I can close this PR and would be happy to see another implementation come out. 2. I don't know how to test this. At one point I had print statements in the serializer (which you can see in the issue) that helped show that I had in fact eliminated the redundant serde, but I'm not sure how to write a test to automate that. Is there a way to mock the geometry serializer? Or maybe add debug logging that runs for the test but is disabled in some kind of release mode? I'm not sure what the convention is here. 3. By enforcing LineStrings for certain functions, I have subtly changed a public API in flink. I'm not sure if that needs to be documented and, if so, where. Spark should not be affected since the serializer already does not allow for LinearRings. ## How was this patch tested? Unit tests pass. A few in flink had to be updated since certain functions no longer return LinearRing objects. As for ensuring the redundant serde is eliminated, please see point #2 above. ## Did this PR include necessary documentation updates? - No, see above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
