yitao-li opened a new pull request #525: URL: https://github.com/apache/incubator-sedona/pull/525
Signed-off-by: Yitao Li <[email protected]> ## Is this PR related to a proposed Issue? I think we should create an issue for it. The problem is currently if `Adapter.toDf()` is called with a `SpatialRDD` carrying non-spatial attributes (a.k.a "user data") in addition to spatial attributes, then the Spark dataframe returned from `toDf()` can only be evaluated successfully once. If it was evaluated more than once, then a null pointer exception occurrs. If a Spark dataframe is not cached, then the underlying RDD could be evaluated more than once in many scenarios, so, the NPE problem could break quite a few use cases. ## What changes were proposed in this PR? Essentially, the changes in this PR ensures `toDf()` respects the immutability of `RDD`s in Spark. The abovementioned NPE issue happens because the `Geometry` objects stored in the input `RDD` have their `userData` set to `null` within a `RDD.map` transformation, and this causes unexpected `null` values (at least when Spark is running in "local" mode) when this transformation is evaluated the second time. So, the short-term fix (as shown in this PR) is to operate on a copy of the `Geometry` object, rather than modifying the `Geometry` object that is already stored in the `RDD`. Long-term fix: I think it would be a better practice to work with `RDD[ImmutableGeometry]`s where `ImmutableGeometry` is a sub-class of `Geometry` that does not permit modifications of its existing data fields, but permits copying its underling data to a mutable `Geometry` object. I'm unsure whether there may be noticeable perf impact with this type of change though and I hope to get a second opinion on this. But this feels like, in principle, the correct thing to do, because `RDD`s are by design supposed to be immutable in Spark, and correctness issue arises if the immutability contract is not followed. ## How was this patch tested? Unit test (in progress) ## Did this PR include necessary documentation updates? Not applicable -- there is no change to documentation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
