yitao-li opened a new pull request #525:
URL: https://github.com/apache/incubator-sedona/pull/525


   Signed-off-by: Yitao Li <[email protected]>
   
   ## Is this PR related to a proposed Issue?
   
   I think we should create an issue for it.
   
   The problem is currently if `Adapter.toDf()` is called with a `SpatialRDD` 
carrying non-spatial attributes (a.k.a "user data") in addition to spatial 
attributes, then the Spark dataframe returned from `toDf()` can only be 
evaluated successfully once. If it was  evaluated more than once, then a null 
pointer exception occurrs. If a Spark dataframe is not cached, then the 
underlying RDD could be evaluated more than once in many scenarios, so, the NPE 
problem could break quite a few use cases.
   
   ## What changes were proposed in this PR?
   
   Essentially, the changes in this PR ensures `toDf()` respects the 
immutability of `RDD`s in Spark. The abovementioned NPE issue happens because 
the `Geometry` objects stored in the input `RDD` have their `userData` set to 
`null` within a `RDD.map` transformation, and this causes unexpected `null` 
values (at least when Spark is running in "local" mode) when this 
transformation is evaluated the second time.
   
   So, the short-term fix (as shown in this PR) is to operate on a copy of the 
`Geometry` object, rather than modifying the `Geometry` object that is already 
stored in the `RDD`.
   
   Long-term fix: I think it would be a better practice to work with 
`RDD[ImmutableGeometry]`s where `ImmutableGeometry` is a sub-class of 
`Geometry` that does not permit modifications of its existing data fields, but 
permits copying its underling data to a mutable `Geometry` object. I'm unsure 
whether there may be noticeable perf impact with this type of change though and 
I hope to get a second opinion on this. But this feels like, in principle, the 
correct thing to do, because `RDD`s are by design supposed to be immutable in 
Spark, and correctness issue arises if the immutability contract is not 
followed.
   
   ## How was this patch tested?
   
   Unit test (in progress)
   
   ## Did this PR include necessary documentation updates?
   
   Not applicable -- there is no change to documentation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to