Martin Andersson created SEDONA-221:
---------------------------------------
Summary: Outer join throws NPE for null geometries
Key: SEDONA-221
URL: https://issues.apache.org/jira/browse/SEDONA-221
Project: Apache Sedona
Issue Type: Bug
Reporter: Martin Andersson
The following query throws a NullPointerException.
{code}
select /*+ BROADCAST(t2) */ * from t1 left join t2 on st_intersects(t1.geom, t2.geom)
{code}
{code}
java.lang.NullPointerException
at org.locationtech.jts.io.WKBReader.read(WKBReader.java:159)
at org.apache.sedona.sql.utils.GeometrySerializer$.deserialize(GeometrySerializer.scala:50)
at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryBase.$anonfun$toSpatialRDD$1(TraitJoinQueryBase.scala:45)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
{code}
The failure happens when the streaming side is mapped to a SpatialRDD. The NPE
does not occur for an inner join with null geometries. I suspect Spark pushes
down a not-null predicate in that case, since rows with null geometries would
be excluded from an inner join anyway.
Looking at the code, I suspect there are more errors in the new broadcast join
types. The InternalRow is encoded in the user data field of the geometry, which
doesn't work if the geometry is null. For a left join, the InternalRow on the
left side has to be emitted even if its geometry is null. Instead of using a
SpatialRDD, it might be better to map the RDD[InternalRow] to an
RDD[Pair[Geometry, InternalRow]], where the Geometry may be null.
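
A minimal sketch of the pairing idea, using plain Scala collections rather than Spark RDDs. Everything in it is a hypothetical stand-in for illustration: Box replaces a JTS Geometry, Row replaces InternalRow, a comma-separated string replaces the WKB bytes, and a nested loop replaces the spatial index. It is not Sedona code. It only shows that keeping the row next to an optional geometry lets a left join emit rows whose geometry is null.

```scala
object NullSafeJoinSketch {
  // Hypothetical stand-in for a geometry; not a JTS or Sedona class.
  final case class Box(minX: Double, minY: Double, maxX: Double, maxY: Double) {
    def intersects(o: Box): Boolean =
      minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY
  }

  // Hypothetical stand-in for InternalRow; `serialized` plays the role of
  // the geometry column and may be null.
  final case class Row(id: Int, serialized: String)

  // Null-safe deserialization: a null column becomes None instead of throwing.
  def deserialize(serialized: String): Option[Box] =
    Option(serialized).map { s =>
      s.split(",").map(_.toDouble) match {
        case Array(minX, minY, maxX, maxY) => Box(minX, minY, maxX, maxY)
      }
    }

  // Keep each row next to its (possibly absent) geometry, rather than
  // storing the row inside the geometry's user data.
  def toPairs(rows: Seq[Row]): Seq[(Option[Box], Row)] =
    rows.map(r => (deserialize(r.serialized), r))

  // Left outer join: every left row is emitted at least once, including
  // rows whose geometry is null.
  def leftOuterJoin(left: Seq[Row], right: Seq[Row]): Seq[(Row, Option[Row])] = {
    // Right-side rows without a geometry can never match, so drop them.
    val indexed = toPairs(right).collect { case (Some(g), r) => (g, r) }
    toPairs(left).flatMap { case (geom, l) =>
      val matches = for {
        g <- geom.toSeq
        (rg, r) <- indexed
        if g.intersects(rg)
      } yield r
      if (matches.isEmpty) Seq((l, Option.empty[Row]))
      else matches.map(r => (l, Option(r)))
    }
  }
}
```

With this shape, a left row with a null geometry never reaches the intersection test but is still emitted with an empty right side, which is the behaviour a left join requires.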