I don't think that's the issue. The join detection is the same for both broadcast and non-broadcast, so the same match statement needs to run either way. I created an issue for what I found from the stack trace (don't have a copy of the stack trace to share easily): https://issues.apache.org/jira/browse/SEDONA-56
Adam

On Wed, Aug 4, 2021 at 9:02 PM Jia Yu <jiayu198...@gmail.com> wrote:

Hi Adam,

I believe the issue is caused by this chunk of code:
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109

If we move the broadcast join detection to the first part of the detector and set all other join detection to the "else", will it fix the issue?

    if (broadcast XXX)
    else {
      all other join detection.
    }

Thanks,
Jia

On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <adam...@gmail.com> wrote:

Okay, I actually did encounter it today. It happens when you have AQE enabled. I looked into it a little bit, and we might have to rework the SpatialIndexExec node to extend BroadcastExchangeLike, or maybe even BroadcastExchangeExec directly, but that might only be compatible with Spark 3+, so I'm not sure what to do about that. I don't know if there are specific AQE rules or optimizations that can be disabled to get it to work, but if you just disable AQE completely it should work for now. I'm also not at all familiar with the inner workings of AQE, so I don't know the right way to properly support it.

Adam

On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <adam...@gmail.com> wrote:

I haven't encountered any issues with it, but I can investigate with the full stack trace. Also, which version of Spark is this with?

Adam

On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:

Hi Pietro,

Can you please share the full stack trace of this scala.MatchError? I tried a couple of test cases but wasn't able to reproduce the error on my end. In fact, another user complained about the same issue a while back, so I suspect there is a bug in this part.

I also CCed the contributor of Sedona broadcast join.
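For readers hitting the same MatchError: Adam's interim workaround above (disabling AQE entirely) can be sketched in PySpark as follows. The app name and session variable are illustrative; `spark.sql.adaptive.enabled` is the standard Spark configuration key for Adaptive Query Execution.

```python
# Sketch of the interim workaround: disable Adaptive Query Execution so
# Spark's runtime re-planning never touches Sedona's SpatialIndexExec node.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sedona-broadcast-join")               # illustrative app name
    .config("spark.sql.adaptive.enabled", "false")  # turn AQE off
    .getOrCreate()
)

# Or, on an already-running session:
# spark.conf.set("spark.sql.adaptive.enabled", "false")
```

Note this trades away AQE's runtime optimizations for the whole session; it is a stopgap until SpatialIndexExec works with AQE.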
@adam...@gmail.com Hi Adam, do you have any idea about this issue?

Thanks,
Jia

On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.grese...@gmail.com> wrote:

Hello Jia,

thank you so much for your support.

We have been able to complete our task and perform a few runs with different numbers of partitions. So far we obtained the best performance when running on 20 nodes with the number of partitions set to 2000. With this configuration it took approximately 45 minutes to write the join's output.

We then tried to perform the same join through broadcast, as you suggested, to see whether we could achieve better results, but we got the following error when calling an action like broadcast_join.show() on the output:

    Py4JJavaError: An error occurred while calling o699.showString.
    : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]

We would be grateful if you could support us on this.

The broadcast join was performed as follows:

    broadcast_join = points_df.alias('points').join(
        f.broadcast(polygons_df).alias('polygons'),
        f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))

Attached you can find the pseudo-code we used to test the broadcast join.

Looking forward to hearing from you.

Kind regards,

Pietro Greselin

On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:

Hi Pietro,

A few tips to optimize your join:

1. Mix DF and RDD together and use the RDD API for the join part.
See the example here:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb

2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try a large number of partitions (say 1000 or more).

If this approach doesn't work, consider a broadcast join and broadcast the polygon side:
https://sedona.apache.org/api/sql/Optimizer/#broadcast-join

Thanks,
Jia

On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com> wrote:

To whom it may concern,

we observed the following Sedona behaviour and would like to ask your opinion on how we can optimize it.

Our aim is to perform an inner spatial join between a points_df and a polygons_df whenever a point in points_df is contained in a polygon from polygons_df. Below are more details about the two dataframes we are considering:
- points_df: contains 50 million events with latitude and longitude rounded to the third decimal digit;
- polygons_df: contains 10k multi-polygons with 600 vertices on average.

The issue we are reporting is a very long computing time: the spatial join query never completes, even when running on a cluster with 40 workers with 4 cores each. The driver prints no error, but we receive the following warning:

    WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true, but no index exists. Will build index on the fly.

We were able to run the same spatial join successfully when considering only a very small sample of events.
Do you have any suggestions on how we can achieve the same result on higher volumes of data, or whether there is a way to optimize the join?

Attached you can find the pseudo-code we are running.

Looking forward to hearing from you.

Kind regards,
Pietro Greselin

--
Adam Binford
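For reference, Jia's tips 1 and 2 (mix the DataFrame and RDD APIs, and partition with many KDB-tree partitions) can be sketched roughly as below, following the flow of the linked AirportsPerCountry notebook. This is a sketch, not a tested implementation: the class and method names are from the Sedona 1.0 Python API as I understand it, and the geometry column names ("pointshape", "polygonshape") and partition count are taken from the thread or chosen for illustration. It requires a running SparkSession with Sedona registered.

```python
# Rough sketch of the DF -> RDD spatial join flow (tips 1 and 2 above).
# Assumes `spark`, `points_df`, and `polygons_df` already exist and that
# Sedona is registered on the session; names below are illustrative.
from sedona.core.enums import GridType, IndexType
from sedona.core.spatialOperator import JoinQuery
from sedona.utils.adapter import Adapter

# Tip 1: drop from the DataFrame API to SpatialRDDs for the join itself.
point_rdd = Adapter.toSpatialRdd(points_df, "pointshape")
polygon_rdd = Adapter.toSpatialRdd(polygons_df, "polygonshape")

point_rdd.analyze()
polygon_rdd.analyze()

# Tip 2: KDB-tree partitioning with a large partition count (1000+),
# and reuse the point side's partitioner on the polygon side.
point_rdd.spatialPartitioning(GridType.KDBTREE, 1000)
polygon_rdd.spatialPartitioning(point_rdd.getPartitioner())

# Pre-build an index on the partitioned RDD so the join does not hit the
# "UseIndex is true, but no index exists" on-the-fly build warning.
point_rdd.buildIndex(IndexType.QUADTREE, True)

# Contains-style join with polygons as the query windows;
# useIndex=True, considerBoundaryIntersection=True.
result = JoinQuery.SpatialJoinQueryFlat(point_rdd, polygon_rdd, True, True)

# Convert the resulting pair RDD back to a DataFrame for downstream use.
result_df = Adapter.toDf(result, spark)
```

The key design point in the thread is that the partition count (1000+ here, versus the notebook's 4) controls join parallelism and skew handling on the 50-million-point side.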