I haven't encountered any issues with it, but I can investigate with the full stack trace. Also, which version of Spark is this with?
Adam

On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:

> Hi Pietro,
>
> Can you please share the full stack trace of this scala.MatchError? I tried
> a couple of test cases but wasn't able to reproduce this error on my end.
> In fact, another user complained about the same issue a while back. I
> suspect there is a bug in this part.
>
> I also CCed the contributor of the Sedona broadcast join. @adam...@gmail.com
> <adam...@gmail.com> Hi Adam, do you have any idea about this issue?
>
> Thanks,
> Jia
>
> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.grese...@gmail.com> wrote:
>
> > Hello Jia,
> >
> > Thank you so much for your support.
> >
> > We have been able to complete our task and to perform a few runs with
> > different numbers of partitions. So far we obtained the best performance
> > when running on 20 nodes with the number of partitions set to 2000. With
> > this configuration, it took approximately 45 minutes to write the join's
> > output.
> >
> > We then tried to perform the same join through a broadcast, as you
> > suggested, to see whether we could achieve better results, but we got
> > the following error when calling an action like broadcast_join.show()
> > on the output:
> >
> > Py4JJavaError: An error occurred while calling o699.showString.
> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]
> >
> > We would be grateful if you could support us on this.
> >
> > The broadcast join was performed as follows:
> >
> > broadcast_join = points_df.alias('points').join(
> >     f.broadcast(polygons_df).alias('polygons'),
> >     f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
> >
> > Attached you can find the pseudo-code we used to test the broadcast join.
> >
> > Looking forward to hearing from you.
> >
> > Kind regards,
> > Pietro Greselin
> >
> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
> >
> >> Hi Pietro,
> >>
> >> A few tips to optimize your join:
> >>
> >> 1. Mix the DataFrame and RDD APIs together, and use the RDD API for the
> >> join part. See the example here:
> >> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
> >>
> >> 2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
> >> use a large number of partitions (say 1000 or more).
> >>
> >> If this approach doesn't work, consider a broadcast join if needed.
> >> Broadcast the polygon side:
> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
> >>
> >> Thanks,
> >> Jia
> >>
> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com> wrote:
> >>
> >>> To whom it may concern,
> >>>
> >>> We observed the following Sedona behaviour and would like to ask your
> >>> opinion on how we can optimize it.
> >>>
> >>> Our aim is to perform an inner spatial join between a points_df and a
> >>> polygons_df, matching when a point in points_df is contained in a
> >>> polygon from polygons_df. Below are more details about the two
> >>> dataframes we are considering:
> >>> - points_df: contains 50 million events, with latitude and longitude
> >>>   rounded to the third decimal digit;
> >>> - polygons_df: contains 10k multi-polygons with 600 vertices on average.
> >>>
> >>> The issue we are reporting is a very long computing time: the spatial
> >>> join query never completes, even when running on a cluster with 40
> >>> workers with 4 cores each. No error is printed by the driver, but we
> >>> are receiving the following warning:
> >>>
> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> >>> but no index exists. Will build index on the fly.
> >>>
> >>> We were, however, able to run the same spatial join successfully when
> >>> considering only a very small sample of events. Do you have any
> >>> suggestion on how we can achieve the same result on higher volumes of
> >>> data, or is there a way we can optimize the join?
> >>>
> >>> Attached you can find the pseudo-code we are running.
> >>>
> >>> Looking forward to hearing from you.
> >>>
> >>> Kind regards,
> >>> Pietro Greselin

--
Adam Binford
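[Editorial sketch] Jia's second tip, spatial partitioning with many partitions, pays off because a spatial grid/index prunes most point-polygon pairs before the exact containment test ever runs (the "no index exists, will build index on the fly" warning means Sedona had to do that pruning work per task). The toy below illustrates the idea in plain Python; it is not Sedona's API. The uniform grid, cell size, and ray-casting containment test are illustrative stand-ins for Sedona's KDB-tree partitioning and ST_Contains:

```python
# Toy grid-partitioned spatial join (plain Python, NOT Sedona): index polygon
# bounding boxes on a uniform grid, then run the exact point-in-polygon test
# only for points landing in a cell that some polygon covers.
from collections import defaultdict

def contains(polygon, point):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a horizontal ray going right from the point.
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def grid_join(points, polygons, cell=1.0):
    """Join points to the polygons containing them, pruning via a grid index."""
    index = defaultdict(list)
    for pid, poly in polygons.items():
        xs = [v[0] for v in poly]
        ys = [v[1] for v in poly]
        # Register the polygon in every grid cell its bounding box overlaps.
        for cx in range(int(min(xs) // cell), int(max(xs) // cell) + 1):
            for cy in range(int(min(ys) // cell), int(max(ys) // cell) + 1):
                index[(cx, cy)].append(pid)
    result = []
    for pt in points:
        # Only polygons registered in this point's cell are candidates.
        for pid in index.get((int(pt[0] // cell), int(pt[1] // cell)), []):
            if contains(polygons[pid], pt):
                result.append((pt, pid))
    return result
```

With 50 million points and 10k polygons, the pruning step is what turns an infeasible 500-billion-pair cross product into a tractable per-cell workload; Sedona's spatialPartitioning plus a per-partition index plays the same role at cluster scale.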
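[Editorial sketch] The broadcast join Jia suggests works by replicating the small polygon table to every partition of the large point table, so the 50 million points never need to be shuffled. A minimal plain-Python illustration of that idea follows; it is not Sedona's implementation, and bbox_contains is a deliberately cheap stand-in for the exact ST_Contains predicate:

```python
# Toy broadcast join (plain Python, NOT Sedona): the small polygon dict is
# visible in full to every partition of points, so no shuffle of points occurs.
def bbox_contains(poly, pt):
    """Containment test against the polygon's bounding box only; a real
    engine would refine candidates with an exact ST_Contains check."""
    xs = [v[0] for v in poly]
    ys = [v[1] for v in poly]
    return min(xs) <= pt[0] <= max(xs) and min(ys) <= pt[1] <= max(ys)

def broadcast_join(point_partitions, polygons):
    """polygons is the broadcast side; each partition loop stands in for
    the work one executor would do with its local copy."""
    out = []
    for part in point_partitions:
        for pt in part:
            for pid, poly in polygons.items():
                if bbox_contains(poly, pt):
                    out.append((pt, pid))
    return out
```

This is why broadcasting is only appropriate when the polygon side is small enough to fit in each executor's memory, as it is here (10k polygons versus 50 million points).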