I haven't encountered any issues with it, but I can investigate with the full stack trace. Also, which version of Spark is this with?
Adam

On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:

> Hi Pietro,
>
> Can you please share the full stack trace of this scala.MatchError? I tried
> a couple of test cases but wasn't able to reproduce this error on my end.
> In fact, another user complained about the same issue a while back. I
> suspect there is a bug in this part.
>
> I also CCed the contributor of the Sedona broadcast join. @adam...@gmail.com
> <adam...@gmail.com> Hi Adam, do you have any idea about this issue?
>
> Thanks,
> Jia
>
> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.grese...@gmail.com> wrote:
>
> > Hello Jia,
> >
> > Thank you so much for your support.
> >
> > We have been able to complete our task and to perform a few runs with
> > different numbers of partitions. So far we obtained the best performance
> > when running on 20 nodes with the number of partitions set to 2000. With
> > this configuration, it took approximately 45 minutes to write the join's
> > output.
> >
> > We then tried to perform the same join through a broadcast, as you
> > suggested, to see whether we could achieve better results, but we got
> > the following error when calling an action like broadcast_join.show()
> > on the output:
> >
> > Py4JJavaError: An error occurred while calling o699.showString.
> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]
> >
> > We would be grateful if you could support us on this.
> >
> > The broadcast join was performed as follows:
> >
> > broadcast_join = points_df.alias('points').join(
> >     f.broadcast(polygons_df).alias('polygons'),
> >     f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
> >
> > Attached you can find the pseudo-code we used to test the broadcast join.
> >
> > Looking forward to hearing from you.
> >
> > Kind regards,
> > Pietro Greselin
> >
> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
> >
> >> Hi Pietro,
> >>
> >> A few tips to optimize your join:
> >>
> >> 1. Mix the DataFrame and RDD APIs together, and use the RDD API for the
> >> join part. See the example here:
> >> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
> >>
> >> 2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
> >> use a large number of partitions (say 1000 or more).
> >>
> >> If this approach doesn't work, consider a broadcast join if needed.
> >> Broadcast the polygon side:
> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
> >>
> >> Thanks,
> >> Jia
> >>
> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com> wrote:
> >>
> >>> To whom it may concern,
> >>>
> >>> We observed the following Sedona behaviour and would like to ask your
> >>> opinion on how we can optimize it.
> >>>
> >>> Our aim is to perform an inner spatial join between a points_df and a
> >>> polygons_df, matching when a point in points_df is contained in a
> >>> polygon from polygons_df. Below are more details about the two
> >>> dataframes we are considering:
> >>> - points_df: contains 50 million events, with latitude and longitude
> >>>   rounded to the third decimal digit;
> >>> - polygons_df: contains 10k multi-polygons with 600 vertices on average.
> >>>
> >>> The issue we are reporting is a very long computing time: the spatial
> >>> join query never completes, even when running on a cluster with 40
> >>> workers with 4 cores each. No error is printed by the driver, but we
> >>> are receiving the following warning:
> >>>
> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> >>> but no index exists. Will build index on the fly.
> >>>
> >>> We were, however, able to run the same spatial join successfully when
> >>> considering only a very small sample of events. Do you have any
> >>> suggestion on how we can achieve the same result on higher volumes of
> >>> data, or is there a way we can optimize the join?
> >>>
> >>> Attached you can find the pseudo-code we are running.
> >>>
> >>> Looking forward to hearing from you.
> >>>
> >>> Kind regards,
> >>> Pietro Greselin

--
Adam Binford
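[Editorial sketch] Jia's second tip, spatial partitioning with many partitions, pays off because a spatial grid/index prunes most point-polygon pairs before the exact containment test ever runs (the "no index exists, will build index on the fly" warning means Sedona had to do that pruning work per task). The toy below illustrates the idea in plain Python; it is not Sedona's API. The uniform grid, cell size, and ray-casting containment test are illustrative stand-ins for Sedona's KDB-tree partitioning and ST_Contains:

```python
# Toy grid-partitioned spatial join (plain Python, NOT Sedona): index polygon
# bounding boxes on a uniform grid, then run the exact point-in-polygon test
# only for points landing in a cell that some polygon covers.
from collections import defaultdict

def contains(polygon, point):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a horizontal ray going right from the point.
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def grid_join(points, polygons, cell=1.0):
    """Join points to the polygons containing them, pruning via a grid index."""
    index = defaultdict(list)
    for pid, poly in polygons.items():
        xs = [v[0] for v in poly]
        ys = [v[1] for v in poly]
        # Register the polygon in every grid cell its bounding box overlaps.
        for cx in range(int(min(xs) // cell), int(max(xs) // cell) + 1):
            for cy in range(int(min(ys) // cell), int(max(ys) // cell) + 1):
                index[(cx, cy)].append(pid)
    result = []
    for pt in points:
        # Only polygons registered in this point's cell are candidates.
        for pid in index.get((int(pt[0] // cell), int(pt[1] // cell)), []):
            if contains(polygons[pid], pt):
                result.append((pt, pid))
    return result
```

With 50 million points and 10k polygons, the pruning step is what turns an infeasible 500-billion-pair cross product into a tractable per-cell workload; Sedona's spatialPartitioning plus a per-partition index plays the same role at cluster scale.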
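[Editorial sketch] The broadcast join Jia suggests works by replicating the small polygon table to every partition of the large point table, so the 50 million points never need to be shuffled. A minimal plain-Python illustration of that idea follows; it is not Sedona's implementation, and bbox_contains is a deliberately cheap stand-in for the exact ST_Contains predicate:

```python
# Toy broadcast join (plain Python, NOT Sedona): the small polygon dict is
# visible in full to every partition of points, so no shuffle of points occurs.
def bbox_contains(poly, pt):
    """Containment test against the polygon's bounding box only; a real
    engine would refine candidates with an exact ST_Contains check."""
    xs = [v[0] for v in poly]
    ys = [v[1] for v in poly]
    return min(xs) <= pt[0] <= max(xs) and min(ys) <= pt[1] <= max(ys)

def broadcast_join(point_partitions, polygons):
    """polygons is the broadcast side; each partition loop stands in for
    the work one executor would do with its local copy."""
    out = []
    for part in point_partitions:
        for pt in part:
            for pid, poly in polygons.items():
                if bbox_contains(poly, pt):
                    out.append((pt, pid))
    return out
```

This is why broadcasting is only appropriate when the polygon side is small enough to fit in each executor's memory, as it is here (10k polygons versus 50 million points).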