I don't think that's the issue. The join detection is the same for both broadcast and non-broadcast, so the same match statement needs to run either way. I created an issue for what I found from the stack trace (don't have a copy of the stack trace to share easily): https://issues.apache.org/jira/browse/SEDONA-56
Adam

On Wed, Aug 4, 2021 at 9:02 PM Jia Yu <jiayu198...@gmail.com> wrote:

Hi Adam,

I believe the issue is caused by this chunk of code:
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109

If we move the broadcast join detection to the first part of the detector and set all other join detection to the "else", will it fix the issue?

    if (broadcast XXX)
    else {
      all other join detection.
    }

Thanks,
Jia

On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <adam...@gmail.com> wrote:

Okay, I actually did encounter it today. It happens when you have AQE enabled. I looked into it a little bit, and we might have to rework the SpatialIndexExec node to extend BroadcastExchangeLike, or maybe even BroadcastExchangeExec directly, but that might only be compatible with Spark 3+, so I'm not sure what to do about that. I don't know if there are specific AQE rules or optimizations that can be disabled to get it to work, but if you just disable AQE completely it should work for now. I'm also not at all familiar with the inner workings of AQE, so I don't know the right way to properly support it.

Adam

On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <adam...@gmail.com> wrote:

I haven't encountered any issues with it, but I can investigate with the full stack trace. Also, which version of Spark is this with?

Adam

On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:

Hi Pietro,

Can you please share the full stack trace of this scala.MatchError? I tried a couple of test cases but wasn't able to reproduce the error on my end. In fact, another user complained about the same issue a while back, so I suspect there is a bug in this part.

I also CCed the contributor of Sedona broadcast join.
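For readers hitting the same MatchError: Adam's interim workaround above (disabling AQE entirely) can be sketched in PySpark as follows. The app name and session variable are illustrative; `spark.sql.adaptive.enabled` is the standard Spark configuration key for Adaptive Query Execution.

```python
# Sketch of the interim workaround: disable Adaptive Query Execution so
# Spark's runtime re-planning never touches Sedona's SpatialIndexExec node.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sedona-broadcast-join")               # illustrative app name
    .config("spark.sql.adaptive.enabled", "false")  # turn AQE off
    .getOrCreate()
)

# Or, on an already-running session:
# spark.conf.set("spark.sql.adaptive.enabled", "false")
```

Note this trades away AQE's runtime optimizations for the whole session; it is a stopgap until SpatialIndexExec works with AQE.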
@adam...@gmail.com Hi Adam, do you have any idea about this issue?

Thanks,
Jia

On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.grese...@gmail.com> wrote:

Hello Jia,

thank you so much for your support.

We have been able to complete our task and perform a few runs with different numbers of partitions. So far we obtained the best performance when running on 20 nodes with the number of partitions set to 2000. With this configuration it took approximately 45 minutes to write the join's output.

We then tried to perform the same join through broadcast, as you suggested, to see whether we could achieve better results, but we got the following error when calling an action like broadcast_join.show() on the output:

    Py4JJavaError: An error occurred while calling o699.showString.
    : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]

We would be grateful if you could support us on this.

The broadcast join was performed as follows:

    broadcast_join = points_df.alias('points').join(
        f.broadcast(polygons_df).alias('polygons'),
        f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))

Attached you can find the pseudo-code we used to test the broadcast join.

Looking forward to hearing from you.

Kind regards,

Pietro Greselin

On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:

Hi Pietro,

A few tips to optimize your join:

1. Mix DF and RDD together and use the RDD API for the join part.
See the example here:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb

2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try a large number of partitions (say 1000 or more).

If this approach doesn't work, consider a broadcast join and broadcast the polygon side:
https://sedona.apache.org/api/sql/Optimizer/#broadcast-join

Thanks,
Jia

On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com> wrote:

To whom it may concern,

we observed the following Sedona behaviour and would like to ask your opinion on how we can optimize it.

Our aim is to perform an inner spatial join between a points_df and a polygons_df whenever a point in points_df is contained in a polygon from polygons_df. Below are more details about the two dataframes we are considering:
- points_df: contains 50 million events with latitude and longitude rounded to the third decimal digit;
- polygons_df: contains 10k multi-polygons with 600 vertices on average.

The issue we are reporting is a very long computing time: the spatial join query never completes, even when running on a cluster with 40 workers with 4 cores each. The driver prints no error, but we receive the following warning:

    WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true, but no index exists. Will build index on the fly.

We were able to run the same spatial join successfully when considering only a very small sample of events.
Do you have any suggestions on how we can achieve the same result on higher volumes of data, or whether there is a way to optimize the join?

Attached you can find the pseudo-code we are running.

Looking forward to hearing from you.

Kind regards,
Pietro Greselin

--
Adam Binford
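For reference, Jia's tips 1 and 2 (mix the DataFrame and RDD APIs, and partition with many KDB-tree partitions) can be sketched roughly as below, following the flow of the linked AirportsPerCountry notebook. This is a sketch, not a tested implementation: the class and method names are from the Sedona 1.0 Python API as I understand it, and the geometry column names ("pointshape", "polygonshape") and partition count are taken from the thread or chosen for illustration. It requires a running SparkSession with Sedona registered.

```python
# Rough sketch of the DF -> RDD spatial join flow (tips 1 and 2 above).
# Assumes `spark`, `points_df`, and `polygons_df` already exist and that
# Sedona is registered on the session; names below are illustrative.
from sedona.core.enums import GridType, IndexType
from sedona.core.spatialOperator import JoinQuery
from sedona.utils.adapter import Adapter

# Tip 1: drop from the DataFrame API to SpatialRDDs for the join itself.
point_rdd = Adapter.toSpatialRdd(points_df, "pointshape")
polygon_rdd = Adapter.toSpatialRdd(polygons_df, "polygonshape")

point_rdd.analyze()
polygon_rdd.analyze()

# Tip 2: KDB-tree partitioning with a large partition count (1000+),
# and reuse the point side's partitioner on the polygon side.
point_rdd.spatialPartitioning(GridType.KDBTREE, 1000)
polygon_rdd.spatialPartitioning(point_rdd.getPartitioner())

# Pre-build an index on the partitioned RDD so the join does not hit the
# "UseIndex is true, but no index exists" on-the-fly build warning.
point_rdd.buildIndex(IndexType.QUADTREE, True)

# Contains-style join with polygons as the query windows;
# useIndex=True, considerBoundaryIntersection=True.
result = JoinQuery.SpatialJoinQueryFlat(point_rdd, polygon_rdd, True, True)

# Convert the resulting pair RDD back to a DataFrame for downstream use.
result_df = Adapter.toDf(result, spark)
```

The key design point in the thread is that the partition count (1000+ here, versus the notebook's 4) controls join parallelism and skew handling on the 50-million-point side.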