On Wed, Apr 19, 2023 at 6:30 PM Jia Yu <[email protected]> wrote:
>
> Hi,
>
> This is likely caused by skewed data. To address that:
>
> (1) Try to increase the number of partitions in your two input
> DataFrames. For example, df = df.repartition(1000)
> (2) Try switching the sides of the spatial join; this might improve
> the join performance (see the sketch after this list).
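>
> A minimal PySpark sketch of both suggestions, assuming Sedona's SQL
> functions are already registered on a SparkSession named spark; the
> paths, the column name "geom", and the partition count 1000 are
> illustrative placeholders, not taken from this thread:
>
>   df1 = spark.read.format("geoparquet").load("s3://my-bucket/polygons/")
>   df2 = spark.read.format("geoparquet").load("s3://my-bucket/points/")
>   # (1) raise the partition count so skewed work is spread more evenly
>   df1 = df1.repartition(1000)
>   df2 = df2.repartition(1000)
>   df1.createOrReplaceTempView("df1")
>   df2.createOrReplaceTempView("df2")
>   result = spark.sql(
>       "SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)")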
>
> Rule of thumb:
> The spatial partitioning grids (which directly affect the load
> balance of the workload) should be built on the larger dataset in a
> spatial join. We call this dataset the dominant dataset.
>
> In Sedona 1.3.1-incubating and earlier versions:
>
> dominant dataset is df1:
> SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)
>
> dominant dataset is df2:
> SELECT * FROM df1, df2 WHERE ST_CoveredBy(df2.geom, df1.geom)
>
> In Sedona 1.4.0 and later:
>
> dominant dataset is df1:
> SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)
>
> dominant dataset is df2:
> SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)
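>
> Reading the two 1.4.0 examples together, the first relation in the
> FROM clause is the one treated as dominant, so the side can be chosen
> at runtime. A minimal sketch, assuming row count is an acceptable
> proxy for dataset size and reusing the df1/df2 temp views from the
> snippet above:
>
>   # Put the larger dataset first so the partitioning grid is built
>   # on it; note that df.count() triggers a full scan of each dataset.
>   if df2.count() > df1.count():
>       query = "SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)"
>   else:
>       query = "SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)"
>   result = spark.sql(query)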
>
> Thanks,
> Jia
>
> On Wed, Apr 19, 2023 at 12:00 PM Zhang, Hanxi (ISED/ISDE)
> <[email protected]> wrote:
> >
> > Hello Sedona community,
> >
> >
> > I am running a geospatial Sedona cluster on Amazon EMR. Specifically, my 
> > cluster is based on Spark 3.3.0 and Sedona 1.3.1-incubating.
> >
> >
> > In my cluster there are 10 executor nodes, and each runs two
> > executors per my configuration. I am using this cluster to run a
> > large ST_Contains join between two datasets in GeoParquet format.
> >
> >
> > The issue I have been experiencing is that the vast majority of the
> > executors complete their tasks within about 2 minutes. However, one
> > or two executors get stuck on the last few tasks. From the stderr
> > logs in the Spark History Web UI, this is the final state of the
> > problematic executors:
> >
> > 23/04/19 05:37:10 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1600001
> > 23/04/19 06:20:08 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1700001
> > 23/04/19 07:02:13 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1800001
> > 23/04/19 07:45:25 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1900001
> > 23/04/19 08:29:24 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2000001
> > 23/04/19 09:14:25 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2100001
> > 23/04/19 09:56:18 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2200001
> > 23/04/19 10:38:24 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2300001
> > 23/04/19 11:21:53 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2400001
> > 23/04/19 12:05:49 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2500001
> > 23/04/19 12:53:34 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2600001
> > 23/04/19 13:36:55 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2700001
> > 23/04/19 14:19:18 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2800001
> > 23/04/19 15:04:06 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2900001
> > 23/04/19 16:08:55 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 3000001
> > 23/04/19 16:51:07 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 3100001
> > 23/04/19 17:36:03 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 3200001
> >
> >
> > I would greatly appreciate any comments on what the above logs
> > indicate, and what needs to be done with the Spark SQL query, the
> > input datasets, or the Spark/Sedona configuration to alleviate this
> > issue. Thank you very much!