Hi Hanxi,

Thanks for your interest in Sedona. I suggest that you read my post in the
Sedona Discord:
https://discord.com/channels/1034906732264181760/1038573531434328135/1038573531434328135

To join our Discord, please read:
https://sedona.apache.org/1.5.1/community/contact/#discord-server

To your questions:

1. GeoParquet filter pushdown makes NO contribution to the performance of a
spatial join at all, because filter pushdown only applies to "filter" queries,
not to join queries. The only impact a GeoParquet file could make is probably
on the number of partitions.
2. The number of partitions and the dominant side of the join both have a
significant impact on join performance. Currently, you will need to find the
best number of partitions yourself; Sedona tries to pick a good one
automatically, but it is not perfect. Sedona 1.5.1 also changed the default
spatial join index to rtree, which might speed up the join query performance
in some cases. See the sketch below.
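
To make point 2 concrete, here is a rough sketch in PySpark. The paths, the
partition count, and the broadcast hint are only illustrations; tune them
against your own data and verify the plan with explain(). It assumes, as in
your query, that the geometry columns are WKT strings in plain parquet.

from pyspark.sql.functions import broadcast, expr

# Sketch only: paths are placeholders and the partition count (2000) is just
# an example value that has to be tuned for the data volume and cluster size.
t_df = sedona.read.parquet("s3://bucket/path/to/t/").repartition(2000)
g_df = sedona.read.parquet("s3://bucket/path/to/g/")

t_df.createOrReplaceTempView("t")
g_df.createOrReplaceTempView("g")

result = sedona.sql("""
    select t.device_uid, g.grid_id
    from t, g
    where st_contains(ST_GeomFromWKT(g.geometry), ST_GeomFromWKT(t.geometry))
""")

# If the polygon side ever fits in executor memory, a broadcast hint is one
# way to control which side drives the join: Sedona then builds the index on
# the broadcast side instead of spatially shuffling both sides.
result_broadcast = (
    t_df.alias("t")
        .join(
            broadcast(g_df).alias("g"),
            expr("ST_Contains(ST_GeomFromWKT(g.geometry), ST_GeomFromWKT(t.geometry))"),
        )
        .select("t.device_uid", "g.grid_id")
)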

Thanks,
Jia



On Mon, Jan 15, 2024 at 12:01 PM Zhang, Hanxi (ISED/ISDE)
<hanxi.zh...@ised-isde.gc.ca> wrote:
>
>
>
> Hello Sedona community,
>
>
>
> I was benchmarking the following geospatial range join query (and similar
> queries) using Sedona, on serverless Spark clusters provisioned via AWS Glue
> notebooks. I have been using Sedona version v1.5.0. PySpark dataframes 't'
> and 'g' are both loaded from parquet files hosted on Amazon S3, and are
> partitioned in nested folder structures based on region, year, month and
> date. Specifically, 'g' is a polygon layer with millions of rows representing
> polygons across Canada, and 't' is a point layer that has about 200 columns
> and up to tens of billions of rows depending on the region/year/month/date
> chosen, representing points across Canada and the U.S.
>
>
>
> result = sedona.sql(
>     """
>     select t.device_uid, g.grid_id
>     from t, g
>     where st_contains(ST_GeomFromWKT(g.geometry), ST_GeomFromWKT(t.geometry))
>     """
> )
>
>
>
>
>
> I was able to increase the Spark cluster size in AWS Glue to handle larger
> volumes of the 't' dataset. For instance, as a stress test, I was able to
> complete a query with 't' amounting to 14 billion rows in Canada in a bit
> over 3 hours, with a Spark cluster of 2 TB of memory. I am in general
> impressed with the performance, and my questions below concern potential
> avenues to improve the performance a bit further if possible, and to deepen
> my understanding of Sedona performance tuning. I am attaching the result of
> running 'explain' on my query. Any comments and suggestions are greatly
> appreciated! And I apologize for the long message.
>
>
>
> 1) GeoParquet-level predicate pushdown
>
>
>
> Based on https://sedona.apache.org/1.5.0/api/sql/Optimizer/, my understanding
> is that GeoParquet-level predicate pushdown happens automatically. I have
> experimented with roughly equal volumes of 't' in Canada and in the U.S.
> When 't' is on the order of hundreds of millions of objects in the U.S., the
> above query would complete in 1/6 of the time taken by the same volume of
> 't' data in Canada, with an identical size and configuration of Spark
> cluster. That seems to confirm that GeoParquet-level predicate pushdown is at
> play and causes 'polygons in Canada containing points in the U.S.' to return
> right away.
>
>
>
> When I increased the volume of 't' data 10x, to the order of billions of
> point objects, the U.S. case was still faster than the Canada case, but only
> marginally, with a Spark cluster 8x larger in both cases. Given this
> seemingly diminishing benefit of GeoParquet-level predicate pushdown, I
> wonder whether I am hitting a limitation of Sedona's predicate pushdown
> mechanism, or whether it is due to limitations in the way Spark clusters are
> scaled up in AWS Glue.
>
>
>
> In general, how can one confirm, based on Spark logs etc., that
> GeoParquet-level predicate pushdown is active? Can it be turned on and off?
> And if so, can one see statistics on the volume of data and the files
> scanned?
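>
> For instance, is inspecting the physical plan the right way to check? I had
> something like the following in mind (just a sketch: the path and the literal
> polygon are made up, and it assumes 'geometry' is a native geometry column in
> the GeoParquet files):
>
> # Sketch: a pure spatial filter against a constant geometry, i.e. the kind of
> # query where GeoParquet pushdown should be able to apply, followed by
> # explain() to inspect the file scan node in the physical plan.
> df = sedona.read.format("geoparquet").load("s3://bucket/path/to/t/")
> df.where(
>     "ST_Intersects(geometry, "
>     "ST_GeomFromText('POLYGON((-75 45, -74 45, -74 46, -75 46, -75 45))'))"
> ).explain()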
>
>
>
> As a related note and question, in
> https://sedona.apache.org/1.5.0/api/sql/Optimizer/ under "Push spatial
> predicates to GeoParquet", it states that "To maximize the performance of
> Sedona GeoParquet filter pushdown, we suggest that you sort the data by their
> geohash values and then save as a GeoParquet file". In my case, the large
> point layer 't' does have a geohash6 column, and I took the following steps
> to see if that would help:
>
>
>
> a)  select * from 't' order by t.geohash6
>
> b)  save the result to S3 as geoparquet files, following the same 
> partitioning scheme
>
> c)  run the 'g' contains 't' query against the newly ordered 't' dataset with
> a Spark cluster of the same size and configuration (steps a) and b) are
> sketched in PySpark below)
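>
> In PySpark terms, steps a) and b) were roughly the following (a sketch: the
> path and the partition column names are placeholders, and the geometry is
> assumed to be stored as WKT, as in the query above):
>
> from pyspark.sql.functions import expr
>
> # Sketch of steps a) and b): parse the WKT, sort by the geohash6 column, and
> # write back to S3 as GeoParquet with the same partitioning scheme.
> t_sorted = (
>     t.withColumn("geometry", expr("ST_GeomFromWKT(geometry)"))
>      .orderBy("geohash6")
> )
> (t_sorted.write
>      .format("geoparquet")
>      .partitionBy("region", "year", "month", "date")
>      .mode("overwrite")
>      .save("s3://bucket/path/to/t_sorted/"))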
>
>
>
> Based on my results, that does help performance a little, but the improvement
> is so marginal as to be negligible. I am a little confused as to whether
> GeoParquet-level predicate pushdown is active or not.
>
>
>
> 2) The use of the PySpark dataframe function repartition(n, column)
>
>
>
> I have experimented with the repartition function on the dataset 't', i.e.,
> the layer consisting of a large number of points. In my experience,
> increasing the count n helps performance up to a certain point, beyond which
> having more partitions can actually lead to slower queries and may also cause
> the Spark cluster to fail with an out-of-memory error.
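>
> Concretely, the call I have been varying is along these lines (a sketch; the
> count and the column are only examples):
>
> # Sketch: repartition the big point dataframe before registering it for the
> # join; n is the knob being varied, geohash6 is just an example column.
> n = 2000  # example value only
> t_repartitioned = t.repartition(n, "geohash6")
> t_repartitioned.createOrReplaceTempView("t")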
>
>
>
> If I understand correctly, finding the right partition count is a matter of
> experimentation. Is there something in the Spark logs that could indicate
> that the number of partitions is too low or too high?
>
>
>
> A second sub-question regarding repartition: should one repartition all
> dataframes involved in a query to improve performance, or just the biggest
> one? In my case, since the point layer 't' is much larger than the polygon
> layer 'g' in terms of both object count and data volume (GBs on S3), I have
> only repartitioned 't' so far.
>
>
>
> 3) Anything else that you would recommend trying to improve the performance 
> of my query?
>
>
>
> Thank you!
>
>
>
> Hanxi