[ 
https://issues.apache.org/jira/browse/SEDONA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947977#comment-17947977
 ] 

Shaunak GuhaThakurata commented on SEDONA-733:
----------------------------------------------

Thanks [~kontinuation] for your prompt reply. We will look into the 
partitioning and the associated Spark configuration parameters. 

The data is public - USGS 10M DEM and Overture buildings.
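
As a rough sketch of the direction we plan to try (the table and column names here are placeholders, and the exact RS_ZonalStats argument order is our assumption from the docs, not something we have verified yet): spread the single county's footprints across more partitions before the raster join, so the zonal-stats work is not concentrated on a few tasks.

```sql
-- Assumed placeholder tables: dem_tiles (one raster per COG) and buildings.
-- 3 workers x 64 cores => target ~192 partitions (our guess, to be tuned).
SET spark.sql.shuffle.partitions = 192;

SELECT /*+ REPARTITION(192) */
       b.id,
       RS_ZonalStats(r.rast, b.geometry, 1, 'mean') AS mean_elev
FROM buildings b
JOIN dem_tiles r
  ON RS_Intersects(r.rast, b.geometry);
```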

> Raster functions are extremely slow on Google DataProc Spark
> ------------------------------------------------------------
>
>                 Key: SEDONA-733
>                 URL: https://issues.apache.org/jira/browse/SEDONA-733
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Shaunak GuhaThakurata
>            Priority: Major
>
> Workloads involving raster data, especially the RS_Clip and RS_ZonalStats 
> functions, are extremely slow. My workload is relatively simple: I am 
> calculating the mean elevation of open-source structure footprints using the 
> 10m DEM rasters. 
>  * My rasters (10m DEM) are divided into ~912 COG GeoTIFF files. The COGs 
> are 16x16 tiled
>  * The vector layer of structure footprints is in GeoParquet format, 
> partitioned by county FIPS code, which makes the footprints in each partition 
> spatially co-located
>  * I am testing with one county's data: ~134,000 structure footprints
>  * Rasters are kept in a GCS bucket
>  * Output is written to a GCS bucket
>  * I am running on Google Dataproc v2.2
>  * Spark version 3.5, Sedona 1.7.1
>  * Cluster config: 3 worker nodes, machine type n1-highmem-64
> The CPU utilization of the worker nodes is always below 10%. There is some 
> initial network traffic, but both the traffic and the disk I/O in the 
> cluster eventually drop to nearly zero.
> Is this expected behavior? Are there any workarounds to improve 
> performance?
> Any advice is greatly appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
