[ https://issues.apache.org/jira/browse/SEDONA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947977#comment-17947977 ]
Shaunak GuhaThakurata commented on SEDONA-733:
----------------------------------------------

Thanks [~kontinuation] for your prompt reply. We will look into the partitioning and the associated Spark configuration parameters. The data is public: the USGS 10 m DEM and Overture buildings.

> Raster functions are extremely slow on Google DataProc Spark
> ------------------------------------------------------------
>
>                 Key: SEDONA-733
>                 URL: https://issues.apache.org/jira/browse/SEDONA-733
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Shaunak GuhaThakurata
>            Priority: Major
>
> Workloads involving raster data, especially the RS_Clip and RS_ZonalStats functions, are extremely slow. My workload is relatively simple: I am calculating the mean elevation of open-source structure footprints using the 10 m DEM rasters.
> * My rasters (10 m DEM) are divided into ~912 COG GeoTIFF files. The COGs are tiled 16x16.
> * The vector layer of structure footprints is in GeoParquet format, partitioned by county FIPS code, which makes the footprints in each partition spatially co-located.
> * I am testing with one county's data: ~134,000 structure footprints.
> * Rasters are kept in a GCS bucket.
> * Output is written to a GCS bucket.
> * I am running on Google Dataproc v2.2.
> * Spark version 3.5, Sedona 1.7.1.
> * Cluster config: 3 worker nodes, machine type n1-highmem-64.
>
> CPU utilization on the worker nodes stays below 10%. There is some initial network traffic, but eventually both the traffic and the disk I/O in the cluster drop to nearly zero.
>
> Is this expected behavior? Are there any workarounds to improve performance?
> Any advice is greatly appreciated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
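As a starting point for the partitioning/configuration investigation mentioned above, a minimal sketch of Spark settings that control input-split size and shuffle parallelism. The property names are standard Spark 3.5; the values are illustrative guesses for a 3-node n1-highmem-64 cluster (192 cores total), not recommendations from this thread:

```
# spark-defaults style properties (values are illustrative, tune per workload)
spark.sql.shuffle.partitions=384          # ~2x total executor cores
spark.default.parallelism=384
spark.sql.files.maxPartitionBytes=33554432  # 32 MB splits -> more input tasks per COG
spark.sql.files.openCostInBytes=4194304     # discourage packing many files into one task
```

Low CPU utilization with ~912 input files can also come from the vector side having too few partitions; calling `repartition(n)` on the footprints DataFrame before the raster join (with `n` on the order of the cluster's core count) is another knob to try.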