Shaunak GuhaThakurata created SEDONA-733:
--------------------------------------------
Summary: Raster functions are extremely slow on Google DataProc Spark
Key: SEDONA-733
URL: https://issues.apache.org/jira/browse/SEDONA-733
Project: Apache Sedona
Issue Type: Bug
Reporter: Shaunak GuhaThakurata

Workloads involving raster data, especially the RS_Clip and RS_ZonalStats functions, are extremely slow.

My workload is relatively simple. I am calculating the mean elevation of open source structure footprints using the 10m DEM rasters.
* My rasters are divided into ~912 COG GeoTIFF files. The COGs are 16x16 tiled
* The vector layer of structure footprints is in GeoParquet format, partitioned by county FIPS code, which makes the footprints in each partition spatially co-located
* Rasters are kept in a GCS bucket
* Output is being written to a GCS bucket
* I am running on Google DataProc v2.2
* Spark version 3.5, Sedona 1.7.1
* Cluster config: 3 worker nodes, machine type n1-highmem-64

The CPU utilization of the worker nodes is always below 10%. There is some initial network traffic, but ultimately the network traffic as well as disk I/O in the cluster drops to nearly 0.

Is this expected behavior? Are there any workarounds to improve performance? Any advice is greatly appreciated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
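For readers trying to reproduce the report, the workload described above might look roughly like the sketch below. It is not the reporter's code. The bucket paths, the view names, and the `structure_id` and `geometry` columns are hypothetical placeholders, and the join on RS_Intersects is only one assumed way to pair footprints with rasters.

```python
# Hypothetical locations: none of these paths appear in the report.
RASTER_GLOB = "gs://example-bucket/dem_10m/*.tif"    # assumed home of the COGs
FOOTPRINTS_PATH = "gs://example-bucket/footprints/"  # assumed GeoParquet root


def build_mean_elevation_sql(raster_view="dem", footprint_view="footprints"):
    """Return Sedona SQL computing the mean of band 1 inside each footprint.

    Column names (structure_id, geometry, rast) are assumptions.
    A footprint that intersects several rasters yields one row per raster.
    """
    return f"""
        SELECT f.structure_id,
               RS_ZonalStats(r.rast, f.geometry, 1, 'avg') AS mean_elevation
        FROM {footprint_view} AS f
        JOIN {raster_view} AS r
          ON RS_Intersects(r.rast, f.geometry)
    """


def run(sedona):
    """Register both inputs as temp views and return the result DataFrame.

    `sedona` is expected to be a session from SedonaContext.create(...).
    """
    # Read each GeoTIFF as bytes, then decode it into a Sedona raster.
    (sedona.read.format("binaryFile").load(RASTER_GLOB)
        .selectExpr("path", "RS_FromGeoTiff(content) AS rast")
        .createOrReplaceTempView("dem"))
    # Read the footprints; the geometry column name is assumed.
    (sedona.read.format("geoparquet").load(FOOTPRINTS_PATH)
        .createOrReplaceTempView("footprints"))
    return sedona.sql(build_mean_elevation_sql())
```

Calling `run(sedona).explain()` on a cluster would show how Spark plans the raster-to-footprint join, which may help when looking into the low CPU utilization described above.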