[ https://issues.apache.org/jira/browse/SEDONA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947674#comment-17947674 ]
Kristin Cowalcijk commented on SEDONA-733:
------------------------------------------

Can you provide the SQL or DataFrame query you are running? If the raster and footprint data are publicly accessible, links to the data sources would be very helpful.

According to the description of the workload, this is very likely a CPU-bound query, yet low CPU utilization is observed. This can happen when the default partitioning scheme of the DataFrames does not distribute the data and workload evenly across all of your executor cores. Since your cluster has 3 n1-highmem-64 VMs (192 cores in total), the workload should be divided into at least 192 partitions to saturate the entire cluster. You can verify whether this is the case by inspecting the Executors tab in the Spark UI: if only a few tasks are running on each executor, uneven partitioning is likely the problem. Try repartitioning the input DataFrames into more partitions and check whether the executor CPUs are still underutilized.

> Raster functions are extremely slow on Google DataProc Spark
> ------------------------------------------------------------
>
>                 Key: SEDONA-733
>                 URL: https://issues.apache.org/jira/browse/SEDONA-733
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Shaunak GuhaThakurata
>            Priority: Major
>
> Workloads involving raster data, especially the RS_Clip and RS_ZonalStats functions, are extremely slow. My workload is relatively simple: I am calculating the mean elevation of open-source structure footprints using 10 m DEM rasters.
> * My rasters (10 m DEM) are divided into ~912 COG GeoTIFF files. The COGs are 16x16 tiled.
> * The vector layer of structure footprints is in GeoParquet format, partitioned by county FIPS code, which makes the footprints in each partition spatially co-located.
> * I am testing with one county's data: ~134,000 structure footprints.
> * Rasters are kept in a GCS bucket.
> * Output is written to a GCS bucket.
> * I am running on Google DataProc v2.2, Spark 3.5, Sedona 1.7.1.
> * Cluster config: worker nodes: 3. Machine type: n1-highmem-64.
>
> The CPU utilization of the worker nodes is always below 10%. There is some initial network traffic, but ultimately the traffic as well as disk I/O in the cluster drops to nearly 0.
> Is this expected behavior? Are there any workarounds to improve performance?
> Any advice is greatly appreciated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
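The repartitioning suggestion in the comment can be sketched in Sedona SQL. This is a minimal, hypothetical example, not the reporter's actual query: the table and column names (`footprints`, `dem_rasters`, `geom`, `rast`) are invented for illustration, and the exact `RS_ZonalStats` parameter list should be checked against the Sedona documentation for your version. The `REPARTITION(192)` hint (available since Spark 2.4) forces the data into 192 partitions so that all 3 x 64 executor cores receive work; the same effect can be had with `DataFrame.repartition(192)` in the DataFrame API.

```sql
-- Hypothetical table/column names; verify RS_ZonalStats arguments
-- against the Sedona docs for your release (1.7.1).
-- REPARTITION(192) spreads the work across all 192 executor cores.
SELECT /*+ REPARTITION(192) */
       f.footprint_id,
       RS_ZonalStats(r.rast, f.geom, 1, 'mean') AS mean_elevation
FROM footprints f
JOIN dem_rasters r
  ON RS_Intersects(r.rast, f.geom)
```

After running a query like this, the Executors tab in the Spark UI should show roughly as many concurrent tasks as there are executor cores; if it still shows only a handful of tasks, the skew is likely upstream (e.g. in how the input files are read).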