Shaunak GuhaThakurata created SEDONA-733:
--------------------------------------------

             Summary: Raster functions are extremely slow on Google DataProc 
Spark
                 Key: SEDONA-733
                 URL: https://issues.apache.org/jira/browse/SEDONA-733
             Project: Apache Sedona
          Issue Type: Bug
            Reporter: Shaunak GuhaThakurata


Workloads involving raster data, especially the RS_Clip and RS_ZonalStats 
functions, are extremely slow. My workload is relatively simple: I am 
calculating the mean elevation of open-source structure footprints using the 
10m DEM rasters. 
 * My rasters are divided into ~912 COG GeoTIFF files. The COGs are 16x16 tiled
 * The vector layer of structure footprints is in GeoParquet format, 
partitioned by county FIPS code, so the footprints in each partition are 
spatially co-located
 * Rasters are kept in a GCS bucket
 * Output is being written to GCS bucket
 * I am running on Google Dataproc v2.2
 * Spark version 3.5, Sedona 1.7.1
 * Cluster config: worker nodes: 3. Machine type: n1-highmem-64
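
For reference, the workload above has roughly the shape of the query sketched 
below. This is a minimal illustration, not the exact job: the view names 
(rasters, footprints) and column names (rast, geometry, structure_id) are 
hypothetical, and it assumes the COGs have already been loaded into a raster 
column (for example with RS_FromGeoTiff) and the footprints into a geometry 
column. The helper only builds the SQL text, so it can be inspected without a 
cluster.

```python
def build_zonal_mean_sql(
    raster_view: str = "rasters",        # hypothetical view holding one raster per row
    footprint_view: str = "footprints",  # hypothetical view holding one footprint per row
    raster_col: str = "rast",            # hypothetical raster column name
    geom_col: str = "geometry",          # hypothetical geometry column name
    id_col: str = "structure_id",        # hypothetical footprint identifier
    band: int = 1,
) -> str:
    """Build a Sedona SQL query computing the mean raster value per footprint.

    A footprint that intersects several rasters yields one row per raster,
    so a further aggregation step may be needed downstream.
    """
    return (
        f"SELECT f.{id_col}, "
        f"RS_ZonalStats(r.{raster_col}, f.{geom_col}, {band}, 'mean') AS mean_elevation "
        f"FROM {footprint_view} f "
        f"JOIN {raster_view} r "
        f"ON RS_Intersects(r.{raster_col}, f.{geom_col})"
    )


if __name__ == "__main__":
    # Print the generated query so it can be reviewed or pasted into a notebook.
    print(build_zonal_mean_sql())
```

The resulting string would be passed to the SQL entry point of a 
Sedona-enabled Spark session, assuming both views have been registered 
beforehand.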

The CPU utilization of the worker nodes is always below 10%. There is some 
initial network traffic, but ultimately the network traffic as well as disk 
I/O in the cluster drops to nearly 0.

Is this expected behavior? Are there any workarounds to improve performance?

Any advice is greatly appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
