Kontinuation commented on PR #744: URL: https://github.com/apache/sedona/pull/744#issuecomment-1376763045
> @Kontinuation Quick follow-up question: since Sedona already offers ST_GeoHash (https://sedona.apache.org/1.3.1-incubating/api/flink/Function/?h=geohash#st_geohash), I believe one generate a GeoHash for each geometry, sort all geoms by GeoHash, then save the DataFrame in a GeoParquet. This way, the pruning power of GeoParquet will be much stronger given smaller bboxs in each Parquet partition? Yes. The data has to be clustered by spatial proximity to maximize the efficiency of spatial predicate push-down. Sorting the data using GeoHash or space-filling curves is a good approach to reorganizing the dataset for query efficiency. The following figures are results of sorting geoms by high precision GeoHash values, writing data as GeoParquet files and finally plotting the `bbox`es of the GeoParquet files. Though the partitionings are not perfect, they still provide an opportunity for pruning some files depending on the query window. | AREALM (50 files) | AREALM (200 files) | |-------------|-----------| | |  | | MS Buildings NYC (50 files) | MS Buildings NYC (200 files) | |-------------|-----------| | |  | The [Spatial Parquet paper](https://arxiv.org/pdf/2209.02158.pdf) has also experimented with the effect of sorting the data using Z curve and Hilbert curve. Their approach works for file-level data pruning as well, though the paper primarily deals with row group/page-level data skipping. Another approach is to partition the dataset by low-precision GeoHash values. It produces non-overlapping bounding boxes for point datasets. I've taken this approach when creating examples in the documentation. Carto's analytics toolbox for Databricks supports [partitioning the dataset by Z2 index of tile coordinates](https://github.com/CartoDB/analytics-toolbox-core/blob/main/clouds/databricks/libraries/scala/core/src/main/scala/com/carto/analyticstoolbox/spark/spatial/OptimizeSpatial.scala#L49-L67), which is a similar approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
