2010YOUY01 opened a new pull request, #560:
URL: https://github.com/apache/sedona-db/pull/560
## Motivation
Today, converting legacy Parquet files that store geometry as raw WKB
payloads inside `BINARY` columns into GeoParquet requires a full SQL rewrite
pipeline. Users must explicitly parse WKB, assign CRS, and reconstruct the
geometry column before writing:
```python
# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32
df = sd.read_parquet("/data/geo_legacy.parquet")
df = df.to_view("t", overwrite=True)
df = sd.sql("""
    SELECT
        ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
        * EXCLUDE (geo_bin)
    FROM t
""")
df.to_parquet("geo_geoparquet.parquet")
```
This works, but an easier-to-use Python API would let users simply say:
> “Treat this binary column as a geometry column with encoding=WKB and
CRS=EPSG:4326.”
This PR introduces a `geometry_columns` option on `read_parquet()` so legacy
Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.
---
## Proposed Python API
### Demo
```python
df = sd.read_parquet(
    "/data/geo_legacy.parquet",
    geometry_columns={
        "geo_bin": {
            "encoding": "WKB",
            "crs": 4326,
        }
    },
)
df.to_parquet("geo_geoparquet.parquet")
```
### Specification
```
geometry_columns: Optional mapping of column name to GeoParquet column
    metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
    binary WKB columns as geometry columns. Supported keys:
    - encoding: "WKB" (required)
    - crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326).
      If not provided, the default CRS is OGC:CRS84
      (https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means
      the data in this column must be stored in longitude/latitude
      based on the WGS84 datum.
    - edges: "planar" (default) or "spherical"

    Useful for:
    - Legacy Parquet files with Binary columns containing WKB payloads.
    - Overriding GeoParquet metadata when fields like `crs` are missing.

    Precedence:
    - If a column appears in both GeoParquet metadata and this option,
      the geometry_columns entry takes precedence.

    Example:
    - For `geo.parquet(geo1: geometry, geo2: geometry, geo3: binary)`,
      `read_parquet("geo.parquet", geometry_columns={"geo2": {...}, "geo3": ...})`
      will override `geo2` metadata and treat `geo3` as a geometry column.

    Safety:
    - Columns specified here are not validated for WKB correctness.
      Invalid WKB payloads may cause undefined behavior.
```
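The key rules in the specification can be illustrated with a small validation sketch. This is not the code in this PR: `normalize_geometry_columns` is a hypothetical helper, the error messages are invented, and mapping an integer SRID to an `EPSG:<n>` string is an assumption based on the `EPSG:4326` example above.

```python
from typing import Any, Dict, Optional

_ALLOWED_KEYS = {"encoding", "crs", "edges"}
_ALLOWED_EDGES = {"planar", "spherical"}


def normalize_geometry_columns(
    geometry_columns: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: validate and normalize a geometry_columns mapping."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for name, spec in (geometry_columns or {}).items():
        unknown = set(spec) - _ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{name}: unsupported keys {sorted(unknown)}")
        # "encoding" is the only required key, and "WKB" is its only value.
        if spec.get("encoding") != "WKB":
            raise ValueError(f'{name}: "encoding" is required and must be "WKB"')
        edges = spec.get("edges", "planar")  # "planar" is the documented default
        if edges not in _ALLOWED_EDGES:
            raise ValueError(f'{name}: "edges" must be "planar" or "spherical"')
        entry: Dict[str, Any] = {"encoding": "WKB", "edges": edges}
        if "crs" in spec:
            crs = spec["crs"]
            # Assumption: an integer SRID is interpreted as an EPSG code.
            entry["crs"] = f"EPSG:{crs}" if isinstance(crs, int) else crs
        # When "crs" is omitted, it stays absent, which means OGC:CRS84.
        normalized[name] = entry
    return normalized
```

Under these assumptions, `{"geo_bin": {"encoding": "WKB", "crs": 4326}}` normalizes to an entry with `crs` set to `"EPSG:4326"` and `edges` set to `"planar"`.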
---
## Key Changes
1. Parse the Python option fields into the Rust `GeoParquetColumnMetadata` struct.
2. In the schema inference step, first infer geometry columns from the
GeoParquet file metadata as before, then apply the options to add or override
geometry columns.
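
The second step amounts to a merge in which the user-supplied option wins. The sketch below shows that precedence rule in Python for illustration only; the actual change is implemented in Rust. `merge_geometry_columns` is a hypothetical name, and replacing a column's entry wholesale (rather than merging it field by field) is an assumption.

```python
from typing import Any, Dict

ColumnMetadata = Dict[str, Dict[str, Any]]


def merge_geometry_columns(
    file_metadata: ColumnMetadata, overrides: ColumnMetadata
) -> ColumnMetadata:
    """Hypothetical sketch: combine GeoParquet file metadata with user overrides."""
    # Start from what the file's GeoParquet metadata declares.
    merged = {name: dict(spec) for name, spec in file_metadata.items()}
    for name, spec in overrides.items():
        # Assumption: the override replaces the whole entry for that column.
        # Columns absent from the file metadata are added as new geometry columns.
        merged[name] = dict(spec)
    return merged


# Mirrors the example in the specification: geo2 is overridden, geo3 is added.
from_file = {
    "geo1": {"encoding": "WKB"},
    "geo2": {"encoding": "WKB"},
}
from_option = {
    "geo2": {"encoding": "WKB", "crs": "EPSG:4326"},
    "geo3": {"encoding": "WKB"},
}
result = merge_geometry_columns(from_file, from_option)
```

Here `geo1` keeps its file metadata, `geo2` takes the CRS from the option, and `geo3` becomes a geometry column even though the file declared it as plain binary.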