2010YOUY01 opened a new issue, #530:
URL: https://github.com/apache/sedona-db/issues/530
# Motivation
Legacy Parquet files often store geometry as a `BINARY` column whose payload
is WKB. The snippet below converts such a file into GeoParquet. It explicitly
converts the binary WKB payload into a geometry value (and sets the SRID), so
that SedonaDB recognizes the column as geometry and `to_parquet()` can write
GeoParquet metadata correctly.
```python
import sedona.db

sd = sedona.db.connect()

# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32
df = sd.read_parquet("/data/geo_legacy.parquet")

# Register a view name for SQL
df.to_view("t", overwrite=True)

# Decode the WKB payload, tag it with SRID 4326, and keep the other columns
df = sd.sql("""
    SELECT
        ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
        * EXCLUDE (geo_bin)
    FROM t
""")
df.to_parquet("geo_geoparquet.parquet")
```
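One way to confirm that the conversion worked is to inspect the `geo` key that GeoParquet stores in the Parquet file-level key-value metadata. The sketch below is only an illustration: `summarize_geo_metadata` is a hypothetical helper, not part of SedonaDB, and the sample metadata dict is hand-written rather than read from a real file.

```python
import json


def summarize_geo_metadata(file_metadata):
    """Return (primary_column, {column: encoding}) from Parquet key-value metadata.

    Hypothetical helper for illustration; expects a dict with bytes keys and
    values, as returned by pyarrow for Parquet schema metadata.
    """
    raw = file_metadata.get(b"geo")
    if raw is None:
        raise ValueError("no 'geo' key found: not a GeoParquet file")
    geo = json.loads(raw)
    encodings = {name: col.get("encoding") for name, col in geo["columns"].items()}
    return geo["primary_column"], encodings


# With pyarrow installed, the dict could come from the written file:
#   import pyarrow.parquet as pq
#   file_metadata = pq.read_schema("geo_geoparquet.parquet").metadata
# Here a hand-written stand-in is used instead (assumed content).
example_metadata = {
    b"geo": json.dumps(
        {
            "version": "1.0.0",
            "primary_column": "geometry",
            "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
        }
    ).encode("utf-8")
}

primary, encodings = summarize_geo_metadata(example_metadata)
print(primary, encodings)
```

A legacy file whose geometry is a plain `BINARY` column would have no `geo` key at all, which is the case the helper reports as an error.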
# Proposed new API
It would be helpful to have an easier API for this. Using a dedicated method
(instead of fusing the cast into `read_parquet()` or `to_parquet()`) makes the
conversion more flexible, especially when “logically geometry, physically
WKB-in-binary” columns come from other sources or are produced mid-query.
```python
def with_geometry(...):
    """
    Convert one or more binary WKB columns into geometry columns.

    Args:
        columns: A column name or list of column names containing WKB
            payloads.
        crs: Optional CRS identifier (e.g., 4326 or "EPSG:4326").
        validate: If True, validate WKB payloads while converting.
        primary: Optional name to mark as the primary geometry column.

    The converted geometry columns are projected first (in the order of
    ``columns`` or with ``primary`` first), followed by the remaining
    columns.

    Examples:
        >>> sd = sedona.db.connect()
        >>> df = sd.read_parquet("geo_legacy.parquet").with_geometry(
        ...     columns=["geo_bin"],
        ...     crs="EPSG:4326",
        ...     validate=True,
        ...     primary="geo_bin",
        ... )
    """
```
## Example usage
```python
# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32
df = sd.read_parquet("/data/geo_legacy.parquet")
df = df.with_geometry(
    columns="geo_bin",
    crs=4326,
    validate=True,
    primary="geo_bin",
)
df.to_parquet("geo_geoparquet.parquet")
```
## Implementation
Internally, this simply adds a projection expression for each geometry column
(e.g., `ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326)`).