2010YOUY01 opened a new pull request, #560:
URL: https://github.com/apache/sedona-db/pull/560

   ## Motivation
   
   Today, converting legacy Parquet files that store geometry as raw WKB 
payloads inside `BINARY` columns into GeoParquet requires a full SQL rewrite 
pipeline. Users must explicitly parse WKB, assign CRS, and reconstruct the 
geometry column before writing:
   
   ```python
   # geo_legacy.parquet schema
   # - geo_bin: Binary (payload is WKB)
   # - c1: Int32
   # - c2: Int32
   
   df = sd.read_parquet("/data/geo_legacy.parquet")
   
   df = df.to_view("t", overwrite=True)
   
   df = sd.sql("""
     SELECT
       ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
       * EXCLUDE (geo_bin)
     FROM t
   """)
   
   df.to_parquet("geo_geoparquet.parquet")
   ```
   
   This works, but it would be easier to have a Python API that simply says:
   
   > “Treat this binary column as a geometry column with encoding=WKB and 
CRS=EPSG:4326.”
   
   This PR introduces a `geometry_columns` option on `read_parquet()` so legacy 
Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.
   
   ---
   
   ## Proposed Python API
   
   ### Demo
   ```python
   df = sd.read_parquet(
       "/data/geo_legacy.parquet",
       geometry_columns={
           "geo_bin": {
               "encoding": "WKB",
               "crs": 4326,
           }
       },
   )
   
   df.to_parquet("geo_geoparquet.parquet")
   ```
   
   ### Specification
   ```
               geometry_columns: Optional mapping of column name to GeoParquet column
                   metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
                   binary WKB columns as geometry columns. Supported keys:
                   - encoding: "WKB" (required)
                   - crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326).
                     If not provided, the default CRS is OGC:CRS84
                     (https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means
                     the data in this column must be stored in longitude/latitude
                     based on the WGS84 datum.
                   - edges: "planar" (default) or "spherical"
                   Useful for:
                   - Legacy Parquet files with Binary columns containing WKB payloads.
                   - Overriding GeoParquet metadata when fields like `crs` are missing.
                   Precedence:
                   - If a column appears in both GeoParquet metadata and this option,
                     the geometry_columns entry takes precedence.
                   Example:
                   - For `geo.parquet(geo1: geometry, geo2: geometry, geo3: binary)`,
                     `read_parquet("geo.parquet", geometry_columns={"geo2": {...}, "geo3": ...})`
                     will override `geo2` metadata and treat `geo3` as a geometry column.
                   Safety:
                   - Columns specified here are not validated for WKB correctness.
                     Invalid WKB payloads may cause undefined behavior.
   ```
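
To make the rules in the specification concrete, here is a minimal sketch of how a `geometry_columns` mapping could be checked and normalized before it is handed to the reader. The helper name `normalize_geometry_columns` is hypothetical and not part of this PR, and the mapping of an integer SRID to an `EPSG:<code>` string is an assumption based on the demo above, not confirmed behavior.

```python
from typing import Any, Dict, Mapping

ALLOWED_KEYS = {"encoding", "crs", "edges"}
ALLOWED_EDGES = {"planar", "spherical"}
DEFAULT_CRS = "OGC:CRS84"  # default stated in the specification


def normalize_geometry_columns(
    options: Mapping[str, Mapping[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: validate a geometry_columns mapping and fill defaults."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for name, spec in options.items():
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{name}: unsupported keys {sorted(unknown)}")
        # "encoding" is required and "WKB" is the only documented value.
        if spec.get("encoding") != "WKB":
            raise ValueError(f'{name}: "encoding" is required and must be "WKB"')
        edges = spec.get("edges", "planar")
        if edges not in ALLOWED_EDGES:
            raise ValueError(f"{name}: edges must be one of {sorted(ALLOWED_EDGES)}")
        crs = spec.get("crs", DEFAULT_CRS)
        if isinstance(crs, bool) or not isinstance(crs, (int, str)):
            raise TypeError(f"{name}: crs must be a string or an integer SRID")
        if isinstance(crs, int):
            # Assumption: an integer SRID is interpreted as an EPSG code.
            crs = f"EPSG:{crs}"
        normalized[name] = {"encoding": "WKB", "crs": crs, "edges": edges}
    return normalized
```

Under these assumptions, `normalize_geometry_columns({"geo_bin": {"encoding": "WKB", "crs": 4326}})` yields an entry with `crs` set to `"EPSG:4326"` and `edges` set to `"planar"`. Note that this only checks the option itself; as the Safety note says, the WKB payloads in the column are not validated.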
   
   ---
   
   ## Key Changes
   
   1. Parse the Python option fields into the Rust `GeoParquetColumnMetadata` struct.
   2. In the schema inference step, first infer the geometry columns from the `GeoParquet` metadata as before, then apply the options to add or override geometry columns.
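
The precedence rule in step 2 can be illustrated with a small Python sketch. The actual change is implemented in Rust; the function name `merge_geometry_columns` and the plain-dict representation below are hypothetical. The sketch also assumes that an override replaces the whole metadata entry for a column rather than merging field by field, which the description does not specify.

```python
from typing import Any, Dict, Mapping

ColumnMeta = Dict[str, Any]


def merge_geometry_columns(
    file_metadata: Mapping[str, ColumnMeta],
    overrides: Mapping[str, ColumnMeta],
) -> Dict[str, ColumnMeta]:
    """Hypothetical sketch: combine GeoParquet file metadata with user options."""
    # Start from the columns declared in the file's GeoParquet metadata.
    merged = {name: dict(meta) for name, meta in file_metadata.items()}
    # User-supplied entries add new geometry columns or replace existing ones.
    # Assumption: the whole entry is replaced, not merged key by key.
    for name, meta in overrides.items():
        merged[name] = dict(meta)
    return merged


# Mirrors the example in the specification: geo1 and geo2 are declared in the
# file, geo3 is a plain binary column that the caller marks as geometry.
from_file = {
    "geo1": {"encoding": "WKB", "crs": "EPSG:3857"},
    "geo2": {"encoding": "WKB"},
}
from_options = {
    "geo2": {"encoding": "WKB", "crs": "EPSG:4326"},
    "geo3": {"encoding": "WKB", "crs": "EPSG:4326"},
}
result = merge_geometry_columns(from_file, from_options)
```

Here `geo1` keeps its file metadata, `geo2` takes the caller's entry, and `geo3` becomes a geometry column even though the file metadata does not mention it.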

