paleolimbot commented on code in PR #560:
URL: https://github.com/apache/sedona-db/pull/560#discussion_r2742177033


##########
python/sedonadb/python/sedonadb/context.py:
##########
@@ -134,13 +135,46 @@ def read_parquet(
                 files.
             options: Optional dictionary of options to pass to the Parquet reader.
                 For S3 access, use {"aws.skip_signature": True, "aws.region": "us-west-2"} for anonymous access to public buckets.
+            geometry_columns: Optional mapping of column name to GeoParquet column
+                metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
+                binary WKB columns as geometry columns. Supported keys:

Review Comment:
   ```suggestion
                   metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
                   binary WKB columns as geometry columns or correct metadata such
                   as the column CRS. Supported keys:
   ```



##########
python/sedonadb/python/sedonadb/context.py:
##########
@@ -134,13 +135,46 @@ def read_parquet(
                 files.
             options: Optional dictionary of options to pass to the Parquet reader.
                 For S3 access, use {"aws.skip_signature": True, "aws.region": "us-west-2"} for anonymous access to public buckets.
+            geometry_columns: Optional mapping of column name to GeoParquet column
+                metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
+                binary WKB columns as geometry columns. Supported keys:
+                - encoding: "WKB" (required)
+                - crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326).
+                  If not provided, the default CRS is OGC:CRS84
+                  (https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means
+                  the data in this column must be stored in longitude/latitude
+                  based on the WGS84 datum.
+                - edges: "planar" (default) or "spherical"

Review Comment:
   Perhaps just "See the GeoParquet specification for the required fields for each column".
   
   https://geoparquet.org/releases/v1.1.0/
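   
   For reference, a per-column entry might look something like this (a sketch only; the field names follow the GeoParquet 1.1.0 column metadata, and the exact set of keys supported here is up to this PR):
   
   ```python
   # Hypothetical geometry_columns mapping mirroring GeoParquet 1.1.0
   # column metadata fields ("encoding", "crs", "edges").
   geometry_columns = {
       "geom": {
           "encoding": "WKB",   # required
           "crs": "EPSG:4326",  # or an integer SRID such as 4326
           "edges": "planar",   # "planar" (default) or "spherical"
       }
   }
   ```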



##########
python/sedonadb/python/sedonadb/context.py:
##########
@@ -126,6 +126,7 @@ def read_parquet(
         self,
         table_paths: Union[str, Path, Iterable[str]],
         options: Optional[Dict[str, Any]] = None,
+        geometry_columns: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,

Review Comment:
   I would probably make this just `Optional[Mapping[str, Any]]`. In Python a user can rather easily decode that from JSON if they need to.

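   Roughly (a sketch only; `Context` here is an illustrative stub rather than the real sedonadb context class):
   
   ```python
   import json
   from typing import Any, Mapping, Optional
   
   class Context:  # illustrative stub, not the actual sedonadb context class
       def read_parquet(
           self,
           table_paths,
           options: Optional[Mapping[str, Any]] = None,
           geometry_columns: Optional[Mapping[str, Any]] = None,
       ): ...
   
   # A user holding GeoParquet-style column metadata as JSON can decode it:
   geometry_columns = json.loads('{"geom": {"encoding": "WKB", "crs": 4326}}')
   ```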


##########
python/sedonadb/tests/test_context.py:
##########
@@ -100,6 +104,38 @@ def test_read_parquet_options_parameter(con, geoarrow_data):
     )  # Should be identical (option ignored but not errored)
 
 
+# Basic test for `geometry_columns` option for `read_parquet(..)`
+def test_read_parquet_geometry_columns_roundtrip(con, tmp_path):
+    # Write regular parquet table with wkb column in Binary type
+    geom = shapely.from_wkt("POINT (0 1)").wkb
+    table = pa.table({"id": [1], "geom": [geom]})
+    src = tmp_path / "plain.parquet"
+    pq.write_table(table, src)
+
+    # Check metadata: geoparquet meatadata should not be available

Review Comment:
   ```suggestion
       # Check metadata: geoparquet metadata should not be available
   ```



##########
python/sedonadb/src/context.rs:
##########
@@ -80,6 +203,7 @@ impl InternalContext {
         py: Python<'py>,
         table_paths: Vec<String>,
         options: HashMap<String, PyObject>,
+        geometry_columns: Option<HashMap<String, PyObject>>,

Review Comment:
   ```suggestion
           geometry_columns: Option<String>,
   ```
   
   ...I think JSON is the right format for this particular step (reduces bindings code considerably!)



##########
python/sedonadb/tests/test_context.py:
##########
@@ -100,6 +104,38 @@ def test_read_parquet_options_parameter(con, geoarrow_data):
     )  # Should be identical (option ignored but not errored)
 
 
+# Basic test for `geometry_columns` option for `read_parquet(..)`
+def test_read_parquet_geometry_columns_roundtrip(con, tmp_path):
+    # Write regular parquet table with wkb column in Binary type
+    geom = shapely.from_wkt("POINT (0 1)").wkb
+    table = pa.table({"id": [1], "geom": [geom]})
+    src = tmp_path / "plain.parquet"
+    pq.write_table(table, src)
+
+    # Check metadata: geoparquet meatadata should not be available
+    metadata = pq.read_metadata(src).metadata
+    assert metadata is not None
+    assert b"geo" not in metadata
+
+    # `read_parquet()` with geo type hint produces `geometry` columns, writing it
+    # again will produce GeoParquet file
+    df = con.read_parquet(
+        src, geometry_columns={"geom": {"encoding": "WKB", "crs": 4326}}
+    )
+    tab = df.to_arrow_table()
+    assert tab["geom"].type.extension_name == "geoarrow.wkb"
+
+    out = tmp_path / "geo.parquet"
+    df.to_parquet(out)
+    metadata = pq.read_metadata(out).metadata
+    assert metadata is not None
+    geo = metadata.get(b"geo")
+    assert geo is not None
+    geo_metadata = json.loads(geo.decode("utf-8"))
+    print(json.dumps(geo_metadata, indent=2, sort_keys=True))
+    assert geo_metadata["columns"]["geom"]["crs"] == "EPSG:4326"

Review Comment:
   I think you can probably skip this bit of the test (verifying the geometry-ness and CRS of the input seems reasonable to me).

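   For example, the trimmed test could keep just the read-side assertions (a sketch, assuming the same imports and fixtures as the existing test; an assertion on the column CRS could be added in whatever form the Arrow type exposes it):
   
   ```python
   def test_read_parquet_geometry_columns(con, tmp_path):
       # Plain Parquet file with a binary WKB column and no "geo" metadata.
       geom = shapely.from_wkt("POINT (0 1)").wkb
       src = tmp_path / "plain.parquet"
       pq.write_table(pa.table({"id": [1], "geom": [geom]}), src)
   
       # Reading with the geometry hint should produce a geoarrow.wkb column.
       df = con.read_parquet(
           src, geometry_columns={"geom": {"encoding": "WKB", "crs": 4326}}
       )
       assert df.to_arrow_table()["geom"].type.extension_name == "geoarrow.wkb"
   ```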


##########
rust/sedona-geoparquet/src/provider.rs:
##########
@@ -81,6 +89,7 @@ pub async fn geoparquet_listing_table(
 pub struct GeoParquetReadOptions<'a> {
     inner: ParquetReadOptions<'a>,
     table_options: Option<HashMap<String, String>>,
+    geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,

Review Comment:
   ```suggestion
       geometry_columns: Option<String>,
   ```
   
   ...just keeping this as a String will make this work for SQL, too (easy to import from the `HashMap<String, String>` that DataFusion gives us)



##########
rust/sedona-schema/src/crs.rs:
##########
@@ -51,7 +51,7 @@ pub fn deserialize_crs(crs_str: &str) -> Result<Crs> {
     }
 
     // Handle JSON strings "OGC:CRS84", "EPSG:4326", "{AUTH}:{CODE}" and "0"
-    let crs = if LngLat::is_str_lnglat(crs_str) {
+    let crs = if crs_str == "OGC:CRS84" {

Review Comment:
   These changes should be reverted (there is >1 string that can represent lon/lat)



##########
python/sedonadb/src/context.rs:
##########
@@ -30,6 +34,125 @@ use crate::{
     udf::PySedonaScalarUdf,
 };
 
+fn parse_geometry_columns<'py>(
+    py: Python<'py>,
+    geometry_columns: HashMap<String, PyObject>,
+) -> Result<HashMap<String, GeoParquetColumnMetadata>, PySedonaError> {

Review Comment:
   I think this bit can be avoided by just passing a string at this point (i.e., in Python, use `json.dumps()` before passing to Rust).

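   Roughly, on the Python side (a sketch only; `_serialize_geometry_columns` and `self._impl.read_parquet(...)` are placeholder names, not the actual binding API):
   
   ```python
   import json
   from typing import Any, Mapping, Optional
   
   def _serialize_geometry_columns(
       geometry_columns: Optional[Mapping[str, Any]],
   ) -> Optional[str]:
       # Serialize once in Python so the Rust side only needs Option<String>
       # and can parse it with serde_json instead of walking PyObject values.
       return None if geometry_columns is None else json.dumps(geometry_columns)
   
   # e.g. inside read_parquet():
   #     self._impl.read_parquet(paths, options,
   #                             _serialize_geometry_columns(geometry_columns))
   ```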


##########
rust/sedona-schema/src/crs.rs:
##########
@@ -88,6 +88,10 @@ pub fn deserialize_crs_from_obj(crs_value: &serde_json::Value) -> Result<Crs> {
         }
     }
 
+    if let Some(number) = crs_value.as_number() {

Review Comment:
   This part is OK (but perhaps add a test)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
