jorisvandenbossche commented on PR #45459:
URL: https://github.com/apache/arrow/pull/45459#issuecomment-2749495657

   To confirm, the segfault I mentioned is no longer present with the latest 
change to remove `GeoCrsContext` handling. What triggered it with an older 
version of this branch:
   
   ```python
   # creation of the file
   import geopandas
   import pyarrow as pa
   import pyarrow.feather

   gdf = geopandas.GeoDataFrame(
       geometry=geopandas.points_from_xy([1, 2, 3], [1, 2, 3]), crs="EPSG:3857"
   )
   table = pa.table(gdf.to_arrow())
   pa.feather.write_feather(table, "test_geometry_3857.arrow")
   ```
   
   And then writing that Arrow table to Parquet segfaulted if the geoarrow 
extension type was enabled:
   
   ```python
   In [1]: import pyarrow.feather
   
   In [2]: table = pa.feather.read_table("test_geometry_3857.arrow")
   
   In [3]: table.schema
   Out[3]: 
   geometry: binary
     -- field metadata --
     ARROW:extension:name: 'geoarrow.wkb'
     ARROW:extension:metadata: '{"crs": "{\"$schema\":\"https://proj.org/sch' + 2613
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 373
   
   In [4]: import pyarrow.parquet as pq
   
   # works fine when no geoarrow extension type is enabled
   In [5]: pq.write_table(table, "test_geometry_from_table_no_ext.parquet")
   
   In [6]: import geoarrow.pyarrow as ga
   
   In [7]: table = pa.feather.read_table("test_geometry_3857.arrow")
   
   In [8]: table.schema
   Out[8]: 
   geometry: extension<geoarrow.wkb<WkbType>>
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 373
   
   # segfaults when it is enabled
   In [9]: pq.write_table(table, "test_geometry_from_table_ext.parquet")
   terminate called after throwing an instance of 'parquet::ParquetException'
     what():  Crs encoding 'unknown' is not suppored by GeoCrsContext
   Aborted (core dumped)
   ```
   
   I am not sure whether this is related to geopandas creating invalid CRS 
metadata (the bug you fixed on geopandas main).  
   In any case, the above now works without a segfault on the latest version of 
this branch.
   
   This difference makes me wonder about one other thing, though, which is 
still present: for the Parquet code to "see" a geometry column in the Arrow 
data it is writing, does the column need to be an actual _registered_ extension 
type, so that having the extension metadata in the field metadata is not 
sufficient? Purely per the spec, the metadata is what defines the extension 
type. Someone roundtripping Parquet data (e.g. reading it in, doing some 
filtering, writing it out again) would therefore lose this type information if 
they do not have the extension type registered.
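
To make the distinction concrete, here is a minimal sketch of how one could detect columns that only carry the extension metadata. It assumes what `Out[3]` above shows: when the type is not registered, the `ARROW:extension:name` key stays in the field metadata. The helper name `unresolved_extensions` is hypothetical, and the `Field` namedtuple is only a stand-in so the snippet runs without pyarrow installed.

```python
from collections import namedtuple

# Key under which Arrow stores the extension type name in field metadata.
EXT_NAME_KEY = b"ARROW:extension:name"


def unresolved_extensions(schema):
    """Map field name -> extension name for fields that still carry
    extension metadata, i.e. were not turned into an extension type.

    `schema` is any iterable of objects with `.name` and `.metadata`
    (a dict with bytes keys and values, or None).
    """
    found = {}
    for field in schema:
        metadata = field.metadata or {}
        ext_name = metadata.get(EXT_NAME_KEY)
        if ext_name is not None:
            found[field.name] = ext_name.decode("utf-8")
    return found


# Stand-in for a schema field (hypothetical, for illustration only).
Field = namedtuple("Field", ["name", "metadata"])

demo_schema = [
    Field("geometry", {EXT_NAME_KEY: b"geoarrow.wkb"}),
    Field("id", None),
]

print(unresolved_extensions(demo_schema))  # {'geometry': 'geoarrow.wkb'}
```

With pyarrow, passing `table.schema` should work the same way, assuming the schema iterates over fields that expose `name` and `metadata` as above. A non-empty result before `pq.write_table` would indicate columns whose extension type the writer may not recognize.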
   
   

