jorisvandenbossche commented on PR #45459:
URL: https://github.com/apache/arrow/pull/45459#issuecomment-2749495657
To confirm, the segfault I mentioned is no longer present with the latest
change to remove `GeoCrsContext` handling. What triggered it with an older
version of this branch:
```python
# creation of the file
import geopandas
import pyarrow as pa
import pyarrow.feather
gdf = geopandas.GeoDataFrame(
    geometry=geopandas.points_from_xy([1, 2, 3], [1, 2, 3]), crs="EPSG:3857"
)
table = pa.table(gdf.to_arrow())
pa.feather.write_feather(table, "test_geometry_3857.arrow")
```
Writing that Arrow table to Parquet then segfaulted if the geoarrow
extension type was enabled:
```python
In [1]: import pyarrow as pa, pyarrow.feather
In [2]: table = pa.feather.read_table("test_geometry_3857.arrow")
In [3]: table.schema
Out[3]:
geometry: binary
-- field metadata --
ARROW:extension:name: 'geoarrow.wkb'
ARROW:extension:metadata: '{"crs": "{\"$schema\":\"https://proj.org/sch' + 2613
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 373
In [4]: import pyarrow.parquet as pq
# works fine when no geoarrow extension type is enabled
In [5]: pq.write_table(table, "test_geometry_from_table_no_ext.parquet")
In [6]: import geoarrow.pyarrow as ga
In [7]: table = pa.feather.read_table("test_geometry_3857.arrow")
In [8]: table.schema
Out[8]:
geometry: extension<geoarrow.wkb<WkbType>>
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 373
# segfaults when it is enabled
In [9]: pq.write_table(table, "test_geometry_from_table_ext.parquet")
terminate called after throwing an instance of 'parquet::ParquetException'
what(): Crs encoding 'unknown' is not suppored by GeoCrsContext
Aborted (core dumped)
```
I am not sure whether this is related to geopandas creating invalid CRS
metadata (the bug you fixed on geopandas main). In any case, the above now
works without a segfault on the latest version of this branch.
This difference makes me wonder about one other thing, though, which is still
present: for the Parquet code to "see" a geometry column in the Arrow data it
is writing, does the column need to be an actual _registered_ extension type,
and is having the extension metadata in the field metadata not sufficient?
(Purely per the spec, the metadata is what defines the extension type, so
someone roundtripping Parquet data (e.g. reading in, doing some filtering,
writing out again) would lose this type information if they do not have the
extension type registered.)