paleolimbot commented on issue #2098:
URL: https://github.com/apache/arrow-adbc/issues/2098#issuecomment-2346905822

   For PostGIS, we are reasonably close to being able to export geometry columns 
with the "geoarrow.wkb" extension type. The main barrier that we have is that 
Postgres embeds the spatial reference identifier as an integer key at the row 
level, and GeoArrow embeds the coordinate reference system as a string at the 
type level.
   
   Various tools work around this differently...GeoPandas, for example, pulls 
the SRID out of the first element, assumes that is an EPSG code, and assumes 
that all following values have the same SRID. This is frequently the case; 
however, I would like to do something safer by default at the driver level. 
There are a few options:
   
   - We actually do have the ability to inspect one or more rows of data at 
the time that we pull the first data from COPY (which includes the header and 
the first row of data). We can record the SRID there, put it as type metadata, 
and check that all subsequent items have an identical SRID. This won't work if 
the first item is NULL but that doesn't happen frequently for geometry columns.
   - I think we can get the table/column or object identifier if we're pulling 
straight from a table (or filtered table), in which case we might be able to 
issue a query to the SRID table: 
https://postgis.net/docs/using_postgis_dbmanagement.html#geometry_columns / 
https://postgis.net/docs/using_postgis_dbmanagement.html#spatial_ref_sys_table .
   - For non-COPY queries, we get the whole result at once and we can do the 
same thing, but possibly doing a better job for queries where the first 
value(s) are null.
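   
The record-and-check approach from the first option can be sketched as follows. This is only an illustration in Python, not driver code: `ewkb_srid` and `common_srid` are hypothetical names, and the sketch assumes each value is a complete EWKB blob whose SRID flag is `0x20000000` in the geometry type word.

```python
import struct
from typing import Iterable, Optional

# EWKB sets this bit in the geometry type word when an SRID follows (assumed layout).
EWKB_SRID_FLAG = 0x20000000


def ewkb_srid(blob: bytes) -> Optional[int]:
    """Return the SRID embedded in one EWKB value, or None if it has none."""
    if len(blob) < 5:
        raise ValueError("EWKB value is too short to contain a header")
    # Byte 0 is the byte-order marker: 1 = little endian, 0 = big endian.
    endian = "<" if blob[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", blob, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return None
    (srid,) = struct.unpack_from(endian + "i", blob, 5)
    return srid


def common_srid(values: Iterable[Optional[bytes]]) -> Optional[int]:
    """Return the SRID shared by all non-null values; raise on a mismatch."""
    first = None
    have_first = False
    for blob in values:
        if blob is None:
            continue  # NULL geometries carry no SRID, so skip them
        srid = ewkb_srid(blob)
        if not have_first:
            first, have_first = srid, True
        elif srid != first:
            raise ValueError(f"mixed SRIDs in column: {first} and {srid}")
    return first
```

Because NULL values are skipped, a scan like this also covers the case where the first value(s) are null, at the cost of reading past them before the type metadata can be finalized.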
   
   In PostGIS, the values themselves are emitted as EWKB, which is currently 
"valid" for the `geoarrow.wkb` extension type (although we prefer ISO WKB). 
Reencoding EWKB as ISO WKB is probably not worth it since most readers can 
handle both.
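   
To illustrate the difference between the two encodings, here is a minimal sketch of how a single geometry type code maps from EWKB to ISO WKB. It assumes the usual EWKB flag bits (Z = `0x80000000`, M = `0x40000000`, SRID = `0x20000000`) and the ISO convention of adding 1000 for Z and 2000 for M; `ewkb_type_to_iso` is a hypothetical helper name.

```python
# Flag bits that EWKB packs into the high bits of the type word (assumed layout).
EWKB_Z_FLAG = 0x80000000
EWKB_M_FLAG = 0x40000000
EWKB_SRID_FLAG = 0x20000000


def ewkb_type_to_iso(ewkb_type: int) -> int:
    """Map an EWKB geometry type word to the equivalent ISO WKB type code."""
    # Drop the flag bits to recover the base geometry type (1 = Point, ...).
    iso_type = ewkb_type & 0x0FFFFFFF
    if ewkb_type & EWKB_Z_FLAG:
        iso_type += 1000  # ISO WKB marks Z by adding 1000
    if ewkb_type & EWKB_M_FLAG:
        iso_type += 2000  # ISO WKB marks M by adding 2000
    # The SRID flag has no ISO counterpart; the SRID bytes would be dropped.
    return iso_type
```

A real re-encoder would also have to drop the four SRID bytes and walk every nested geometry, since each one carries its own type code. That full pass over every value is the cost being weighed above.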
   
   For SQLite, the type we're dealing with is probably GeoPackage's GEOMETRY 
(as reported by the `decltype`, which I don't think we inspect at all at the 
moment). This has a similar setup to PostGIS: the GEOMETRY column has an 
embedded item-level SRID but all elements in a single column have the same value 
whose textual representation is embedded in a dedicated table. The BLOB value 
that comes through from GeoPackage is prefixed WKB ( 
https://www.geopackage.org/spec/#gpb_data_blob_format ) which can be stripped 
before appending to the output ( 
https://github.com/paleolimbot/gpkg/blob/4d18da00b753f79b225d0e3ac7dac4e517733b06/src/gpkg.cpp#L196-L232
 ).
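   
Stripping the GeoPackage prefix could look roughly like the sketch below. It is not a port of the linked C++ code. It assumes the header layout described in the GeoPackage spec linked above: magic bytes `GP`, a version byte, a flags byte, a 4-byte `srs_id`, then an optional envelope whose size is selected by three bits of the flags byte. `split_gpkg_blob` is a hypothetical name.

```python
import struct
from typing import Tuple

# Envelope size in bytes, keyed by the 3-bit envelope indicator in the flags byte.
_ENVELOPE_BYTES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def split_gpkg_blob(blob: bytes) -> Tuple[int, bytes]:
    """Split a GeoPackage geometry BLOB into (srs_id, plain WKB bytes)."""
    if len(blob) < 8 or blob[:2] != b"GP":
        raise ValueError("not a GeoPackage geometry BLOB")
    flags = blob[3]
    # Bit 0 of the flags byte gives the byte order of the header integers.
    endian = "<" if flags & 0x01 else ">"
    envelope_code = (flags >> 1) & 0x07
    if envelope_code not in _ENVELOPE_BYTES:
        raise ValueError(f"invalid envelope indicator: {envelope_code}")
    (srs_id,) = struct.unpack_from(endian + "i", blob, 4)
    # WKB starts after the 8-byte fixed header plus the optional envelope.
    wkb_start = 8 + _ENVELOPE_BYTES[envelope_code]
    return srs_id, blob[wkb_start:]
```

The returned `srs_id` is the per-item value; resolving it to a CRS definition still needs a lookup in the dedicated table mentioned above.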
   
   For SQLite, there is also Spatialite, which has its own encoding ( 
https://www.gaia-gis.it/gaia-sins/BLOB-Geometry.html ). Annoyingly, I think 
this also has a decltype of GEOMETRY.
   
   I'll have to double check for Snowflake and BigQuery, but I believe that 
Snowflake emits GeoJSON with item-level SRID and BigQuery emits well-known 
text. BigQuery only deals with global "geographies", so item-level SRID is not 
a problem (i.e. it's always OGC:CRS84).
   
   All of this is possible to stick in ADBC, but there's enough 
geometry-specific formatting that it might be worth figuring out how to make a 
wrapper driver (or make it easy to make a wrapper driver!) so that the 
geometry-specific details could live elsewhere. Also useful would be a driver 
embedded into GDAL (since GDAL already implements all of this logic, including 
export to Arrow and import from Arrow).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
