paleolimbot commented on issue #2098: URL: https://github.com/apache/arrow-adbc/issues/2098#issuecomment-2346905822
For PostGIS, we are reasonably close to being able to export geometry columns with the "geoarrow.wkb" extension type. The main barrier is that Postgres embeds the spatial reference identifier (SRID) as an integer key at the row level, whereas GeoArrow embeds the coordinate reference system as a string at the type level. Various tools work around this differently: GeoPandas, for example, pulls the SRID out of the first element, assumes that it is an EPSG code, and assumes that all following values have the same SRID. This is frequently the case; however, I would like to do something safer by default at the driver level. There are a few options:

- We actually do have the ability to inspect one or more rows of data at the time that we pull the first data from COPY (which includes the header and the first row of data). We can record the SRID there, put it in the type metadata, and check that all subsequent items have an identical SRID. This won't work if the first item is NULL, but that doesn't happen frequently for geometry columns.
- I think we can get the table/column or object identifier if we're pulling straight from a table (or filtered table), in which case we might be able to issue a query to the SRID table: https://postgis.net/docs/using_postgis_dbmanagement.html#geometry_columns / https://postgis.net/docs/using_postgis_dbmanagement.html#spatial_ref_sys_table .
- For non-COPY queries, we get the whole result at once and can do the same thing, possibly doing a better job for queries where the first value(s) are NULL.

In PostGIS, the values themselves are emitted as EWKB, which is currently "valid" for the `geoarrow.wkb` extension type (although we prefer ISO WKB). Re-encoding EWKB as ISO WKB is probably not worth it since most readers can handle both.

For SQLite, the type we're dealing with is probably GeoPackage's GEOMETRY (as reported by the `decltype`, which I don't think we inspect at all at the moment).
This has a similar setup to PostGIS: the GEOMETRY column has an embedded item-level SRID, but all elements in a single column have the same value, and the textual representation of that value is stored in a dedicated table. The BLOB value that comes through from GeoPackage is WKB with a prefix ( https://www.geopackage.org/spec/#gpb_data_blob_format ), and the prefix can be stripped before appending to the output ( https://github.com/paleolimbot/gpkg/blob/4d18da00b753f79b225d0e3ac7dac4e517733b06/src/gpkg.cpp#L196-L232 ).

For SQLite, there is also Spatialite, which has its own encoding ( https://www.gaia-gis.it/gaia-sins/BLOB-Geometry.html ). Annoyingly, I think this also has a decltype of GEOMETRY.

I'll have to double check for Snowflake and BigQuery, but I believe that Snowflake emits GeoJSON with item-level SRID and BigQuery emits well-known text. BigQuery only deals with global "geographies", so item-level SRID is not a problem (i.e., it's always OGC:CRS84).

All of this is possible to stick in ADBC, but there's enough geometry-specific formatting that it might be worth figuring out how to make a wrapper driver (or make it easy to make a wrapper driver!) so that the geometry-specific details could live elsewhere. Also useful would be a driver embedded into GDAL (since GDAL already implements all of this logic, including export to Arrow and import from Arrow).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
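
To make the "record the SRID, then check that every later item matches" idea from the PostGIS options above more concrete, here is a rough Python sketch. It is not the driver's code: the function names `ewkb_srid` and `common_srid` are hypothetical, and it assumes the usual EWKB layout (one byte-order byte, a 32-bit type word whose `0x20000000` bit signals an embedded SRID, then the 32-bit SRID itself).

```python
import struct
from typing import Iterable, Optional

# Assumption: PostGIS EWKB marks an embedded SRID with this bit of the type word.
EWKB_SRID_FLAG = 0x20000000


def ewkb_srid(blob: bytes) -> Optional[int]:
    """Return the SRID embedded in one EWKB value, or None if it has none."""
    if len(blob) < 5:
        raise ValueError("EWKB value is too short to hold a header")
    if blob[0] not in (0, 1):
        raise ValueError("unknown WKB byte-order marker")
    endian = "<" if blob[0] == 1 else ">"  # 1 = little endian, 0 = big endian
    (geom_type,) = struct.unpack_from(endian + "I", blob, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return None
    if len(blob) < 9:
        raise ValueError("SRID flag is set but the SRID is missing")
    (srid,) = struct.unpack_from(endian + "i", blob, 5)
    return srid


def common_srid(values: Iterable[Optional[bytes]]) -> Optional[int]:
    """Return the SRID shared by all non-NULL values; raise if they disagree."""
    seen_any = False
    first: Optional[int] = None
    for value in values:
        if value is None:  # NULL geometries carry no SRID, so skip them
            continue
        srid = ewkb_srid(value)
        if not seen_any:
            first, seen_any = srid, True
        elif srid != first:
            raise ValueError(f"mixed SRIDs in one column: {first} and {srid}")
    return first
```

Because this sketch sees every value, it can skip leading NULLs; a streaming reader that has to fix the type metadata from the first batch would still hit the NULL-first case described above.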

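
For the GeoPackage case, stripping the prefix from the BLOB can be sketched along the following lines. This is an illustration based on my reading of the GeoPackage binary header (magic `GP`, a version byte, a flags byte, a 32-bit `srs_id`, then an optional envelope), not the code linked above; `split_gpkg_blob` is a hypothetical name.

```python
import struct
from typing import Tuple

# Envelope size in bytes for each envelope indicator code in the flags byte
# (0 = none, 1 = XY, 2 = XYZ, 3 = XYM, 4 = XYZM), as read from the GeoPackage spec.
_ENVELOPE_BYTES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def split_gpkg_blob(blob: bytes) -> Tuple[int, bytes]:
    """Split a GeoPackage geometry BLOB into (srs_id, plain WKB bytes)."""
    if len(blob) < 8 or blob[:2] != b"GP":
        raise ValueError("not a GeoPackage geometry blob")
    flags = blob[3]
    endian = "<" if flags & 0x01 else ">"  # bit 0: byte order of header values
    envelope_code = (flags >> 1) & 0x07    # bits 1-3: envelope indicator
    if envelope_code not in _ENVELOPE_BYTES:
        raise ValueError("invalid envelope indicator")
    (srs_id,) = struct.unpack_from(endian + "i", blob, 4)
    wkb_offset = 8 + _ENVELOPE_BYTES[envelope_code]
    if len(blob) < wkb_offset:
        raise ValueError("blob is shorter than its declared header")
    return srs_id, blob[wkb_offset:]
```

The returned `srs_id` could feed the same per-column consistency check as the PostGIS SRID, while the remaining bytes are what would be appended to the `geoarrow.wkb` output.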