Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Mauricio Vargas Fri, 25 Jun 2021 10:18:27 -0700

Dear Jon,

Thanks for sending this. In my previous projects, WKB has worked well with
SQLite, DuckDB and others, at the expense of producing larger columns than
PostGIS does.
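
To make the WKB-in-SQLite case concrete, here is a minimal sketch using only
the Python standard library. It hand-encodes a 2D point as little-endian WKB
(byte-order flag 1, geometry type 1, then x and y as doubles), stores it in a
BLOB column, and decodes it again. The table name `places`, the helper names
and the coordinates are illustrative assumptions, not anything taken from the
spec under discussion or from a real project; in practice a library would do
the encoding.

```python
import sqlite3
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: 1 + 4 + 8 + 8 = 21 bytes."""
    # "<" = little-endian without padding, B = byte-order flag,
    # I = geometry type (1 means Point), d d = x and y as doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Decode a WKB point written in either byte order."""
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError("this sketch only handles WKB Point (type 1)")
    return x, y


# Hypothetical table: one text column and one BLOB column holding WKB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, geom BLOB)")
conn.execute(
    "INSERT INTO places VALUES (?, ?)",
    ("example point", point_to_wkb(-70.65, -33.45)),  # illustrative coordinates
)

name, stored = conn.execute("SELECT name, geom FROM places").fetchone()
decoded = wkb_to_point(stored)
conn.close()

print(name, len(stored), decoded)
```

The same bytes could go unchanged into a binary column of an Arrow table,
which is why the size of the WKB column is the main cost to watch.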


For experimenting, it could be interesting to use the CENSO 2017
shapefiles: https://github.com/ropensci/censo2017-cartografias;
https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
These include rivers, streets, etc.

Given that Arrow installs in a very straightforward way (on Windows, at
least), creating something based on PostGIS is probably not a bad idea, but
WKB works fine and integrates without any problems with the SF package. I
see a clear compression advantage if we decide to use WKB, as LZ4 should
make it very lightweight compared to, say, a CSV.

Best,







On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <[email protected]> wrote:

> Hello,
>
> There is an emerging spec[1] for how to store geospatial data in Arrow
> + pass through parquet files in the geopandas world. There is even a
> new R package that implements a wrapper to do the same in R[2]. These
> both define a serialization[3] for storing geospatial data as an Arrow
> table (and thus also when saving to parquet with Arrow).
>
> I could see a number of ways that we might interact with standards
> like these, and for any of these that we pursue it would be good to
> clarify that in our docs:
>
> 1. Point to the standard — we could mention that this standard exists
> and that if someone is building a geospatial data aware application,
> they _could_ refer to this standard if they want to.
> 2. Adopt a/this standard — this could range from stating that we've
> adopted it as the way that spatial data _ought_ to be stored to asking
> the creators if maintaining it within the Arrow project itself would
> be better (either by adopting it or creating a fork — of course
> communication with the folks working on it now would be critical!)
> 3. Create extension type(s) for geospatial data — this would require
> adopting a standard like the one linked, but on top of that providing
> an extension type within Arrow itself that the various clients could
> implement as they saw fit.
> 4. Create new, fully separate type(s) for geospatial data — again,
> this would require adopting a standard of some sort, but we would
> implement it as a specific type and presumably support it in all of
> the clients as we could.
>
> There are of course pros and cons to all of these. This type of data
> *is* somewhat specialized and I don't think we want to have a huge
> profusion of types for all of the possible specialized data types out
> there. But, at a minimum we should acknowledge (or adopt) a standard
> if it exists and encourage implementations that use Arrow to follow
> that standard (like sfarrow does to be compatible with geopandas) so
> that some level of interoperability is there + people aren't needing
> to reinvent the wheel each time they store spatial data.
>
> Thoughts? Are there other projects out there that already do something
> like this with Arrow that we should consider?
>
> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
> [2] https://github.com/wcjochem/sfarrow
> [3] for now they create a binary WKB column + attach a bit of metadata
> to the schema that that's what happened, though there are other ways
> one could encode this and the spec might include other way(s) to store
> this data in the future.
>
> -Jon
>


-- 
—
*Mauricio 'Pachá' Vargas Sepúlveda*
Site: pacha.dev
Blog: pacha.dev/blog
