jornfranke commented on issue #2586: URL: https://github.com/apache/iceberg/issues/2586#issuecomment-1671940533
I think it would be good to have geospatial support in Apache Iceberg, although it is certainly a more complex feature. While the spatialx-project seems to have done a lot of useful implementation work, it is a bit difficult to use because the changes made to Iceberg are unclear, and I could not find any documentation on how geometry is added to Iceberg. I propose that this documentation is started so this issue can move on.

@badbye does it make sense to you that you or I create a Google Docs (or Cryptpad, https://cryptpad.fr/) document that is viewable by everybody (similar to how other specs are done in Iceberg)? Happy to also help with writing and structuring it.

One could initially structure it as follows:

What benefit does one have from using Apache Iceberg with geospatial data instead of using, for instance, simply [geoparquet](https://geoparquet.org/)? I would think about:
* Support for writing individual rows (this can be useful in streaming scenarios, e.g. Internet of Things devices communicating their position).
* Natively querying geospatial data without manual/error-prone conversion (e.g. using the right CRS when loading etc.) and with much higher performance

... maybe you have some other motivations why you started geolake?

One can also think about other features (I would not add them in the first spec due to complexity):
* Partitioning of data according to spatial (location) criteria (see also: https://github.com/opengeospatial/geoparquet/issues/13#issuecomment-1057437189), which seems to be supported by Geolake (I wonder whether we can instead/additionally use the z-ordering feature of Iceberg to reuse existing Iceberg functionality?)
* Loading/storing rasters (at the moment all proposals, including geoparquet, cover only vector data). This is more complex; the raster should be split into equal small tiles etc.
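To make the z-ordering question above more concrete, here is a minimal Python sketch of how a longitude/latitude pair could be turned into a single Z-order (Morton) key, so that points close together on the map tend to get nearby keys. This is only an illustration under assumptions: the function names `interleave_bits` and `z_order_key`, the 16-bit grid resolution, and the use of WGS84 degrees are all hypothetical choices, and this is not how Iceberg or Geolake implement clustering.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return code


def z_order_key(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair in degrees (assumed WGS84) to a Z-order key.

    Hypothetical helper for illustration only: the world is divided into a
    2**bits x 2**bits grid and the cell indices are bit-interleaved.
    """
    max_cell = (1 << bits) - 1
    # Normalise to [0, 1], clamp out-of-range input, scale to grid indices.
    x = int(min(max((lon + 180.0) / 360.0, 0.0), 1.0) * max_cell)
    y = int(min(max((lat + 90.0) / 180.0, 0.0), 1.0) * max_cell)
    return interleave_bits(x, y, bits)
```

Sorting rows by such a key is one way a spatial layout could reuse generic sort-based clustering instead of a dedicated spatial partition transform; whether that is sufficient for geometries other than points is an open question for the spec discussion.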
I suggest that a public Google Doc is started in which one can add what it would mean for Iceberg to support geospatial data, e.g.:
* Augmentation of the Iceberg Spec (https://iceberg.apache.org/spec/)
  * Update the data types to include Geometry (https://iceberg.apache.org/spec/#schemas-and-data-types). It should probably be based internally on geoarrow (https://github.com/geoarrow/geoarrow/) - or maybe you have some idea based on your Geolake implementation?
  * Requirements for storage formats (I suggest focusing in the initial release only on parquet, as it is the only one for which geoparquet is defined, but future releases could also include avro and orc using a specification similar to geoparquet)
  * ...
* Interdependencies with tools
  * Apache Sedona - how can we make sure that the Geometry column is compatible (should we reuse the Sedona Geometry class? Or should we provide, as a pull request to Sedona, a Spark function that does the conversion?). It seems you provide a solution here: https://github.com/spatialx-project/sedona-iceberg-extension
  * Geopandas - how can we integrate Geopandas (https://geopandas.org/en/stable/) with PyIceberg (https://py.iceberg.apache.org/)
  * ... (e.g. QGIS support - this could be solved once the Geopandas support is solved)
* Planning of a roadmap of features (as said before, I suggest deferring the more complex things to later)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
