csringhofer commented on PR #156: URL: https://github.com/apache/parquet-site/pull/156#issuecomment-3876292387
Reflecting on the discussion about incomplete statistics support: I checked a few implementations, and while writing statistics for geometries seems to be there in general, I haven't found a single implementation of geography with any edge interpolation algorithm. The Rust [implementation](https://github.com/apache/arrow-rs/blob/7dbe58a6e0e18985861db1dfa71507174e838cae/parquet/src/geospatial/accumulator.rs#L151) seems to handle the stats for points (where edge interpolation is not needed) and allows users to inject their own implementation.

> Maybe a more accurate summary is that the column statistics collection is not yet fully integrated into all engines.

I agree in the case of geometry, but I think it would make things clearer to mention that for geography this is incomplete, at least in common open source libraries. The blog post mentions "Spatial statistics" as a core feature and generally mentions geometry and geography side by side, so the reader may assume that statistics support is widely available for both logical types.

This also affects the choice of which type to use: if bounding boxes are not yet available for geography and per-file skipping is critical, then the user should try to build their workload on geometry.

I don't know the status of the statistics implementation for geography, but I haven't seen PRs about it, so my assumption is that it may take significant time before at least spherical interpolation is widely available in Parquet libraries (or extension libraries). I would be happy to be proven wrong :)

Btw the blog was a great read!
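
To illustrate why edge interpolation matters for geography bounding boxes, here is a minimal sketch in Python. It is not taken from any Parquet library; the function names are hypothetical, and it assumes a perfect sphere with edges following great-circle arcs, sampled numerically rather than solved in closed form. It shows that an edge between two points at the same latitude can reach well beyond that latitude, so a bounding box built from the vertices alone would be too small.

```python
import math


def to_unit_vector(lon_deg, lat_deg):
    """Convert lon/lat in degrees to a 3D unit vector on the sphere."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def max_latitude_on_great_circle_edge(p, q, samples=1000):
    """Approximate the highest latitude (degrees) reached along the
    great-circle arc from p to q, where p and q are (lon, lat) tuples.

    Hypothetical helper for illustration: it samples the arc with
    spherical linear interpolation instead of using a closed form.
    """
    a, b = to_unit_vector(*p), to_unit_vector(*q)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two endpoints
    if omega == 0.0:
        return p[1]  # identical points: no edge to interpolate
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-12:
        raise ValueError("antipodal endpoints: the arc is not unique")

    best = -90.0
    for i in range(samples + 1):
        t = i / samples
        weight_a = math.sin((1.0 - t) * omega) / sin_omega
        weight_b = math.sin(t * omega) / sin_omega
        # Only the z component is needed to recover the latitude.
        z = weight_a * a[2] + weight_b * b[2]
        best = max(best, math.degrees(math.asin(max(-1.0, min(1.0, z)))))
    return best


# Both vertices sit at latitude 50, so a vertex-only (planar) bbox has ymax = 50.
start, end = (-60.0, 50.0), (60.0, 50.0)
vertex_ymax = max(start[1], end[1])
edge_ymax = max_latitude_on_great_circle_edge(start, end)
print(vertex_ymax, round(edge_ymax, 2))
```

With these inputs the spherical edge peaks at roughly 67 degrees of latitude, so a reader relying on the vertex-only box could wrongly skip a file whose geography actually intersects the query region.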
