Copilot commented on code in PR #2229: URL: https://github.com/apache/sedona/pull/2229#discussion_r2256122274
########## docs/community/geopandas.md: ########## @@ -0,0 +1,69 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# Geopandas on Sedona + +This folder contains the source code for the GeoPandas component of Apache Sedona. This README.md will outline a few important considerations to consider when contributing changes to this component as a developer. Again **this README.md is targeted towards contributors**, the official documentation is more tailored towards users. Review Comment: The phrase "a few important considerations to consider" contains redundant language. It should be "a few important considerations" or "a few important things to consider". ```suggestion This folder contains the source code for the GeoPandas component of Apache Sedona. This README.md will outline a few important considerations when contributing changes to this component as a developer. Again **this README.md is targeted towards contributors**, the official documentation is more tailored towards users. ``` ########## docs/community/geopandas.md: ########## @@ -0,0 +1,69 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# Geopandas on Sedona + +This folder contains the source code for the GeoPandas component of Apache Sedona. This README.md will outline a few important considerations to consider when contributing changes to this component as a developer. Again **this README.md is targeted towards contributors**, the official documentation is more tailored towards users. + +**General Approach**: This component is built on top of the PySpark Pandas API. The `GeoDataFrame` and `GeoSeries` classes both inherit from pyspark pandas' `ps.DataFrame` and `ps.Series` classes, respectively. When possible, it is generally preferred to leverage pyspark pandas' implementation of a function and extending it from there (i.e. find a way to leverage `super()` rather than copying over parts of logic). The code structure resembles the structure of the [Geopandas repo](https://github.com/geopandas/geopandas). + +**Lazy Evaluation**: Spark uses lazy evaluation. Spark's distributed and lazy nature occasionally comes in the way of implementing functionality in the same way the original GeoPandas library does so. For example, GeoPandas has many checks for invalid crs in many places (e.g `GeoSeries.__init__()`, `set_crs()`). Sedona's implementation for getting the `crs` currently is expensive compared to GeoPandas because it requires us to run an eager `ST_SRID()` query. If we eagerly query for the crs in every initialization of `GeoSeries`, all of our functions calls (e.g `.area()`, etc) would also become eager and would incur a noticeable slowdown, resulting a poor user experience. + +**Maintaining Order**: Because Spark uses distributed data, maintaining the order of data across operations takes extra time and effort. Maintaining order for some operations is not very meaningful. In those cases, it's reasonable to skip the post-sorting to avoid an unnecessary performance hit. Documentation should indicate so accordingly. For example, `sjoin` currently does not maintain the traditional pandas dataframe order after performing the join. This follows the same convention as traditional PySpark Pandas. The user can always post-sort using a separate function such as `sort_index()`, but we should avoid sorting unnecessarily by default. Review Comment: The phrase "indicate so accordingly" is awkward. It should be "indicate this accordingly" or "document this behavior". ```suggestion **Maintaining Order**: Because Spark uses distributed data, maintaining the order of data across operations takes extra time and effort. Maintaining order for some operations is not very meaningful. In those cases, it's reasonable to skip the post-sorting to avoid an unnecessary performance hit. Documentation should document this behavior. For example, `sjoin` currently does not maintain the traditional pandas dataframe order after performing the join. This follows the same convention as traditional PySpark Pandas. The user can always post-sort using a separate function such as `sort_index()`, but we should avoid sorting unnecessarily by default. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
