Re: [PR] [GH-2227] Add Geopandas Developer Guide README.md [sedona]

via GitHub Wed, 06 Aug 2025 02:50:07 -0700


Copilot commented on code in PR #2229:
URL: https://github.com/apache/sedona/pull/2229#discussion_r2256122274



##########
docs/community/geopandas.md:
##########
@@ -0,0 +1,69 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Geopandas on Sedona
+
+This folder contains the source code for the GeoPandas component of Apache 
Sedona. This README.md will outline a few important considerations to consider 
when contributing changes to this component as a developer. Again **this 
README.md is targeted towards contributors**, the official documentation is 
more tailored towards users.

Review Comment:
   The phrase "a few important considerations to consider" contains redundant 
language. It should be "a few important considerations" or "a few important 
things to consider".
   ```suggestion
   This folder contains the source code for the GeoPandas component of Apache 
Sedona. This README.md will outline a few important considerations when 
contributing changes to this component as a developer. Again **this README.md 
is targeted towards contributors**, the official documentation is more tailored 
towards users.
   ```



##########
docs/community/geopandas.md:
##########
@@ -0,0 +1,69 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Geopandas on Sedona
+
+This folder contains the source code for the GeoPandas component of Apache 
Sedona. This README.md will outline a few important considerations to consider 
when contributing changes to this component as a developer. Again **this 
README.md is targeted towards contributors**, the official documentation is 
more tailored towards users.
+
+**General Approach**: This component is built on top of the PySpark Pandas 
API. The `GeoDataFrame` and `GeoSeries` classes both inherit from pyspark 
pandas' `ps.DataFrame` and `ps.Series` classes, respectively. When possible, it 
is generally preferred to leverage pyspark pandas' implementation of a function 
and extending it from there (i.e. find a way to leverage `super()` rather than 
copying over parts of logic). The code structure resembles the structure of the 
[Geopandas repo](https://github.com/geopandas/geopandas).
+
+**Lazy Evaluation**: Spark uses lazy evaluation. Spark's distributed and lazy 
nature occasionally comes in the way of implementing functionality in the same 
way the original GeoPandas library does so. For example, GeoPandas has many 
checks for invalid crs in many places (e.g `GeoSeries.__init__()`, 
`set_crs()`). Sedona's implementation for getting the `crs` currently is 
expensive compared to GeoPandas because it requires us to run an eager 
`ST_SRID()` query. If we eagerly query for the crs in every initialization of 
`GeoSeries`, all of our functions calls (e.g `.area()`, etc) would also become 
eager and would incur a noticeable slowdown, resulting a poor user experience.
+
+**Maintaining Order**: Because Spark uses distributed data, maintaining the 
order of data across operations takes extra time and effort. Maintaining order 
for some operations is not very meaningful. In those cases, it's reasonable to 
skip the post-sorting to avoid an unnecessary performance hit. Documentation 
should indicate so accordingly. For example, `sjoin` currently does not 
maintain the traditional pandas dataframe order after performing the join. This 
follows the same convention as traditional PySpark Pandas. The user can always 
post-sort using a separate function such as `sort_index()`, but we should avoid 
sorting unnecessarily by default.

Review Comment:
   The phrase "indicate so accordingly" is awkward. It should be "indicate this 
accordingly" or "document this behavior".
   ```suggestion
   **Maintaining Order**: Because Spark uses distributed data, maintaining the 
order of data across operations takes extra time and effort. Maintaining order 
for some operations is not very meaningful. In those cases, it's reasonable to 
skip the post-sorting to avoid an unnecessary performance hit. Documentation 
should document this behavior. For example, `sjoin` currently does not maintain 
the traditional pandas dataframe order after performing the join. This follows 
the same convention as traditional PySpark Pandas. The user can always 
post-sort using a separate function such as `sort_index()`, but we should avoid 
sorting unnecessarily by default.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2227] Add Geopandas Developer Guide README.md [sedona]

Reply via email to