Re: [PR] [GH-2227] Add Geopandas Developer Guide README.md [sedona]

via GitHub Wed, 06 Aug 2025 11:41:36 -0700


Copilot commented on code in PR #2229:
URL: https://github.com/apache/sedona/pull/2229#discussion_r2257982866



##########
docs/community/geopandas.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Geopandas on Sedona
+
+This guide outlines a few important considerations when contributing changes 
to the GeoPandas component of Apache Sedona as a developer. Again, **this guide 
is targeted towards contributors**; the official documentation is more tailored 
towards users.
+
+**General Approach**: This component is built on top of the PySpark Pandas 
API. The `GeoDataFrame` and `GeoSeries` classes both inherit from pyspark 
pandas' `ps.DataFrame` and `ps.Series` classes, respectively. When possible, it 
is generally preferred to leverage pyspark pandas' implementation of a function 
and extending it from there (i.e. find a way to leverage `super()` rather than 
copying over parts of logic). The code structure resembles the structure of the 
[Geopandas repo](https://github.com/geopandas/geopandas).
+
+**Lazy Evaluation**: Spark uses lazy evaluation. Spark's distributed and lazy 
nature occasionally comes in the way of implementing functionality in the same 
way the original GeoPandas library does so. For example, GeoPandas has many 
checks for invalid crs in many places (e.g `GeoSeries.__init__()`, 
`set_crs()`). Sedona's implementation for getting the `crs` currently is 
expensive compared to GeoPandas because it requires us to run an eager 
`ST_SRID()` query. If we eagerly query for the crs in every initialization of 
`GeoSeries`, all of our function calls (e.g `.area()`, etc) would also become 
eager and would incur a noticeable slowdown, resulting in a poor user 
experience.
+
+**Maintaining Order**: Because Spark uses distributed data, maintaining the 
order of data across operations takes extra time and effort. Maintaining order 
for some operations is not very meaningful. In those cases, it's reasonable to 
skip the post-sorting to avoid an unnecessary performance hit. Documentation 
should document this behavior. For example, `sjoin` currently does not maintain 
the traditional pandas dataframe order after performing the join. This follows 
the same convention as traditional PySpark Pandas. The user can always 
post-sort using a separate function such as `sort_index()`, but we should avoid 
sorting unnecessarily by default.
+
+**Conventions**: The conventional shorthand for the Sedona Geopandas package 
is `sgpd`. Notice it's the same as the geopandas shorthand (`gpd`), except 
prefixed with an 's'. The conventional short hands for adjacent packages are 
shown below also.
+
+```python
+import pandas as pd
+import geopandas as gpd
+import pyspark.pandas as ps
+import sedona.spark.geopandas as sgpd
+```
+
+**Conversion Methods**: Sedona's Implementation of Geopandas, provides useful 
methods to convert to and from other dataframes using the following methods. 
These apply to both `GeoDataFrame` and `GeoSeries`:

Review Comment:
   There's an unnecessary comma after "Geopandas". It should read "Sedona's 
Implementation of GeoPandas provides useful methods..."
   ```suggestion
   **Conversion Methods**: Sedona's Implementation of Geopandas provides useful 
methods to convert to and from other dataframes using the following methods. 
These apply to both `GeoDataFrame` and `GeoSeries`:
   ```



##########
docs/community/geopandas.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Geopandas on Sedona
+
+This guide outlines a few important considerations when contributing changes 
to the GeoPandas component of Apache Sedona as a developer. Again, **this guide 
is targeted towards contributors**; the official documentation is more tailored 
towards users.
+
+**General Approach**: This component is built on top of the PySpark Pandas 
API. The `GeoDataFrame` and `GeoSeries` classes both inherit from pyspark 
pandas' `ps.DataFrame` and `ps.Series` classes, respectively. When possible, it 
is generally preferred to leverage pyspark pandas' implementation of a function 
and extending it from there (i.e. find a way to leverage `super()` rather than 
copying over parts of logic). The code structure resembles the structure of the 
[Geopandas repo](https://github.com/geopandas/geopandas).
+
+**Lazy Evaluation**: Spark uses lazy evaluation. Spark's distributed and lazy 
nature occasionally comes in the way of implementing functionality in the same 
way the original GeoPandas library does so. For example, GeoPandas has many 
checks for invalid crs in many places (e.g `GeoSeries.__init__()`, 
`set_crs()`). Sedona's implementation for getting the `crs` currently is 
expensive compared to GeoPandas because it requires us to run an eager 
`ST_SRID()` query. If we eagerly query for the crs in every initialization of 
`GeoSeries`, all of our function calls (e.g `.area()`, etc) would also become 
eager and would incur a noticeable slowdown, resulting in a poor user 
experience.
+
+**Maintaining Order**: Because Spark uses distributed data, maintaining the 
order of data across operations takes extra time and effort. Maintaining order 
for some operations is not very meaningful. In those cases, it's reasonable to 
skip the post-sorting to avoid an unnecessary performance hit. Documentation 
should document this behavior. For example, `sjoin` currently does not maintain 
the traditional pandas dataframe order after performing the join. This follows 
the same convention as traditional PySpark Pandas. The user can always 
post-sort using a separate function such as `sort_index()`, but we should avoid 
sorting unnecessarily by default.
+
+**Conventions**: The conventional shorthand for the Sedona Geopandas package 
is `sgpd`. Notice it's the same as the geopandas shorthand (`gpd`), except 
prefixed with an 's'. The conventional short hands for adjacent packages are 
shown below also.
+
+```python
+import pandas as pd
+import geopandas as gpd
+import pyspark.pandas as ps
+import sedona.spark.geopandas as sgpd
+```
+
+**Conversion Methods**: Sedona's Implementation of Geopandas, provides useful 
methods to convert to and from other dataframes using the following methods. 
These apply to both `GeoDataFrame` and `GeoSeries`:
+
+- `to_geopandas()`: Sedona Geo(DataFrame/Series) to Geopandas
+- `to_geoframe()`: Sedona GeoSeries to Sedona GeoDataFrame
+- `to_spark_pandas()`: Sedona Geo(DataFrame/Series) to Pandas on PySpark
+- `to_spark()` (inherited): Sedona GeoDataFrame to Spark DataFrame
+- `to_frame()` (inherited): Sedona GeoSeries to PySpark Pandas DataFrame
+
+**GeoSeries functions**: Most geometry manipulation operations in Geopandas 
are considered GeoSeries functions. However, we can call them from a 
`GeoDataFrame` object as well to execute on it's active geometry column. We 
implement the functions in the `GeoSeries` class. However in `base.py`, we add 
a `_delegate_to_geometry_column()` call to allow the `GeoDataFrame` to also 
execute the function on it's active geometry column. We also specify the 
docstring for the function here instead of `GeoSeries`, so that both 
`GeoDataFrame` and `GeoSeries` will inherit the shared docstring (avoiding 
duplicated docstrings).

Review Comment:
   "it's" should be "its" (possessive form, not contraction). This appears 
twice in the sentence.
   ```suggestion
   **GeoSeries functions**: Most geometry manipulation operations in Geopandas 
are considered GeoSeries functions. However, we can call them from a 
`GeoDataFrame` object as well to execute on its active geometry column. We 
implement the functions in the `GeoSeries` class. However in `base.py`, we add 
a `_delegate_to_geometry_column()` call to allow the `GeoDataFrame` to also 
execute the function on its active geometry column. We also specify the 
docstring for the function here instead of `GeoSeries`, so that both 
`GeoDataFrame` and `GeoSeries` will inherit the shared docstring (avoiding 
duplicated docstrings).
   ```



##########
docs/community/geopandas.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Geopandas on Sedona
+
+This guide outlines a few important considerations when contributing changes 
to the GeoPandas component of Apache Sedona as a developer. Again, **this guide 
is targeted towards contributors**; the official documentation is more tailored 
towards users.
+
+**General Approach**: This component is built on top of the PySpark Pandas 
API. The `GeoDataFrame` and `GeoSeries` classes both inherit from pyspark 
pandas' `ps.DataFrame` and `ps.Series` classes, respectively. When possible, it 
is generally preferred to leverage pyspark pandas' implementation of a function 
and extending it from there (i.e. find a way to leverage `super()` rather than 
copying over parts of logic). The code structure resembles the structure of the 
[Geopandas repo](https://github.com/geopandas/geopandas).
+
+**Lazy Evaluation**: Spark uses lazy evaluation. Spark's distributed and lazy 
nature occasionally comes in the way of implementing functionality in the same 
way the original GeoPandas library does so. For example, GeoPandas has many 
checks for invalid crs in many places (e.g `GeoSeries.__init__()`, 
`set_crs()`). Sedona's implementation for getting the `crs` currently is 
expensive compared to GeoPandas because it requires us to run an eager 
`ST_SRID()` query. If we eagerly query for the crs in every initialization of 
`GeoSeries`, all of our function calls (e.g `.area()`, etc) would also become 
eager and would incur a noticeable slowdown, resulting in a poor user 
experience.
+
+**Maintaining Order**: Because Spark uses distributed data, maintaining the 
order of data across operations takes extra time and effort. Maintaining order 
for some operations is not very meaningful. In those cases, it's reasonable to 
skip the post-sorting to avoid an unnecessary performance hit. Documentation 
should document this behavior. For example, `sjoin` currently does not maintain 
the traditional pandas dataframe order after performing the join. This follows 
the same convention as traditional PySpark Pandas. The user can always 
post-sort using a separate function such as `sort_index()`, but we should avoid 
sorting unnecessarily by default.
+
+**Conventions**: The conventional shorthand for the Sedona Geopandas package 
is `sgpd`. Notice it's the same as the geopandas shorthand (`gpd`), except 
prefixed with an 's'. The conventional short hands for adjacent packages are 
shown below also.
+
+```python
+import pandas as pd
+import geopandas as gpd
+import pyspark.pandas as ps
+import sedona.spark.geopandas as sgpd
+```
+
+**Conversion Methods**: Sedona's Implementation of Geopandas, provides useful 
methods to convert to and from other dataframes using the following methods. 
These apply to both `GeoDataFrame` and `GeoSeries`:
+
+- `to_geopandas()`: Sedona Geo(DataFrame/Series) to Geopandas
+- `to_geoframe()`: Sedona GeoSeries to Sedona GeoDataFrame
+- `to_spark_pandas()`: Sedona Geo(DataFrame/Series) to Pandas on PySpark
+- `to_spark()` (inherited): Sedona GeoDataFrame to Spark DataFrame
+- `to_frame()` (inherited): Sedona GeoSeries to PySpark Pandas DataFrame
+
+**GeoSeries functions**: Most geometry manipulation operations in Geopandas 
are considered GeoSeries functions. However, we can call them from a 
`GeoDataFrame` object as well to execute on it's active geometry column. We 
implement the functions in the `GeoSeries` class. However in `base.py`, we add 
a `_delegate_to_geometry_column()` call to allow the `GeoDataFrame` to also 
execute the function on it's active geometry column. We also specify the 
docstring for the function here instead of `GeoSeries`, so that both 
`GeoDataFrame` and `GeoSeries` will inherit the shared docstring (avoiding 
duplicated docstrings).
+
+**Explain Query Plans**: Because these dataframe abstractions are built on 
Spark, we can retrieve the query plan for an operation for a Dataframe by using 
the `.spark.explain()` method.
+
+Example:
+
+```python
+geoseries = GeoSeries([Polygon([(0, 0), (1, 0), (1, 1), (0, 0)])])
+# Currently PySpark pandas Series does not have the spark.explain() method, so 
a workaround is to convert it to a dataframe first
+print(geoseries.area.to_frame().spark.explain(extended=True))
+```
+
+```
+== Parsed Logical Plan ==
+Project [__index_level_0__#19L, 0#27 AS None#31]
++- Project [ **org.apache.spark.sql.sedona_sql.expressions.ST_Area**   AS 
0#27, __index_level_0__#19L, __natural_order__#23L]
+   +- Project [__index_level_0__#19L, 0#20, monotonically_increasing_id() AS 
__natural_order__#23L]
+      +- LogicalRDD [__index_level_0__#19L, 0#20], false
+
+== Analyzed Logical Plan ==
+...
+
+== Optimized Logical Plan ==
+...
+
+== Physical Plan ==
+Project [__index_level_0__#19L,  
**org.apache.spark.sql.sedona_sql.expressions.ST_Area**   AS None#31]
++- *(1) Scan ExistingRDD[__index_level_0__#19L,0#20]
+```
+
+## Suggested Short Readings:
+
+- [PySpark Pandas Best 
Practices](https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/best_practices.html)
 - This mentions other useful notes to consider such as why `__iter__()` is not 
supported
+- [Geopandas User 
Guide](https://geopandas.org/en/stable/docs/user_guide/data_structures.html) - 
Specifically it's useful to understand the GeoDataFrame's "active geometry 
column"

Review Comment:
   [nitpick] "it's" should be "it is" or restructured to "Specifically, it is 
useful..." for better clarity in documentation.
   ```suggestion
   - [Geopandas User 
Guide](https://geopandas.org/en/stable/docs/user_guide/data_structures.html) - 
Specifically, it is useful to understand the GeoDataFrame's "active geometry 
column"
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2227] Add Geopandas Developer Guide README.md [sedona]

Reply via email to