zhangfengcdt commented on code in PR #2150:
URL: https://github.com/apache/sedona/pull/2150#discussion_r2229146624
##########
python/sedona/geopandas/geodataframe.py:
##########
@@ -1552,7 +1484,7 @@ def buffer(
mitre_limit=5.0,
single_sided=False,
**kwargs,
- ) -> GeoDataFrame:
+ ) -> sgpd.GeoSeries:
Review Comment:
This does not seem to be consistent with the comment below: Returns a
GeoDataFrame with all ...
##########
python/sedona/geopandas/io.py:
##########
@@ -0,0 +1,242 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+from typing import Union
+import warnings
+import pyspark.pandas as ps
+from sedona.geopandas import GeoDataFrame
+from pyspark.pandas.utils import default_session, scol_for
+from pyspark.pandas.internal import SPARK_DEFAULT_INDEX_NAME, NATURAL_ORDER_COLUMN_NAME
+from pyspark.pandas.frame import InternalFrame
+from pyspark.pandas.utils import validate_mode, log_advice
+from pandas.api.types import is_integer_dtype
+
+
+def _to_file(
+ df: GeoDataFrame,
+ path: str,
+ driver: Union[str, None] = None,
+ index: Union[bool, None] = True,
+ **kwargs,
+):
+ """
+ Write the ``GeoDataFrame`` to a file.
+
+ Parameters
+ ----------
+ path : string
+ File path or file handle to write to.
+ driver : string, default None
+ The format driver used to write the file.
+ If not specified, it attempts to infer it from the file extension.
+ If no extension is specified, Sedona will error.
+ Options:
+ - "geojson"
+ - "geopackage"
+ - "geoparquet"
+ schema : dict, default None
+ Not applicable to Sedona's implementation
+ index : bool, default None
+ If True, write index into one or more columns (for MultiIndex).
+ Default None writes the index into one or more columns only if
+ the index is named, is a MultiIndex, or has a non-integer data
+ type. If False, no index is written.
+ mode : string, default 'w'
+ The write mode, 'w' to overwrite the existing file and 'a' to append.
+ 'overwrite' and 'append' are equivalent to 'w' and 'a' respectively.
+ crs : pyproj.CRS, default None
+ If specified, the CRS is passed to Fiona to
+ better control how the file is written. If None, GeoPandas
+ will determine the crs based on crs df attribute.
+ The value can be anything accepted
+ by :meth:`pyproj.CRS.from_user_input()
<pyproj.crs.CRS.from_user_input>`,
+ such as an authority string (eg "EPSG:4326") or a WKT string.
+ engine : str
+ Not applicable to Sedona's implementation
+ metadata : dict[str, str], default None
+ Optional metadata to be stored in the file. Keys and values must be
+ strings. Supported only for "GPKG" driver. Not supported by Sedona
+ **kwargs :
+ Keyword args to be passed to the engine, and can be used to write
+ to multi-layer data, store data within archives (zip files), etc.
+ In case of the "pyogrio" engine, the keyword arguments are passed to
+ `pyogrio.write_dataframe`. In case of the "fiona" engine, the keyword
+ arguments are passed to fiona.open`. For more information on possible
+ keywords, type: ``import pyogrio; help(pyogrio.write_dataframe)``.
+
+ Examples
+ --------
+
+ >>> gdf = GeoDataFrame({"geometry": [Point(0, 0), LineString([(0, 0), (1,
1)])], "int": [1, 2]}
+ >>> gdf.to_file(filepath, format="geoparquet")
+
+ With selected drivers you can also append to a file with `mode="a"`:
+
+ >>> gdf.to_file(gdf, driver="geojson", mode="a")
+
+ When the index is of non-integer dtype, index=None (default) is treated as
True, writing the index to the file.
+
+ >>> gdf = GeoDataFrame({"geometry": [Point(0, 0)]}, index=["a", "b"])
+ >>> gdf.to_file(gdf, driver="geoparquet")
+ """
+
+ ext_to_driver = {
+ ".parquet": "Parquet",
+ ".json": "GeoJSON",
+ ".geojson": "GeoJSON",
+ }
+
+ # auto detect driver from filename if not provided
+ if driver is None:
+ _, extension = os.path.splitext(path)
+ if extension not in ext_to_driver:
+ raise ValueError(f"Unsupported file extension: {extension}")
+ driver = ext_to_driver[extension]
+
+ spark_fmt = driver.lower()
+
+ crs = kwargs.pop("crs", None)
+ if crs:
+ from pyproj import CRS
+
+ crs = CRS.from_user_input(crs)
+
+ spark_df = df._internal.spark_frame.drop(NATURAL_ORDER_COLUMN_NAME)
+
+ if index is None:
+ # Determine if index attribute(s) should be saved to file
+ # (only if they are named or are non-integer)
+        index = list(df.index.names) != [None] or not is_integer_dtype(df.index.dtype)
+
+ if not index:
+ log_advice(
+ "If index is not True is not specified for `to_file`, "
+ "the existing index is lost when writing to a file."
+ )
+ spark_df = spark_df.drop(SPARK_DEFAULT_INDEX_NAME)
+
+ if spark_fmt == "geoparquet":
+ writer = spark_df.write.format("geoparquet")
+
+ # if not saving index we sort by GeoHash to optimize reading
+ if not index and df.active_geometry_name:
+ from sedona.spark import ST_GeoHash
+
+            spark_df = spark_df.orderBy(ST_GeoHash(df.geometry.spark.column, 5))
Review Comment:
Why do we need to hard-code 5 here? Consider making this configurable via an
argument.
##########
python/sedona/geopandas/io.py:
##########
@@ -0,0 +1,242 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+from typing import Union
+import warnings
+import pyspark.pandas as ps
+from sedona.geopandas import GeoDataFrame
+from pyspark.pandas.utils import default_session, scol_for
+from pyspark.pandas.internal import SPARK_DEFAULT_INDEX_NAME, NATURAL_ORDER_COLUMN_NAME
+from pyspark.pandas.frame import InternalFrame
+from pyspark.pandas.utils import validate_mode, log_advice
+from pandas.api.types import is_integer_dtype
+
+
+def _to_file(
+ df: GeoDataFrame,
+ path: str,
+ driver: Union[str, None] = None,
+ index: Union[bool, None] = True,
+ **kwargs,
+):
+ """
+ Write the ``GeoDataFrame`` to a file.
+
+ Parameters
+ ----------
+ path : string
+ File path or file handle to write to.
+ driver : string, default None
+ The format driver used to write the file.
+ If not specified, it attempts to infer it from the file extension.
+ If no extension is specified, Sedona will error.
+ Options:
+ - "geojson"
+ - "geopackage"
+ - "geoparquet"
+ schema : dict, default None
+ Not applicable to Sedona's implementation
+ index : bool, default None
+ If True, write index into one or more columns (for MultiIndex).
+ Default None writes the index into one or more columns only if
+ the index is named, is a MultiIndex, or has a non-integer data
+ type. If False, no index is written.
+ mode : string, default 'w'
+ The write mode, 'w' to overwrite the existing file and 'a' to append.
+ 'overwrite' and 'append' are equivalent to 'w' and 'a' respectively.
+ crs : pyproj.CRS, default None
+ If specified, the CRS is passed to Fiona to
+ better control how the file is written. If None, GeoPandas
+ will determine the crs based on crs df attribute.
+ The value can be anything accepted
+ by :meth:`pyproj.CRS.from_user_input()
<pyproj.crs.CRS.from_user_input>`,
+ such as an authority string (eg "EPSG:4326") or a WKT string.
+ engine : str
+ Not applicable to Sedona's implementation
+ metadata : dict[str, str], default None
+ Optional metadata to be stored in the file. Keys and values must be
+ strings. Supported only for "GPKG" driver. Not supported by Sedona
+ **kwargs :
+ Keyword args to be passed to the engine, and can be used to write
+ to multi-layer data, store data within archives (zip files), etc.
+ In case of the "pyogrio" engine, the keyword arguments are passed to
+ `pyogrio.write_dataframe`. In case of the "fiona" engine, the keyword
+ arguments are passed to fiona.open`. For more information on possible
+ keywords, type: ``import pyogrio; help(pyogrio.write_dataframe)``.
+
+ Examples
+ --------
+
+ >>> gdf = GeoDataFrame({"geometry": [Point(0, 0), LineString([(0, 0), (1,
1)])], "int": [1, 2]}
+ >>> gdf.to_file(filepath, format="geoparquet")
+
+ With selected drivers you can also append to a file with `mode="a"`:
+
+ >>> gdf.to_file(gdf, driver="geojson", mode="a")
+
+ When the index is of non-integer dtype, index=None (default) is treated as
True, writing the index to the file.
+
+ >>> gdf = GeoDataFrame({"geometry": [Point(0, 0)]}, index=["a", "b"])
+ >>> gdf.to_file(gdf, driver="geoparquet")
+ """
+
+ ext_to_driver = {
+ ".parquet": "Parquet",
+ ".json": "GeoJSON",
+ ".geojson": "GeoJSON",
+ }
+
+ # auto detect driver from filename if not provided
+ if driver is None:
+ _, extension = os.path.splitext(path)
+ if extension not in ext_to_driver:
+ raise ValueError(f"Unsupported file extension: {extension}")
+ driver = ext_to_driver[extension]
+
+ spark_fmt = driver.lower()
+
+ crs = kwargs.pop("crs", None)
+ if crs:
+ from pyproj import CRS
+
+ crs = CRS.from_user_input(crs)
+
+ spark_df = df._internal.spark_frame.drop(NATURAL_ORDER_COLUMN_NAME)
+
+ if index is None:
+ # Determine if index attribute(s) should be saved to file
+ # (only if they are named or are non-integer)
+        index = list(df.index.names) != [None] or not is_integer_dtype(df.index.dtype)
+
+ if not index:
+ log_advice(
+ "If index is not True is not specified for `to_file`, "
+ "the existing index is lost when writing to a file."
+ )
+ spark_df = spark_df.drop(SPARK_DEFAULT_INDEX_NAME)
+
+ if spark_fmt == "geoparquet":
+ writer = spark_df.write.format("geoparquet")
+
+ # if not saving index we sort by GeoHash to optimize reading
+ if not index and df.active_geometry_name:
+ from sedona.spark import ST_GeoHash
+
+            spark_df = spark_df.orderBy(ST_GeoHash(df.geometry.spark.column, 5))
+
+ elif spark_fmt == "geojson":
+ writer = spark_df.write.format("geojson")
+
+ else:
+ raise ValueError(f"Unsupported spark format: {spark_fmt}")
+
+ default_mode = "overwrite"
+ mode = validate_mode(kwargs.pop("mode", default_mode))
+
+ writer.mode(mode).save(path, **kwargs)
+
+
+def read_file(filename: str, format: Union[str, None] = None, **kwargs):
+ """
+ Alternate constructor to create a ``GeoDataFrame`` from a file.
+
+ Parameters
+ ----------
+ filename : str
+ File path or file handle to read from. If the path is a directory,
+ Sedona will read all files in the directory into a dataframe.
+ format : str, default None
+ The format of the file to read. If None, Sedona will infer the format
+ from the file extension. Note, inferring the format from the file
extension
+ is not supported for directories.
+ Options:
+ - "shapefile"
+ - "geojson"
+ - "geopackage"
+ - "geoparquet"
+
+ table_name : str, default None
+ The name of the table to read from a geopackage file. Required if
format is geopackage.
+
+ See also
+ --------
+ GeoDataFrame.to_file : write GeoDataFrame to file
+ """
+
+ # We warn the user if they try to use arguments that geopandas supports
but not Sedona
+ if kwargs:
+ warnings.warn(f"The given arguments are not supported in Sedona:
{kwargs}")
+
+ spark = default_session()
+
+ # If format is not specified, infer it from the file extension
+ if format is None:
+ if os.path.isdir(filename):
+            raise ValueError(
+                f"Inferring the format from the file extension is not supported for directories: {filename}"
+            )
+ if filename.endswith(".shp"):
Review Comment:
Can we make these matches case-insensitive?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]