paleolimbot opened a new pull request, #661:
URL: https://github.com/apache/sedona-db/pull/661
This is just a check against https://github.com/georust/geo-index/pull/157 to ensure that the fix is correct and that it actually solves the motivating use case.

```python
import sedona.db

sd = sedona.db.connect()
sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_elevation.parquet"
).to_view("elevation")
sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.parquet"
).to_view("water_point")

sd.sql("""
SELECT
    water_point."OBJECTID",
    water_point.geometry,
    elevation."ZVALUE"
FROM water_point
INNER JOIN elevation
    ON ST_KNN(
        ST_Transform(water_point.geometry, 26920),
        ST_Transform(elevation.geometry, 26920),
        1,
        false
    )
""").to_parquet("foofy.parquet")
```

...now completes. It is, however, very slow compared to GeoPandas (1m 30s vs 16s).

<details>

```python
import pandas as pd
import geopandas as gpd
import sedona.db

sd = sedona.db.connect()
sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_elevation.parquet"
).to_view("elevation")
sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.parquet"
).to_view("water_point")

# 10s
elevation = sd.view("elevation").to_pandas()
water_point = sd.view("water_point").to_pandas()

# 5s
elevation = elevation.to_crs(26920)
water_point = water_point.to_crs(26920)

# 1.9s
water_point_with_elevation = gpd.sjoin_nearest(
    water_point, elevation, how="left", distance_col="distance"
)

# 1m 30s
sd.sql("""
SELECT
    water_point."OBJECTID",
    water_point.geometry,
    elevation."ZVALUE"
FROM water_point
INNER JOIN elevation
    ON ST_KNN(
        ST_Transform(water_point.geometry, 26920),
        ST_Transform(elevation.geometry, 26920),
        1,
        false
    )
""").to_memtable().to_view("joined")

# The results are similar but there are a different number of results
water_point_with_elevation_gpd = water_point_with_elevation[
    ["OBJECTID", "geometry", "ZVALUE_left", "ZVALUE_right"]
].sort_values("OBJECTID")
water_point_with_elevation_sdb = sd.view("joined").to_pandas().sort_values("OBJECTID")

print(water_point_with_elevation_sdb.ZVALUE.to_numpy()[:10])
#> [2.9 1.3 2.5 2.2 0.6 3.5 1.5 1.  0.7 0.7]
print(water_point_with_elevation_gpd.ZVALUE_right.to_numpy()[:10])
#> [2.9 1.3 2.5 2.5 2.2 0.6 3.5 1.5 1.  0.7]

pd.testing.assert_series_equal(
    water_point_with_elevation_sdb.ZVALUE,
    water_point_with_elevation_gpd.ZVALUE_right,
    check_index=False
)
#> AssertionError: Series are different
#> Series length are different
#> [left]:  44690
#> [right]: 45359
```

</details>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
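A possible explanation for the differing lengths (44690 vs 45359), not verified against this data: `gpd.sjoin_nearest` emits one output row for every equidistant nearest neighbour, whereas `ST_KNN` with `k = 1` would be expected to return a single match per left row. The repeated `2.5` in the GeoPandas output above is consistent with that. The sketch below uses a small hypothetical pandas frame (the values are made up, not taken from the dataset) to show how tied matches could be counted and collapsed to one row per `OBJECTID` before comparing.

```python
import pandas as pd

# Toy stand-in for the sjoin_nearest result (hypothetical values):
# OBJECTID 3 has two equidistant nearest neighbours, so it appears twice.
gpd_result = pd.DataFrame(
    {
        "OBJECTID": [1, 2, 3, 3, 4],
        "ZVALUE_right": [2.9, 1.3, 2.5, 2.5, 2.2],
        "distance": [0.5, 1.0, 2.0, 2.0, 0.1],
    }
)

# Count the left rows that matched more than one neighbour.
matches_per_point = gpd_result.groupby("OBJECTID").size()
n_tied = int((matches_per_point > 1).sum())

# Keep a single match per left row to mimic k = 1 semantics.
gpd_one_per_point = gpd_result.drop_duplicates(subset="OBJECTID", keep="first")

print(n_tied, len(gpd_result), len(gpd_one_per_point))
```

If the deduplicated GeoPandas frame had the same length as the SedonaDB result, ties would account for the whole difference; which of the tied neighbours each engine keeps could still differ, so an element-wise comparison of `ZVALUE` might fail on tied rows only.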
