paleolimbot opened a new pull request, #661:
URL: https://github.com/apache/sedona-db/pull/661

   This is just a check against https://github.com/georust/geo-index/pull/157 
to ensure that the fix is correct and that it actually solves the motivating 
use case.
   
   ```python
   import sedona.db
   
   sd = sedona.db.connect()
   
   sd.read_parquet(
       "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_elevation.parquet"
   ).to_view("elevation")
   
   sd.read_parquet(
       "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.parquet"
   ).to_view("water_point")
   
   sd.sql("""
     SELECT water_point."OBJECTID", water_point.geometry, elevation."ZVALUE"
     FROM water_point
     INNER JOIN elevation ON ST_KNN(
       ST_Transform(water_point.geometry, 26920),
       ST_Transform(elevation.geometry, 26920),
       1,
       false
     )
   """).to_parquet("foofy.parquet")
   ```
   
   ...now completes. It is, however, very slow compared to GeoPandas (1m 30s vs 
16s).
   
   <details>
   
   ```python
   import pandas as pd
   import geopandas as gpd
   import sedona.db
   
   sd = sedona.db.connect()
   
   sd.read_parquet(
       "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_elevation.parquet"
   ).to_view("elevation")
   
   sd.read_parquet(
       "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.parquet"
   ).to_view("water_point")
   
   # 10s: pull both views into pandas
   elevation = sd.view("elevation").to_pandas()
   water_point = sd.view("water_point").to_pandas()
   
   # 5s: reproject both layers to EPSG:26920
   elevation = elevation.to_crs(26920)
   water_point = water_point.to_crs(26920)
   
   # 1.9s: nearest-neighbour join in GeoPandas
   water_point_with_elevation = gpd.sjoin_nearest(
       water_point, elevation, how="left", distance_col="distance"
   )
   
   # 1m 30s: the same join via ST_KNN in SedonaDB
   sd.sql("""
     SELECT water_point."OBJECTID", water_point.geometry, elevation."ZVALUE"
     FROM water_point
     INNER JOIN elevation ON ST_KNN(
       ST_Transform(water_point.geometry, 26920),
       ST_Transform(elevation.geometry, 26920),
       1,
       false
     )
   """).to_memtable().to_view("joined")
   
   
   # The values are similar, but the two results have different numbers of rows
   water_point_with_elevation_gpd = water_point_with_elevation[
       ["OBJECTID", "geometry", "ZVALUE_left", "ZVALUE_right"]
   ].sort_values("OBJECTID")
   water_point_with_elevation_sdb = sd.view("joined").to_pandas().sort_values("OBJECTID")
   
   print(water_point_with_elevation_sdb.ZVALUE.to_numpy()[:10])
   #> [2.9 1.3 2.5 2.2 0.6 3.5 1.5 1.  0.7 0.7]
   print(water_point_with_elevation_gpd.ZVALUE_right.to_numpy()[:10])
   #> [2.9 1.3 2.5 2.5 2.2 0.6 3.5 1.5 1.  0.7]
   
   pd.testing.assert_series_equal(
       water_point_with_elevation_sdb.ZVALUE,
       water_point_with_elevation_gpd.ZVALUE_right,
       check_index=False
   )
   #> AssertionError: Series are different
   #> Series length are different
   #> [left]:  44690
   #> [right]: 45359
   ```
   
   </details>
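
One way to narrow down where the 669 extra rows (45359 vs 44690) come from is to count rows per `OBJECTID` on each side and list only the keys whose counts differ. The repeated `2.5` in the GeoPandas output hints at more than one row for a single `OBJECTID`, which would be consistent with `gpd.sjoin_nearest` returning every equally near neighbour, but that is a guess and not something verified here. The sketch below is a minimal, standard-library-only helper; `key_count_differences` is a hypothetical name and the key lists are toy stand-ins, not values from the real datasets.

```python
from collections import Counter


def key_count_differences(left_keys, right_keys):
    """Return {key: (n_left, n_right)} for keys whose row counts differ."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Counter returns 0 for missing keys, so keys present on one side only
    # show up with a count of 0 on the other side.
    return {
        key: (left_counts[key], right_counts[key])
        for key in sorted(left_counts.keys() | right_counts.keys())
        if left_counts[key] != right_counts[key]
    }


# Toy stand-ins for the OBJECTID columns of the two results (not real data).
sdb_keys = [1, 2, 3]
gpd_keys = [1, 2, 3, 3, 4]

diff = key_count_differences(sdb_keys, gpd_keys)
print(diff)
#> {3: (1, 2), 4: (0, 1)}
```

Against the real results, this could be called as `key_count_differences(water_point_with_elevation_sdb["OBJECTID"], water_point_with_elevation_gpd["OBJECTID"])`. Keys with a count of 0 on the SedonaDB side would point at rows kept only by the `how="left"` join, and keys with a higher count on the GeoPandas side would point at duplicated matches.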


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.