Copilot commented on code in PR #30:
URL:
https://github.com/apache/sedona-spatialbench/pull/30#discussion_r2371142231
##########
docs/queries.ipynb:
##########
@@ -420,7 +677,7 @@
" ST_GeomFromWKB(t.t_pickuploc),\n",
" ST_GeomFromWKB(t.t_dropoffloc)\n",
" )\n",
- " ) / 0.000009 AS line_distance_m -- 1 meter = 0.000009 degree\n",
+ " ) * 111111 AS line_distance_m -- Approx. meters per degree\n",
Review Comment:
The magic number 111111 for meters per degree conversion is not accurate for
all latitudes. Consider using a more precise conversion formula or adding a
comment explaining the approximation's limitations.
##########
docs/queries.md:
##########
@@ -0,0 +1,612 @@
+# Run the SpatialBench Queries
+
+This notebook contains the queries that make up the SpatialBench benchmark.
+
+SpatialBench is a benchmark for assessing geospatial SQL analytics query
performance across database systems. The queries represent common, real-world
geospatial analytics tasks and are designed to test a wide range of spatial
functions and join conditions.
+
+The benchmark uses a realistic but synthetic, transportation-themed dataset to
ensure the queries reflect practical use cases. By running these queries, you
can evaluate and compare the relative performance of different spatial query
engines in a consistent and unbiased manner.
+
+## Before you start
+
+Before running this notebook, ensure that you have installed the packages in
the `requirements.txt` file:
+
+
+```python
+%pip install -r ~/sedona-spatialbench/docs/requirements.txt
+```
+
+ ...
+ ...
+ Note: you may need to restart the kernel to use updated packages.
+
+
+Additionally, install the SpatialBench CLI and generate the synthetic data on
your machine:
+
+```
+# SpatialBench CLI
+cargo install --path ./spatialbench-cli
+# Generate the benchmarking data to the sf1-parquet directory
+spatialbench-cli -s 1 --format=parquet --output-dir sf1-parquet
+```
+
+
+```python
+import sedona.db
+```
+
+
+```python
+sd = sedona.db.connect()
+```
+
+
+```python
+sd.read_parquet(f"../sf1-parquet/building.parquet").to_view("building")
+sd.read_parquet(f"../sf1-parquet/customer.parquet").to_view("customer")
+sd.read_parquet(f"../sf1-parquet/driver.parquet").to_view("driver")
+sd.read_parquet(f"../sf1-parquet/trip.parquet").to_view("trip")
+sd.read_parquet(f"../sf1-parquet/vehicle.parquet").to_view("vehicle")
+sd.read_parquet(f"../sf1-parquet/zone.parquet").to_view("zone")
+```
+
+## Q1: Find trips starting within 50km of Sedona city center, ordered by
distance
+
+**Real-life scenario:** Identify and rank trips by proximity to a city center
for urban planning and transportation analysis.
+
+This query finds all taxi or rideshare trips that started within 50 kilometers
of downtown Sedona, Arizona. For each qualifying trip, it shows the trip ID,
pickup coordinates, pickup time, and calculates the exact distance from the
pickup location to Sedona's city center. The results are sorted to show the
trips that picked up closest to downtown Sedona first, making it easy to see
which rides originated nearest to the city center.
+
+**Spatial query characteristics tested:**
+
+1. Distance-based spatial filtering (ST_DWithin)
+2. Distance calculation to a fixed point
+3. Coordinate extraction (ST_X, ST_Y)
+4. Ordering by spatial distance
+
+
+```python
+sd.sql("""
+SELECT
+ t.t_tripkey,
+ ST_X(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lon,
+ ST_Y(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lat,
+ t.t_pickuptime,
+ ST_Distance(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromText('POINT (-111.7610 34.8697)')
+ ) AS distance_to_center
+FROM trip t
+WHERE ST_DWithin(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromText('POINT (-111.7610 34.8697)'),
+ 0.45 -- 50km radius around Sedona center in degrees
+)
+ORDER BY distance_to_center ASC, t.t_tripkey ASC
+""").show(3)
+```
+
+
┌───────────┬────────────────┬──────────────┬─────────────────────┬──────────────────────┐
+ │ t_tripkey ┆ pickup_lon ┆ pickup_lat ┆ t_pickuptime ┆
distance_to_center │
+ │ int64 ┆ float64 ┆ float64 ┆ timestamp ┆
float64 │
+
╞═══════════╪════════════════╪══════════════╪═════════════════════╪══════════════════════╡
+ │ 1451371 ┆ -111.791052127 ┆ 34.826733457 ┆ 1998-08-12T06:47:01 ┆
0.05243333056935387 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 2047835 ┆ -111.706967009 ┆ 34.883889472 ┆ 1992-04-08T07:36:09 ┆
0.055865062714050374 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 3936870 ┆ -111.827619221 ┆ 34.882950924 ┆ 1998-11-10T13:32:07 ┆
0.06792427838042854 │
+
└───────────┴────────────────┴──────────────┴─────────────────────┴──────────────────────┘
+
+
+## Q2: Count trips starting within Coconino County (Arizona) zone
+
+**Real-life scenario:** Count all trips originating within a specific
administrative boundary (county) for regional transportation statistics.
+
+This query counts how many taxi or rideshare trips started within Coconino
County, Arizona. It does this by checking if each trip's pickup location falls
inside the county's geographic boundaries. The result is a simple count showing
the total number of trips that originated anywhere within Coconino County.
+
+**Spatial query characteristics tested:**
+
+1. Point-in-polygon spatial filtering (ST_Intersects)
+2. Subquery with spatial geometry selection
+3. Simple aggregation on spatially filtered data
+
+
+```python
+sd.sql("""
+SELECT COUNT(*) AS trip_count_in_coconino_county
+FROM trip t
+WHERE ST_Intersects(
+ ST_GeomFromWKB(t.t_pickuploc),
+ (
+ SELECT ST_GeomFromWKB(z.z_boundary)
+ FROM zone z
+ WHERE z.z_name = 'Coconino County'
+ LIMIT 1
+ )
+)
+""").show(3)
+```
+
+ ┌───────────────────────────────┐
+ │ trip_count_in_coconino_county │
+ │ int64 │
+ ╞═══════════════════════════════╡
+ │ 541 │
+ └───────────────────────────────┘
+
+
+## Q3: Monthly trip statistics within a 15km radius of the Sedona city center
+
+**Real-life scenario:** Track monthly travel trends and performance metrics in
a metropolitan area with seasonal analysis.
+
+This query analyzes taxi and rideshare trip patterns around Sedona, Arizona,
by grouping trips into monthly summaries. It looks at all trips that started
within a 15-kilometer area around Sedona (a 10km box plus 5km buffer) and
calculates key statistics for each month, including total number of trips,
average trip distance, average trip duration, and average fare. The results are
organized chronologically by month, allowing you to see seasonal trends and
changes in ride patterns over time in the Sedona area.
+
+**Spatial query characteristics tested:**
+
+1. Distance-based spatial filtering (ST_DWithin) with buffer
+2. Temporal grouping (monthly aggregation)
+3. Multiple statistical aggregations on spatially filtered data
+
+
+```python
+sd.sql("""
+SELECT
+ DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,
+ COUNT(t.t_tripkey) AS total_trips,
+ AVG(t.t_distance) AS avg_distance,
+ AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration,
+ AVG(t.t_fare) AS avg_fare
+FROM trip t
+WHERE ST_DWithin(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromText('POLYGON((
+ -111.9060 34.7347, -111.6160 34.7347,
+ -111.6160 35.0047, -111.9060 35.0047,
+ -111.9060 34.7347
+ ))'), -- Bounding box around Sedona
+ 0.045 -- Additional 5km buffer in degrees
+)
+GROUP BY pickup_month
+ORDER BY pickup_month
+""").show(3)
+```
+
+
┌─────────────────────┬─────────────┬──────────────┬─────────────────────────────────┬─────────────┐
+ │ pickup_month ┆ total_trips ┆ avg_distance ┆
avg_duration ┆ avg_fare │
+ │ timestamp ┆ int64 ┆ decimal128 ┆ duration
┆ decimal128 │
+
╞═════════════════════╪═════════════╪══════════════╪═════════════════════════════════╪═════════════╡
+ │ 1992-04-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1 hours 23
mins 47.000 … ┆ 0.000075000 │
+
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 1992-07-01T00:00:00 ┆ 1 ┆ 0.000010000 ┆ 0 days 0 hours 58
mins 58.000 … ┆ 0.000040000 │
+
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 1994-02-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1 hours 23
mins 50.000 … ┆ 0.000050000 │
+
└─────────────────────┴─────────────┴──────────────┴─────────────────────────────────┴─────────────┘
+
+
+## Q4: Zone distribution of top 1000 trips by tip amount
+
+**Real-life scenario:** Analyze the geographic distribution of high-value
trips (by tip amount) to understand premium service areas.
+
+This query identifies which neighborhoods or zones produced the most generous
tippers by analyzing the top 1000 highest-tipping trips. It first finds the
1000 trips with the largest tips, then determines which geographic zones or
neighborhoods those pickup locations fall within, and counts how many of these
high-tip trips originated from each area. The results show a ranking of zones
by the number of big tippers they produced, helping identify the most lucrative
pickup areas for drivers seeking high-tip rides.
+
+**Spatial query characteristics tested:**
+
+1. Subquery with ordering and limiting
+2. Point-in-polygon spatial join (ST_Within)
+3. Aggregation on spatially joined results
+4. Multi-step query with spatial filtering and grouping
+
+
+```python
+sd.sql("""
+SELECT
+ z.z_zonekey,
+ z.z_name,
+ COUNT(*) AS trip_count
+FROM
+ zone z
+ JOIN (
+ SELECT t.t_pickuploc
+ FROM trip t
+ ORDER BY t.t_tip DESC, t.t_tripkey ASC
+ LIMIT 1000
+ ) top_trips
+ ON ST_Within(
+ ST_GeomFromWKB(top_trips.t_pickuploc),
+ ST_GeomFromWKB(z.z_boundary)
+ )
+GROUP BY z.z_zonekey, z.z_name
+ORDER BY trip_count DESC, z.z_zonekey ASC
+""").show(3)
+```
+
+ ┌───────────┬─────────────────────────────────┬────────────┐
+ │ z_zonekey ┆ z_name ┆ trip_count │
+ │ int64 ┆ utf8 ┆ int64 │
+ ╞═══════════╪═════════════════════════════════╪════════════╡
+ │ 65008 ┆ Ndélé ┆ 35 │
+ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 147530 ┆ 乐山市 ┆ 27 │
+ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 150276 ┆ 锡林郭勒盟 ᠰᠢᠯᠢ ᠶᠢᠨ ᠭᠣᠣᠯ ᠠᠶᠢᠮᠠᠭ ┆ 19 │
+ └───────────┴─────────────────────────────────┴────────────┘
+
+
+## Q5: Monthly travel patterns for repeat customers (convex hull of dropoff
locations)
+
+**Real-life scenario:** Analyze the geographic spread of travel patterns for
frequent customers to understand their mobility behavior.
+
+This query analyzes the monthly travel patterns of frequent customers by
measuring how much geographic area they cover with their trips. For each
customer who took more than five trips in a month, it calculates the size of
the "travel hull" - the area enclosed by connecting all their dropoff locations
that month. The results reveal which customers have the most expansive travel
patterns, helping to identify power users who cover large geographic areas
versus those who stick to smaller, local areas.
+
+**Spatial query characteristics tested:**
+
+1. Spatial aggregation (ST_Collect/ARRAY_AGG)
+2. Convex hull computation (ST_ConvexHull)
+3. Area calculation on complex geometries
+4. Temporal and customer-based grouping with spatial operations
+
+
+```python
+sd.sql("""
+SELECT
+ c.c_custkey,
+ c.c_name AS customer_name,
+ DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,
+ ST_Area(
+ ST_ConvexHull(ST_Collect(ST_GeomFromWKB(t.t_dropoffloc)))
+ ) AS monthly_travel_hull_area,
+ COUNT(*) as dropoff_count
+FROM trip t
+JOIN customer c
+ ON t.t_custkey = c.c_custkey
+GROUP BY c.c_custkey, c.c_name, pickup_month
+HAVING dropoff_count > 5 -- Only include repeat customers
+ORDER BY monthly_travel_hull_area DESC, c.c_custkey ASC
+""").show(3)
+```
+
+
┌───────────┬────────────────────┬─────────────────────┬────────────────────┬───────────────┐
+ │ c_custkey ┆ customer_name ┆ pickup_month ┆
monthly_travel_hul ┆ dropoff_count │
+ │ int64 ┆ utf8 ┆ timestamp ┆ l_area…
┆ int64 │
+
╞═══════════╪════════════════════╪═════════════════════╪════════════════════╪═══════════════╡
+ │ 25975 ┆ Customer#000025975 ┆ 1992-02-01T00:00:00 ┆
34941.303419053635 ┆ 10 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 12061 ┆ Customer#000012061 ┆ 1997-03-01T00:00:00 ┆
34607.53871953154 ┆ 14 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 21418 ┆ Customer#000021418 ┆ 1993-08-01T00:00:00 ┆
34465.32323910264 ┆ 9 │
+
└───────────┴────────────────────┴─────────────────────┴────────────────────┴───────────────┘
+
+
+## Q6: Zone statistics for trips within a 50km radius of the Sedona city center
+
+**Real-life scenario:** Analyze trip patterns in zones within a metropolitan
area around a specific city center.
+
+This query analyzes ride activity across all neighborhoods and zones within a
50-kilometer area around Sedona, Arizona. It identifies which zones had the
most pickup activity by counting total trips that started in each region. Also,
it calculates the average trip cost and duration for rides originating from
each zone. The results are ranked by pickup volume, showing which neighborhoods
or areas generate the most ride demand and their typical trip characteristics
within the greater Sedona region.
+
+**Spatial query characteristics tested:**
+
+1. Polygon containment check (ST_Contains) with bounding box
+2. Point-in-polygon spatial join (ST_Within)
+
+
+```python
+sd.sql("""
+SELECT
+ z.z_zonekey,
+ z.z_name,
+ COUNT(t.t_tripkey) AS total_pickups,
+ AVG(t.t_distance) AS avg_distance, -- Corrected from t_totalamount
+ AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration
+FROM trip t, zone z
+WHERE ST_Intersects(
+ ST_GeomFromText('POLYGON((
+ -112.2110 34.4197, -111.3110 34.4197,
+ -111.3110 35.3197, -112.2110 35.3197,
+ -112.2110 34.4197
+ ))'), -- Bounding box around Sedona
+ ST_GeomFromWKB(z.z_boundary)
+ )
+ AND ST_Within(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromWKB(z.z_boundary)
+ )
+GROUP BY z.z_zonekey, z.z_name
+ORDER BY total_pickups DESC, z.z_zonekey ASC
+""").show(3)
+```
+
+
┌───────────┬─────────────────┬───────────────┬──────────────┬────────────────────────────────────┐
+ │ z_zonekey ┆ z_name ┆ total_pickups ┆ avg_distance ┆
avg_duration │
+ │ int64 ┆ utf8 ┆ int64 ┆ decimal128 ┆
duration │
+
╞═══════════╪═════════════════╪═══════════════╪══════════════╪════════════════════════════════════╡
+ │ 30084 ┆ Coconino County ┆ 541 ┆ 0.000030406 ┆ 0 days 1
hours 45 mins 16.591 secs │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 30083 ┆ Yavapai County ┆ 292 ┆ 0.000027157 ┆ 0 days 1
hours 36 mins 43.647 secs │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 29488 ┆ Gila County ┆ 39 ┆ 0.000021282 ┆ 0 days 1
hours 16 mins 59.769 secs │
+
└───────────┴─────────────────┴───────────────┴──────────────┴────────────────────────────────────┘
+
+
+## Q7: Detect potential route detours by comparing reported vs. geometric
distances
+
+**Real-life scenario:** Identify suspicious trips where the reported distance
significantly exceeds the straight-line distance, potentially indicating fare
manipulation.
+
+This query analyzes how much taxi and rideshare trips deviate from the most
direct route by comparing the actual reported trip distance to the
straight-line distance between pickup and dropoff points. It calculates a
"detour ratio" that shows how much longer the actual route was compared to
flying in a straight line. For example, a ratio of 1.5 means the trip was 50%
longer than the direct path. The results are sorted to show the trips with the
highest detour ratios first, helping identify routes that took significant
detours due to traffic, road layouts, or other factors.
+
+**Spatial query characteristics tested:**
+
+1. Line geometry creation (ST_MakeLine)
+2. Length calculation (ST_Length)
+3. Coordinate system conversion and distance calculations
+4. Ratio-based filtering on geometric vs. reported measurements
+
+
+```python
+sd.sql("""
+WITH trip_lengths AS (
+ SELECT
+ t.t_tripkey,
+ t.t_distance AS reported_distance_m,
+ ST_Length(
+ ST_MakeLine(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromWKB(t.t_dropoffloc)
+ )
+ ) * 111111 AS line_distance_m -- Approx. meters per degree
Review Comment:
The magic number 111111 for meters per degree conversion is not accurate for
all latitudes. Consider using a more precise conversion formula or adding a
comment explaining the approximation's limitations.
```suggestion
ST_Distance(
ST_GeomFromWKB(t.t_pickuploc)::geography,
ST_GeomFromWKB(t.t_dropoffloc)::geography
) AS line_distance_m -- Straight-line (great-circle) distance in
meters
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]