Feng Zhang created SEDONA-648:
---------------------------------

             Summary: Implement Distributed K Nearest Neighbor Join
                 Key: SEDONA-648
                 URL: https://issues.apache.org/jira/browse/SEDONA-648
             Project: Apache Sedona
          Issue Type: New Feature
            Reporter: Feng Zhang


A geospatial k-Nearest Neighbors (kNN) join is a kNN join over geospatial 
data. It identifies the k nearest neighbors of a given spatial point or region 
by geographic proximity, typically using spatial coordinates and a suitable 
distance metric such as Euclidean or great-circle distance.
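
To make the two metrics concrete, here is a minimal Python sketch of both. It is illustrative only and not Sedona's implementation: the function names, the (x, y) / (lon, lat) tuple layout, and the spherical Earth radius are assumptions made for this example.

```python
import math

# Assumption: a spherical Earth with the mean radius, in metres.
EARTH_RADIUS_M = 6_371_008.8

def euclidean(p, q):
    """Planar distance between two (x, y) points, in the units of the projection."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def great_circle(p, q, radius=EARTH_RADIUS_M):
    """Haversine distance between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))
```

Euclidean distance is only meaningful on projected coordinates, while the haversine formula expects longitude/latitude in degrees.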

A kNN join operation involves two datasets. For each record in the first 
dataset, it finds the k-nearest neighbors from the second dataset based on a 
given distance metric. In a distributed environment, this process involves 
several challenges:
 * {*}Data Partitioning{*}: Data needs to be partitioned across different nodes 
in a way that minimizes the inter-node communication and balances the load 
among nodes.
 * {*}Efficient Search{*}: Implementing efficient algorithms that can quickly 
find the k-nearest neighbors among potentially billions of data points.
 * {*}Data Locality{*}: Keeping data as close as possible to where it is 
processed to reduce network transfers and latency.
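
As a rough illustration of the partitioning challenge, the sketch below assigns points to the cells of a uniform grid, with each cell standing in for one partition. The uniform grid, the function names, and the sample points are assumptions chosen for brevity; this is not the partitioning scheme proposed for Sedona.

```python
import math
from collections import defaultdict

def grid_cell(point, cell_size):
    """Map an (x, y) point to the integer (col, row) index of its grid cell."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))

def partition_by_grid(points, cell_size):
    """Group points by grid cell; each cell stands in for one partition."""
    partitions = defaultdict(list)
    for p in points:
        partitions[grid_cell(p, cell_size)].append(p)
    return dict(partitions)

# Hypothetical sample data.
points = [(0.5, 0.5), (0.9, 0.1), (1.5, 0.2), (-0.1, 2.3)]
parts = partition_by_grid(points, cell_size=1.0)
```

A uniform grid makes the trade-offs visible: skewed data produces unbalanced cells, and a point's true nearest neighbors may sit in an adjacent cell, so a correct join cannot search each partition in isolation.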

- Syntax Definition:

```sql
SELECT <column_list>
FROM <tableR>
JOIN <tableS> ON ST_KNN(<tableR.column>, <tableS.column>, <k>, <use_spheroid>)
```

- **Parameters:**
    - **`<column_list>`**: The list of columns to be selected from both tables.
    - **`<tableR>`**: The left table in the join.
    - **`<tableS>`**: The right table in the join.
    - **`<tableR.column>`**: The column from the left table containing 
geometric data.
    - **`<tableS.column>`**: The column from the right table containing 
geometric data.
    - **`<k>`**: The number of nearest neighbors from the right table to match 
with each record of the left table.
    - **`<use_spheroid>`**: Whether the distance calculation is based on a 
spherical coordinate system (e.g., WGS 84 longitude/latitude, SRID=4326). Set 
it to false to compute distances in a projected coordinate system (e.g., 
Mercator EPSG:3785).

- Example Usage:

```sql
SELECT R.id, S.id, R.location, S.location
FROM TableR R
JOIN TableS S ON ST_KNN(R.location, S.location, 5, true)
```

In this example, each row of **`TableR`** is joined with its 5 approximate 
nearest neighbors in **`TableS`**, based on their respective **`location`** 
columns and using the great-circle distance (GCD) metric.
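
The intended join semantics can be sketched with a single-node, brute-force reference in Python. This is an exact baseline written for illustration, whereas the proposed feature is distributed and may return approximate results. The function name, the (id, point) row layout, and the sample rows are hypothetical. It defaults to planar Euclidean distance for brevity; a great-circle function can be passed as `dist` instead.

```python
import heapq
import math

def knn_join(left, right, k, dist=math.dist):
    """For each left row, pair it with the k right rows at the smallest distance.

    Rows are (id, point) pairs; the result is a list of (left_id, right_id) pairs.
    """
    pairs = []
    for r_id, r_pt in left:
        # Brute force: score every right row against this left row.
        nearest = heapq.nsmallest(k, right, key=lambda s: dist(r_pt, s[1]))
        pairs.extend((r_id, s_id) for s_id, _ in nearest)
    return pairs

# Hypothetical sample rows.
table_r = [("r1", (0.0, 0.0)), ("r2", (10.0, 10.0))]
table_s = [("s1", (1.0, 0.0)), ("s2", (0.0, 2.0)), ("s3", (9.0, 9.0))]
result = knn_join(table_r, table_s, k=2)
```

Each left row contributes at most k output rows, so the result has up to |R| * k pairs. The cost is O(|R| * |S|) distance evaluations, which is what the partitioning and indexing described above aim to avoid.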



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
