Feng Zhang created SEDONA-648:
---------------------------------

             Summary: Implement Distributed K Nearest Neighbor Join
                 Key: SEDONA-648
                 URL: https://issues.apache.org/jira/browse/SEDONA-648
             Project: Apache Sedona
          Issue Type: New Feature
            Reporter: Feng Zhang

A geospatial k-Nearest Neighbors (kNN) join is a kNN join over geospatial data. It identifies the k nearest neighbors of a given spatial point or region by geographic proximity, typically using spatial coordinates and a suitable distance metric such as Euclidean or great-circle distance.

A kNN join operates on two datasets. For each record in the first dataset, it finds the k nearest neighbors in the second dataset under a given distance metric.

In a distributed environment, this process involves several challenges:

* **Data Partitioning**: Data must be partitioned across nodes in a way that minimizes inter-node communication and balances the load among nodes.
* **Efficient Search**: The search algorithm must quickly find the k nearest neighbors among potentially billions of data points.
* **Data Locality**: Data should be kept as close as possible to where it is processed, to reduce network transfers and latency.

- Syntax Definition:

```sql
SELECT <column_list>
FROM <tableR>
JOIN <tableS>
ON ST_KNN(<tableR.column>, <tableS.column>, <k>, <use_spheroid>)
```

- **Parameters:**
  - **`<column_list>`**: The list of columns to select from both tables.
  - **`<tableR>`**: The left table in the join.
  - **`<tableS>`**: The right table in the join.
  - **`<tableR.column>`**: The column from the left table containing geometric data.
  - **`<tableS.column>`**: The column from the right table containing geometric data.
  - **`<k>`**: The number of nearest neighbors to match between the tables.
  - **`<use_spheroid>`**: Whether the distance calculation is based on a spherical coordinate system (e.g., WGS 84 longitude/latitude, SRID=4326). Set it to false to use a projected coordinate system (e.g., Mercator, EPSG:3785).

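To make the join semantics described above concrete, here is a minimal single-machine sketch in Python. It is a brute-force reference, not the distributed algorithm this issue proposes, and every name in it (`knn_join`, `euclidean`, `haversine`, `EARTH_RADIUS_M`) is hypothetical. It also assumes point geometries only, and it approximates the `use_spheroid` option with haversine distance on a sphere rather than a true spheroidal calculation.

```python
import heapq
import math

# Mean Earth radius in metres; an assumption of this sketch.
EARTH_RADIUS_M = 6_371_008.8


def euclidean(p, q):
    """Planar distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def haversine(p, q):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def knn_join(queries, objects, k, use_spheroid=False):
    """Brute-force kNN join: for each query point, the k closest object points.

    queries, objects: dicts mapping an id to a point tuple.
    Returns a list of (query_id, object_id, distance) rows.
    """
    dist = haversine if use_spheroid else euclidean
    rows = []
    for qid, qpt in queries.items():
        # Keep only the k smallest (distance, object_id) pairs for this query.
        nearest = heapq.nsmallest(
            k, ((dist(qpt, opt), oid) for oid, opt in objects.items())
        )
        rows.extend((qid, oid, d) for d, oid in nearest)
    return rows
```

This compares every query against every object, so the cost grows with the product of the two dataset sizes. That quadratic cost is what the partitioning and search challenges listed above are meant to avoid at scale.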
- Example Usage:

```sql
SELECT R.id, S.id, R.location, S.location
FROM TableS S
JOIN TableR R
ON ST_KNN(R.location, S.location, 5, true)
```

In this example, **`TableS`** and **`TableR`** are joined on the 5 approximate nearest neighbors in their respective **`location`** columns, using the great-circle distance (GCD) metric because `<use_spheroid>` is set to `true`.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)