alamb commented on issue #17718:
URL: https://github.com/apache/datafusion/issues/17718#issuecomment-3338050011

   More feedback from @zhangfengcdt on Discord, which I think is also really 
interesting:
   
   
   Really nice discussion! When we implemented the KNN join in SedonaDB, we had 
similar concerns, ran some experiments, and found that the marker function + 
optimizer rule pattern works well for our case. There were two main challenges: 
(1) asymmetric KNN execution, meaning we need to build a spatial index on the 
build side and, for each left-side geometry, find the k nearest neighbors; and 
(2) join order control, meaning we need to control the predicate evaluation 
order.
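
To make the asymmetry concrete, here is a minimal Python sketch of a KNN join. It is not SedonaDB code: `knn_join` and the sample points are hypothetical, and a brute-force scan stands in for the spatial index built on the build side.

```python
import heapq
import math


def knn_join(left, right, k):
    """For each left point, pair it with its k nearest right points."""
    # Assumption: a real engine would probe a spatial index built once over
    # `right`; the linear scan below is only a stand-in for that index.
    pairs = []
    for lp in left:
        nearest = heapq.nsmallest(k, right, key=lambda rp: math.dist(lp, rp))
        pairs.extend((lp, rp) for rp in nearest)
    return pairs


left = [(0.0, 0.0), (10.0, 0.0)]
right = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0)]

forward = knn_join(left, right, k=1)   # one match per left point
backward = knn_join(right, left, k=1)  # one match per right point
```

Swapping the inputs changes both the number of rows and the pairs themselves, which is why the planner cannot treat the two sides of a KNN join as interchangeable.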
   
   For the first challenge, we register a stub scalar UDF, and in the query 
planner we detect ST_KNN predicates in join filters and transform them into a 
specialized SpatialJoinExec physical plan with KNN semantics. For the second 
challenge, we add a barrier function that serves as an optimization barrier to 
prevent filter pushdown and control predicate evaluation order. This is 
critical for maintaining semantic correctness (KNN then filter vs. filter then 
KNN). Both work well for the purpose.
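
A small Python sketch of why the evaluation order matters. The data and the `k_nearest` helper are hypothetical and only illustrate the semantics; they are not how SedonaDB implements the barrier.

```python
import heapq
import math


def k_nearest(query, candidates, k):
    """Return the k candidates closest to `query` (brute force)."""
    return heapq.nsmallest(k, candidates, key=lambda c: math.dist(query, c["xy"]))


query = (0.0, 0.0)
places = [
    {"name": "a", "xy": (1.0, 0.0), "open": False},
    {"name": "b", "xy": (2.0, 0.0), "open": True},
    {"name": "c", "xy": (3.0, 0.0), "open": True},
]

# KNN then filter: take the 2 nearest places, then keep the open ones.
knn_then_filter = [p["name"] for p in k_nearest(query, places, 2) if p["open"]]

# Filter then KNN: keep the open places, then take the 2 nearest of those.
open_places = [p for p in places if p["open"]]
filter_then_knn = [p["name"] for p in k_nearest(query, open_places, 2)]
```

The first order yields only `b`, while the second yields `b` and `c`. If the optimizer pushes the filter below the KNN join, it silently switches from the first meaning to the second, which is what the barrier prevents.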
   
   The marker function + optimizer rule pattern is indeed the most practical 
approach for adding custom join strategies to DataFusion/Arrow-based systems. 
It's more robust than it might seem because:
   
   - The optimizer rule has full control over when to apply the transformation
   - The stub function provides type checking and documentation
   - It integrates naturally with SQL without parser modifications
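
The pattern can be sketched on a toy plan tree in Python. This is an illustration under stated assumptions, not DataFusion or SedonaDB code: the node classes, `rewrite_knn`, and the three-argument form of `st_knn` are all hypothetical, and `KnnJoin` merely plays the role that SpatialJoinExec plays in the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:      # a function call inside an expression
    name: str
    args: tuple


@dataclass(frozen=True)
class Scan:      # leaf node: read a table
    table: str


@dataclass(frozen=True)
class Join:      # generic join with an arbitrary filter expression
    left: object
    right: object
    filter: object


@dataclass(frozen=True)
class Filter:
    input: object
    predicate: object


@dataclass(frozen=True)
class KnnJoin:   # specialized node produced by the rewrite
    left: object
    right: object
    k: int


def st_knn(*_args):
    # Marker only: it gives the query a name and signature to check against,
    # but it is never meant to be evaluated row by row.
    raise NotImplementedError("st_knn must be rewritten by the planner")


def rewrite_knn(plan):
    """Replace every Join whose filter is st_knn(l, r, k) with a KnnJoin."""
    if isinstance(plan, Join):
        left, right = rewrite_knn(plan.left), rewrite_knn(plan.right)
        f = plan.filter
        if isinstance(f, Call) and f.name == "st_knn" and len(f.args) == 3:
            return KnnJoin(left, right, k=f.args[2])
        return Join(left, right, f)  # not a KNN join: leave it alone
    if isinstance(plan, Filter):
        return Filter(rewrite_knn(plan.input), plan.predicate)
    return plan


plan = Filter(
    Join(Scan("a"), Scan("b"), Call("st_knn", ("a.geom", "b.geom", 5))),
    Call("gt", ("b.rating", 4)),
)
rewritten = rewrite_knn(plan)
```

Because the rule only fires on an exact match, joins with any other filter pass through unchanged, and the outer filter stays above the KNN join rather than being merged into it.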
   
   
   We use a similar approach in Apache SedonaSpark as well. For reference, I 
would recommend these steps for custom joins:
   
   1. Use marker functions for SQL integration
   2. Implement robust pattern matching in optimizer rules
   3. Provide optimization barriers when semantics are order-dependent
   4. Document the transformation clearly for users
   5. Consider providing both SQL and DataFrame APIs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

