[GitHub] [spark] wankunde opened a new pull request #35550: [SPARK-38238]Contains Join for Spark SQL

GitBox Thu, 17 Feb 2022 01:22:52 -0800


wankunde opened a new pull request #35550:
URL: https://github.com/apache/spark/pull/35550



   ### What changes were proposed in this pull request?
   
   Optimize string-contains join queries, which can otherwise run for a very long time.
   For example:
   ```sql
   SELECT a.text, b.pattern
   FROM fact_table a
   JOIN patterns b
   ON a.text like concat('%', b.pattern, '%');
   ```
   Or
   ```sql
   SELECT a.text, b.pattern
   FROM fact_table a
   JOIN patterns b
   ON position(b.pattern, a.text) > 0;
   ```
   The time complexity of the query goes from **O(M * N * m * n)** to **O(M * m * max(n))**, where:
   - M = number of records in the fact table
   - N = number of records in the patterns table
   - m = row length of the fact table
   - n = row length of the patterns table
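
The description above gives the bounds but not the algorithm. As a rough illustration only, one structure that yields a bound of the form O(M * m * max(n)) is a trie built over all patterns: each text is scanned once per start position, and each scan walks at most max(n) trie levels, independent of N. The Python sketch below is an assumption about how such a join could work, not the code in this patch; all function names are hypothetical, and patterns are assumed to be non-empty.

```python
END = None  # sentinel key: "a pattern ends at this trie node"


def naive_contains_join(texts, patterns):
    """Nested-loop baseline: test every (text, pattern) pair."""
    return {(t, p) for t in texts for p in patterns if p in t}


def build_trie(patterns):
    """Build a character trie; each node is a dict of child nodes."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = pattern  # remember which pattern ends here
    return root


def trie_contains_join(texts, patterns):
    """Find all (text, pattern) pairs where pattern is a substring of text."""
    root = build_trie(patterns)
    matches = set()
    for text in texts:                     # M rows
        for start in range(len(text)):     # m start positions per row
            node = root
            i = start
            # Walk down the trie; depth is bounded by the longest pattern.
            while i < len(text) and text[i] in node:
                node = node[text[i]]
                i += 1
                if END in node:
                    matches.add((text, node[END]))
    return matches
```

Calling `trie_contains_join(texts, patterns)` on two lists of strings should return the same set of pairs as `naive_contains_join`, which makes the baseline a convenient correctness check for the sketch.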
   
   ### Why are the changes needed?
   
   Before this change, matching many patterns against each row of the fact table could take a very long time.
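
To give a feel for the gap between the two bounds quoted in this description, they can be evaluated for some made-up table sizes. The numbers below are purely hypothetical and are not measurements from this PR; they only show that, when all patterns have similar length, the ratio between the bounds is roughly N.

```python
# Hypothetical sizes, chosen only for illustration (not taken from the PR).
M = 1_000_000  # records in the fact table
N = 10_000     # records in the patterns table
m = 100        # characters per fact-table row
n = 10         # characters per pattern (all equal, so max(n) == n)

naive_ops = M * N * m * n   # O(M * N * m * n)
optimized_ops = M * m * n   # O(M * m * max(n))

print(f"naive bound:     {naive_ops:.1e}")
print(f"optimized bound: {optimized_ops:.1e}")
print(f"ratio:           {naive_ops // optimized_ops}")
```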
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Added unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


