[I] [CH] A bad case for joining with mixed join conditions [incubator-gluten]

via GitHub Fri, 09 Aug 2024 00:10:28 -0700


lgbo-ustc opened a new issue, #6768:
URL: https://github.com/apache/incubator-gluten/issues/6768


   ### Backend
   
   CH (ClickHouse)
   
   ### Bug description
   
   
   We hit a query in our production environment that has really bad join performance. The query looks like this:
   ```sql
   select *
   from t1
   left join t2
     on t1.uid = t2.uid
    and (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)
   ```
   
   There are two main problems:
   
   First, the right table is very large, over 5,000,000,000 rows, so using it to build the join hash table is very resource intensive.
   
   Second, when only the equi-join condition `t1.uid = t2.uid` is applied, it can produce a very large number of matches, more than 5,000,000,000 * 100 rows. But after the filter condition `(t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)` is applied to these matches, fewer than 100,000 rows are left.
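
To make the second problem concrete, here is a toy Python sketch of a hash join that builds on `uid` only and then applies the OR condition to every candidate pair. It is an illustration under assumptions, not the actual Gluten or ClickHouse implementation: the function name `left_join_mixed`, the dict-based rows, and the toy data are all hypothetical.

```python
from collections import defaultdict

def left_join_mixed(t1, t2):
    """Toy left join: hash on uid only, then filter candidates with the OR condition."""
    # Build phase: the whole right table goes into a hash table keyed by uid.
    build = defaultdict(list)
    for r in t2:
        build[r["uid"]].append(r)

    candidates = 0  # pairs that match on uid alone, before the OR filter
    out = []
    for l in t1:
        matched = False
        for r in build.get(l["uid"], ()):
            candidates += 1
            if l["id1"] == r["id1"] or l["id2"] == r["id2"] or l["id3"] == r["id3"]:
                out.append((l, r))
                matched = True
        if not matched:
            out.append((l, None))  # a left join keeps unmatched left rows
    return out, candidates

# Hypothetical toy data: 2 uids with 100 right rows each.
t1 = [{"uid": u, "id1": 0, "id2": -1, "id3": -2} for u in range(2)]
t2 = [{"uid": u, "id1": i, "id2": i + 1000, "id3": i + 2000}
      for u in range(2) for i in range(100)]

rows, candidates = left_join_mixed(t1, t2)
print(candidates, len(rows))
```

In this toy case the probe evaluates 200 candidate pairs but only 2 rows survive the filter. The query above shows the same shape at a much larger scale: most of the work goes into pairs that the OR condition then discards.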
   
   
   
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


