Ashok kumar Rajendran created SPARK-13900:
---------------------------------------------

             Summary: Spark SQL queries with OR condition is not optimized 
properly
                 Key: SPARK-13900
                 URL: https://issues.apache.org/jira/browse/SPARK-13900
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Ashok kumar Rajendran


I have a large table with few billions of rows and have a very small table with 
4 dimensional values. All the data is stored in parquet format. I would like to 
get rows that match any of these dimensions. For example,

Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR 
A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = 
B.dimension4.

The query plan takes this as BroadcastNestedLoopJoin and executes for very long 
time.

If I execute this as Union queries, it takes around 1.5mins for each dimension. 
Each query internally does BroadcastHashJoin.

Select field1, field2 from A, B where A.dimension1 = B.dimension1
UNION ALL
Select field1, field2 from A, B where A.dimension2 = B.dimension2
UNION ALL
Select field1, field2 from A, B where  A.dimension3 = B.dimension3
UNION ALL
Select field1, field2 from A, B where  A.dimension4 = B.dimension4.

This is obviously not an optimal solution as it makes multiple scanning at same 
table but it gives result much better than OR condition. 

Seems the SQL optimizer is not working properly which causes huge performance 
impact on this type of OR query.

Please correct me if I miss anything here. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to