Ashok kumar Rajendran created SPARK-13900:
---------------------------------------------
Summary: Spark SQL queries with OR condition is not optimized
properly
Key: SPARK-13900
URL: https://issues.apache.org/jira/browse/SPARK-13900
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.0
Reporter: Ashok kumar Rajendran
I have a large table with few billions of rows and have a very small table with
4 dimensional values. All the data is stored in parquet format. I would like to
get rows that match any of these dimensions. For example,
Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR
A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 =
B.dimension4.
The query plan takes this as BroadcastNestedLoopJoin and executes for very long
time.
If I execute this as Union queries, it takes around 1.5mins for each dimension.
Each query internally does BroadcastHashJoin.
Select field1, field2 from A, B where A.dimension1 = B.dimension1
UNION ALL
Select field1, field2 from A, B where A.dimension2 = B.dimension2
UNION ALL
Select field1, field2 from A, B where A.dimension3 = B.dimension3
UNION ALL
Select field1, field2 from A, B where A.dimension4 = B.dimension4.
This is obviously not an optimal solution as it makes multiple scanning at same
table but it gives result much better than OR condition.
Seems the SQL optimizer is not working properly which causes huge performance
impact on this type of OR query.
Please correct me if I miss anything here.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]