Rajesh Balamohan created SPARK-14588:
----------------------------------------

             Summary: Consider getting column stats from files (wherever 
feasible) to get better stats for joins
                 Key: SPARK-14588
                 URL: https://issues.apache.org/jira/browse/SPARK-14588
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Rajesh Balamohan


Whether a broadcast join is used is governed by 
"spark.sql.autoBroadcastJoinThreshold". The size compared against this threshold 
is estimated from the files and from the projected columns (internally, 20 bytes 
is assumed for each string column). However, the estimate can be inaccurate if 
the dataset's string columns hold values larger than 20 bytes. In such cases, 
a broadcast join would not be invoked.
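
To make the effect concrete, here is a minimal sketch of one way such an 
estimate can go wrong. It is a simplified model, not Spark's actual code: it 
assumes the projected size is derived by scaling the total file size by the 
ratio of assumed row widths, and it ignores compression. The table, column 
names, and helper function are hypothetical; the 20-byte figure comes from the 
description above, and 10 MB is the default value of 
spark.sql.autoBroadcastJoinThreshold.

```python
# Illustrative model only; not Spark's implementation.
DEFAULT_STRING_SIZE = 20                 # bytes assumed per string value
BROADCAST_THRESHOLD = 10 * 1024 * 1024   # default autoBroadcastJoinThreshold (10 MB)


def estimate_projected_size(total_file_bytes, all_widths, projected_widths):
    """Scale the total size by the ratio of projected to full assumed row width."""
    return total_file_bytes * sum(projected_widths) // sum(all_widths)


# Hypothetical table: a 4-byte int column and a string column whose values
# really average 2000 bytes each.
rows = 1_000_000
actual_widths = {"id": 4, "payload": 2000}
assumed_widths = {"id": 4, "payload": DEFAULT_STRING_SIZE}
total_file_bytes = rows * sum(actual_widths.values())

# The query projects only "id".
estimated = estimate_projected_size(
    total_file_bytes, assumed_widths.values(), [assumed_widths["id"]]
)
actual = rows * actual_widths["id"]

would_broadcast_estimated = estimated <= BROADCAST_THRESHOLD
would_broadcast_actual = actual <= BROADCAST_THRESHOLD

print(estimated, would_broadcast_estimated)
print(actual, would_broadcast_actual)
```

Under these assumptions the estimate lands far above the threshold while the 
real projected data is well below it, so a join that could have been broadcast 
is planned differently.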

File formats like ORC can provide the raw data size for the projected columns. 
It would be good to use those values (whenever available) to compute more 
accurate stats for the broadcast threshold.
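
The proposal could be sketched roughly as follows: use a per-column raw size 
reported by the file when one is available, and fall back to the default-width 
estimate otherwise. This is only an illustration of the idea. The function, its 
parameters, and the plain dictionary standing in for file-level statistics are 
all hypothetical and do not correspond to any Spark or ORC API.

```python
def projected_size_bytes(projected, row_count, default_widths, raw_sizes=None):
    """Estimate the size of the projected columns in bytes.

    raw_sizes is a hypothetical mapping of column name to the raw data size
    reported by the file format; columns missing from it fall back to
    row_count * default width.
    """
    total = 0
    for col in projected:
        if raw_sizes is not None and col in raw_sizes:
            total += raw_sizes[col]                     # file-provided size
        else:
            total += row_count * default_widths[col]    # default-width guess
    return total


# Hypothetical inputs: 1000 rows, an int column and a string column.
default_widths = {"id": 4, "name": 20}
file_stats = {"name": 150_000}   # pretend the file reports 150 KB for "name"

without_stats = projected_size_bytes(["id", "name"], 1000, default_widths)
with_stats = projected_size_bytes(["id", "name"], 1000, default_widths, file_stats)

print(without_stats, with_stats)
```

Falling back per column keeps the behaviour unchanged for formats that expose 
no such statistics.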



