Rajesh Balamohan created SPARK-14588: ----------------------------------------
Summary: Consider getting column stats from files (wherever feasible) to get better stats for joins Key: SPARK-14588 URL: https://issues.apache.org/jira/browse/SPARK-14588 Project: Spark Issue Type: Improvement Components: SQL Reporter: Rajesh Balamohan Broadcast join is determined by "spark.sql.autoBroadcastJoinThreshold". Stats for this is determined from the files and by determining the projected columns (internally it assumes 20 bytes for string columns). However, estimated stats could be invalid if the dataset contains greater than 20 bytes for string columns . In such instances, broadcast join would not be invoked. File formats like ORC would be able to provide the raw data size for the projected columns. It might be good to consider those (whenever available) to determine the accurate stats for broadcast threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org