[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gopal V updated HIVE-4486:
--------------------------
    Description:
While looking at log files for SMB joins in Hive, I noticed that the actual join operator did not account for a significant fraction of the time spent. Most of the time was spent parsing configuration files.

To confirm, I put log lines in the HiveConf constructor and eventually made the following edit to the code

{code}
--- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
+++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
@@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
    * @return list of file status entries
    */
   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
-    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
-    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
+    boolean recursive = false;
     if (!recursive) {
       return fs.listStatus(p);
     }
{code}

And re-ran my query to compare timings.

|| ||Before||After||
|Cumulative CPU|731.07 sec|386.0 sec|
|Total time|347.66 sec|218.855 sec|

The query used was

{code}
INSERT OVERWRITE LOCAL DIRECTORY '/grid/0/smb/'
select inv_item_sk
from
  inventory inv
  join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
limit 100000
;
{code}

It was run on a scale=2 TPC-DS data set, where both store_sales and inventory are bucketed into 4 buckets, with store_sales split into 7 partitions and inventory into 261 partitions.

78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs are attached.

  was:
Same text as above, except that the header row of the timing table read ||Before||After|| without the leading empty cell.

> FetchOperator slows down SMB map joins by 50% when there are many partitions
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-4486
>                 URL: https://issues.apache.org/jira/browse/HIVE-4486
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>         Environment: Ubuntu LXC 12.10
>            Reporter: Gopal V
>            Priority: Minor
>         Attachments: smb-profile.html
>
>
> While looking at log files for SMB joins in Hive, I noticed that the
> actual join operator did not account for a significant fraction of the
> time spent. Most of the time was spent parsing configuration files.
> To confirm, I put log lines in the HiveConf constructor and eventually made
> the following edit to the code
> {code}
> --- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
> +++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
> @@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
>    * @return list of file status entries
>    */
>   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
> -    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
> -    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
> +    boolean recursive = false;
>      if (!recursive) {
>        return fs.listStatus(p);
>      }
> {code}
> And re-ran my query to compare timings.
> || ||Before||After||
> |Cumulative CPU|731.07 sec|386.0 sec|
> |Total time|347.66 sec|218.855 sec|
>
> The query used was
> {code}
> INSERT OVERWRITE LOCAL DIRECTORY '/grid/0/smb/'
> select inv_item_sk
> from
>   inventory inv
>   join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
> limit 100000
> ;
> {code}
> It was run on a scale=2 TPC-DS data set, where both store_sales and inventory
> are bucketed into 4 buckets, with store_sales split into 7 partitions and
> inventory into 261 partitions.
> 78% of all CPU time was spent within new HiveConf(). The YourKit profiler
> runs are attached.
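
The edit in the description hard-codes the flag to false, which confirms the cost but drops the configurable behaviour. One possible direction, sketched here under stated assumptions and not taken from any actual patch, is to read the flag once and reuse it for every later call. In this self-contained sketch, ExpensiveConf and Fetcher are hypothetical stand-ins for HiveConf and FetchOperator, a plain map stands in for the job configuration, and the property key is an assumed name used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class CachedRecursiveFlag {

    /** Hypothetical stand-in for HiveConf: constructing it is the costly step. */
    static final class ExpensiveConf {
        // Counts how many times the costly constructor has run.
        static int constructions = 0;
        private final Map<String, String> values;

        ExpensiveConf(Map<String, String> jobProperties) {
            constructions++;
            // A real configuration object would parse files here; this copies a map.
            this.values = new HashMap<>(jobProperties);
        }

        boolean getBoolean(String key, boolean defaultValue) {
            String v = values.get(key);
            return v == null ? defaultValue : Boolean.parseBoolean(v);
        }
    }

    /** Hypothetical stand-in for an operator that lists files once per partition. */
    static final class Fetcher {
        private final Map<String, String> job;
        // null until the first lookup, then reused for every later call.
        private Boolean recursive;

        Fetcher(Map<String, String> job) {
            this.job = job;
        }

        boolean isRecursive() {
            if (recursive == null) {
                // Assumed property key, for illustration only.
                recursive = new ExpensiveConf(job)
                        .getBoolean("mapred.input.dir.recursive", false);
            }
            return recursive;
        }
    }

    public static void main(String[] args) {
        Map<String, String> job = new HashMap<>();
        job.put("mapred.input.dir.recursive", "true");

        Fetcher fetcher = new Fetcher(job);
        // One lookup per partition, as in the 261-partition inventory table.
        for (int partition = 0; partition < 261; partition++) {
            fetcher.isRecursive();
        }
        System.out.println("constructions = " + ExpensiveConf.constructions);
    }
}
```

Run as written, the loop makes 261 lookups but constructs the configuration object only once. The sketch is not thread-safe; an operator shared between threads would need a final field set in its constructor or some synchronization.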