[ https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654668#comment-13654668 ]
Gopal V commented on HIVE-4486:
-------------------------------

Closed that (wrong diff), opened as https://reviews.apache.org/r/11048/diff/ instead.

> FetchOperator slows down SMB map joins by 50% when there are many partitions
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-4486
>                 URL: https://issues.apache.org/jira/browse/HIVE-4486
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>         Environment: Ubuntu LXC 12.10
>            Reporter: Gopal V
>            Priority: Minor
>         Attachments: HIVE-4486.patch, smb-profile.html
>
>
> While looking at log files for SMB joins in hive, it was noticed that the
> actual join op didn't show up as a significant fraction of the time spent.
> Most of the time was spent parsing configuration files.
> To confirm, I put log lines in the HiveConf constructor and eventually made
> the following edit to the code
> {code}
> --- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
> +++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
> @@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
>     * @return list of file status entries
>     */
>    private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
> -    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
> -    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
> +    boolean recursive = false;
>      if (!recursive) {
>        return fs.listStatus(p);
>      }
> {code}
> And re-ran my query to compare timings.
> || ||Before||After||
> |Cumulative CPU|731.07 sec|386.0 sec|
> |Total time|347.66 sec|218.855 sec|
>
> The query used was
> {code}INSERT OVERWRITE LOCAL DIRECTORY
> '/grid/0/smb/'
> select inv_item_sk
> from
> inventory inv
> join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
> limit 100000
> ;
> {code}
> On a scale=2 tpcds data-set, where both store_sales & inventory are bucketed
> into 4 buckets, with store_sales split into 7 partitions and inventory into
> 261 partitions.
> 78% of all CPU time was spent within new HiveConf(). The yourkit profiler
> runs are attached.
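
For illustration only, here is a minimal sketch of the general idea behind the experiment above: read a configuration flag once and reuse it, rather than constructing a costly configuration object on every call. This is not the attached HIVE-4486.patch and it does not use Hive's API. CostlyConf, PerCallLister, CachedLister and constructionsFor are hypothetical stand-ins invented for this example, and the loop count of 261 merely echoes the partition count from the report; the real number of calls per query is not stated in the issue.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CachedFlagSketch {

    /** Hypothetical stand-in for a configuration object that is costly to construct. */
    static final class CostlyConf {
        static final AtomicInteger constructions = new AtomicInteger();
        private final boolean recursive;

        CostlyConf(boolean recursive) {
            constructions.incrementAndGet(); // models the cost of re-parsing config files
            this.recursive = recursive;
        }

        boolean isRecursive() {
            return recursive;
        }
    }

    /** Builds a new configuration object on every call (the pattern the diff removes). */
    static final class PerCallLister {
        boolean shouldRecurse() {
            return new CostlyConf(false).isRecursive();
        }
    }

    /** Reads the flag once at construction time and reuses the cached value. */
    static final class CachedLister {
        private final boolean recursive;

        CachedLister() {
            this.recursive = new CostlyConf(false).isRecursive();
        }

        boolean shouldRecurse() {
            return recursive;
        }
    }

    /** Returns how many CostlyConf objects were constructed while running the given work. */
    static int constructionsFor(Runnable work) {
        int before = CostlyConf.constructions.get();
        work.run();
        return CostlyConf.constructions.get() - before;
    }

    public static void main(String[] args) {
        final int calls = 261; // illustrative only

        int perCall = constructionsFor(() -> {
            PerCallLister lister = new PerCallLister();
            for (int i = 0; i < calls; i++) {
                lister.shouldRecurse();
            }
        });

        int cached = constructionsFor(() -> {
            CachedLister lister = new CachedLister();
            for (int i = 0; i < calls; i++) {
                lister.shouldRecurse();
            }
        });

        System.out.println("per-call constructions: " + perCall); // 261
        System.out.println("cached constructions: " + cached);    // 1
    }
}
```

Unlike hardcoding the flag to false, which was only done to confirm the profile, caching the value keeps the configured behaviour while paying the construction cost once per lister instead of once per call.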