[jira] [Updated] (HIVE-4486) FetchOperator slows down SMB map joins by 50% when there are many partitions

Gopal V (JIRA) Fri, 03 May 2013 07:04:17 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gopal V updated HIVE-4486:
--------------------------

    Description: 
While looking at log files for SMB joins in Hive, I noticed that the 
actual join operator did not account for a significant fraction of the time 
spent. Most of the time went into parsing configuration files.

To confirm, I put log lines in the HiveConf constructor and eventually made the 
following edit to the code:

{code}
--- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
+++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
@@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
    * @return list of file status entries
    */
   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
-    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
-    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
+    boolean recursive = false;
     if (!recursive) {
       return fs.listStatus(p);
     }
{code}

I then re-ran my query to compare timings:

|| ||Before||After||
|Cumulative CPU|731.07 sec|386.0 sec|
|Total time|347.66 sec|218.855 sec|
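As a quick check on the figures in the table, the relative savings can be worked out directly. The sketch below only does that arithmetic; the class and method names are made up for illustration and are not part of Hive. With the numbers above it gives a reduction of roughly 47% in cumulative CPU and roughly 37% in total time.

```java
// Relative savings implied by the timing table (figures copied from the table).
// TimingComparison and percentReduction are illustrative names, not Hive code.
final class TimingComparison {
    /** Percentage by which `after` is smaller than `before`. */
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("Cumulative CPU reduced by %.1f%%%n",
                percentReduction(731.07, 386.0));
        System.out.printf("Total time reduced by %.1f%%%n",
                percentReduction(347.66, 218.855));
    }
}
```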

The query used was:

{code}INSERT OVERWRITE LOCAL DIRECTORY
'/grid/0/smb/'
select inv_item_sk
from
     inventory inv
     join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
limit 100000
;
{code}

This was run on a scale=2 TPC-DS data set, where both store_sales and inventory 
are bucketed into 4 buckets, with store_sales split into 7 partitions and 
inventory into 261 partitions.

78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs 
are attached.
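The edit above hard-codes the flag purely as an experiment. One possible direction, sketched here under assumptions and not taken from any actual patch, is to read the setting once and reuse the cached value on every listing call. CostlyConf and Lister are hypothetical stand-ins so that the example runs without Hadoop or Hive on the classpath; the construction counter exists only to make the effect visible.

```java
// A minimal sketch: read an expensive-to-build configuration once, then reuse
// the cached flag. All names here are hypothetical stand-ins, not Hive classes.
final class CachedRecursiveFlagSketch {
    /** Stand-in for a configuration object that is costly to construct. */
    static final class CostlyConf {
        static int constructions = 0; // counts how often a config was built
        private final boolean recursive;

        CostlyConf(boolean recursive) {
            constructions++; // a real configuration would parse files here
            this.recursive = recursive;
        }

        boolean isRecursive() {
            return recursive;
        }
    }

    /** Stand-in lister that reads the flag once, at construction time. */
    static final class Lister {
        private final boolean recursive;

        Lister(CostlyConf conf) {
            this.recursive = conf.isRecursive();
        }

        // Called once per directory; no configuration object is built here.
        String listStatusUnderPath(String path) {
            return (recursive ? "recursive:" : "flat:") + path;
        }
    }

    public static void main(String[] args) {
        Lister lister = new Lister(new CostlyConf(false));
        // 261 mirrors the inventory partition count quoted in the description.
        for (int partition = 0; partition < 261; partition++) {
            lister.listStatusUnderPath("/inventory/part-" + partition);
        }
        System.out.println("configs constructed: " + CostlyConf.constructions);
    }
}
```

The point of the design is that the cost of building the configuration is paid once per operator instead of once per listed directory, while the configured value is still honoured.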

  was:
The same description, except that the header row of the timing table was 
missing its leading empty cell:

||Before||After||

    
> FetchOperator slows down SMB map joins by 50% when there are many partitions
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-4486
>                 URL: https://issues.apache.org/jira/browse/HIVE-4486
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>         Environment: Ubuntu LXC 12.10
>            Reporter: Gopal V
>            Priority: Minor
>         Attachments: smb-profile.html
>
