[ 
https://issues.apache.org/jira/browse/HIVE-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257922#comment-14257922
 ] 

Navis commented on HIVE-7661:
-----------------------------

[~vishal.kamath] Booked an issue for split sampling described above. Wish it's 
helpful.

> Observed performance issues while sorting using Hive's Parallel Order by 
> clause while retaining pre-existing sort order.
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7661
>                 URL: https://issues.apache.org/jira/browse/HIVE-7661
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 0.12.0
>         Environment: Cloudera 5.0
> hive-0.12.0-cdh5.0.0
> Red Hat Linux
>            Reporter: Vishal Kamath
>              Labels: performance
>             Fix For: 0.12.1
>
>
> Improve Hive's sampling logic to accommodate use cases that require to retain 
> the pre existing sort in the underlying source table. 
> In order to support Parallel order by clause, Hive Samples the source table 
> based on values provided to hive.optimize.sampling.orderby.number and 
> hive.optimize.sampling.orderby.percent. 
> This does work with reasonable performance when sorting is performed on a 
> columns having random distribution of data but has severe performance issues 
> when retaining the sort order. 
> Let us try to understand this with an example. 
> insert overwrite table lineitem_temp_report 
> select 
>   l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, 
> l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, 
> l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
> from 
>   lineitem
> order by l_orderkey, l_partkey, l_suppkey;
> Sample data set for lineitem table. The first column represents the 
> l_orderKey and is sorted.
>  
> l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment
> 197|1771022|96040|2|8|8743.52|0.09|0.02|A|F|1995-04-17|1995-07-01|1995-0
> 197|1771022|96040|2|8|4-27|DELIVER IN PERSON|SHIP|y blithely even 
> 197|1771022|96040|2|8|deposits. blithely fina|
> 197|1558290|83306|3|17|22919.74|0.06|0.02|N|O|1995-08-02|1995-06-23|1995
> 197|1558290|83306|3|17|-08-03|COLLECT COD|REG AIR|ts. careful|
> 197|179355|29358|4|25|35858.75|0.04|0.01|N|F|1995-06-13|1995-05-23|1995-
> 197|179355|29358|4|25|06-24|TAKE BACK RETURN|FOB|s-- quickly final 
> 197|179355|29358|4|25|accounts|
> 197|414653|39658|5|14|21946.82|0.09|0.01|R|F|1995-05-08|1995-05-24|1995-
> 197|414653|39658|5|14|05-12|TAKE BACK RETURN|RAIL|use slyly slyly silent 
> 197|414653|39658|5|14|depo|
> 197|1058800|8821|6|1|1758.75|0.07|0.05|N|O|1995-07-15|1995-06-21|1995-08
> 197|1058800|8821|6|1|-11|COLLECT COD|RAIL| even, thin dependencies sno|
> 198|560609|60610|1|33|55096.14|0.07|0.02|N|O|1998-01-05|1998-03-20|1998-
> 198|560609|60610|1|33|01-10|TAKE BACK RETURN|TRUCK|carefully caref|
> 198|152287|77289|2|20|26785.60|0.03|0.00|N|O|1998-01-15|1998-03-31|1998-
> 198|152287|77289|2|20|01-25|DELIVER IN PERSON|FOB|carefully final 
> 198|152287|77289|2|20|escapades a|
> 224|1899665|74720|3|41|68247.37|0.07|0.04|A|F|1994-09-01|1994-09-15|1994
> 224|1899665|74720|3|41|-09-02|TAKE BACK RETURN|SHIP|after the furiou|
> When we try to either sort on a presorted column or do a multi-column sort 
> while trying to retain the sort order on the source table,
> Source table "lineitem" has 600 million rows. 
> We don't see equal distribution of data to the reducers. Out of 100 reducers, 
> 99 complete in less than 40 seconds. The last reducer is doing the bulk of 
> the work processing nearly 570 million rows. 
> So, let us understand what is going wrong here ..
> on a table having 600 million records with orderkey column sorted, i created 
> temp table with 10% sampling.  
> insert overwrite table sampTempTbl (select * from lineitem tablesample (10 
> percent) t);
> select min(l_orderkey), max(l_orderkey) from sampTempTbl ;
> 12306309,    142321700
> where as on the source table, the orderkey range (select min(l_orderkey), 
> max(l_orderkey) from lineitem)  is 1 and 600000000  
> So naturally bulk of the records will be directed towards single reducer. 
> One way to work around this problem is to increase the 
> hive.optimize.sampling.orderby.number to a larger value (as close as the # 
> rows in the input source table). But then we will have to provide higher heap 
> (hive-env.sh) for hive, otherwise it will fail while creating the Sampling 
> Data. With larger data volume, it is not practical to sample the entire data 
> set. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to