Vishal Kamath created HIVE-7661:
-----------------------------------

             Summary: Observed performance issues while sorting using Hive's 
Parallel Order by clause while retaining pre-existing sort order.
                 Key: HIVE-7661
                 URL: https://issues.apache.org/jira/browse/HIVE-7661
             Project: Hive
          Issue Type: Bug
          Components: Logical Optimizer
    Affects Versions: 0.12.0
         Environment: Cloudera 5.0
hive-0.12.0-cdh5.0.0
Red Hat Linux
            Reporter: Vishal Kamath
             Fix For: 0.12.1


Improve Hive's sampling logic to accommodate use cases that require to retain 
the pre existing sort in the underlying source table. 

In order to support Parallel order by clause, Hive Samples the source table 
based on values provided to hive.optimize.sampling.orderby.number and 
hive.optimize.sampling.orderby.percent. 

This does work with reasonable performance when sorting is performed on a 
columns having random distribution of data but has severe performance issues 
when retaining the sort order. 

Let us try to understand this with an example. 

insert overwrite table lineitem_temp_report 
select 
  l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, 
l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, 
l_receiptdate, l_shipinstruct, l_shipmode, l_comment
from 
  lineitem
order by l_orderkey, l_partkey, l_suppkey;

Sample data set for lineitem table. The first column represents the l_orderKey 
and is sorted.
 
l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment
197|1771022|96040|2|8|8743.52|0.09|0.02|A|F|1995-04-17|1995-07-01|1995-0
197|1771022|96040|2|8|4-27|DELIVER IN PERSON|SHIP|y blithely even 
197|1771022|96040|2|8|deposits. blithely fina|
197|1558290|83306|3|17|22919.74|0.06|0.02|N|O|1995-08-02|1995-06-23|1995
197|1558290|83306|3|17|-08-03|COLLECT COD|REG AIR|ts. careful|
197|179355|29358|4|25|35858.75|0.04|0.01|N|F|1995-06-13|1995-05-23|1995-
197|179355|29358|4|25|06-24|TAKE BACK RETURN|FOB|s-- quickly final 
197|179355|29358|4|25|accounts|
197|414653|39658|5|14|21946.82|0.09|0.01|R|F|1995-05-08|1995-05-24|1995-
197|414653|39658|5|14|05-12|TAKE BACK RETURN|RAIL|use slyly slyly silent 
197|414653|39658|5|14|depo|
197|1058800|8821|6|1|1758.75|0.07|0.05|N|O|1995-07-15|1995-06-21|1995-08
197|1058800|8821|6|1|-11|COLLECT COD|RAIL| even, thin dependencies sno|
198|560609|60610|1|33|55096.14|0.07|0.02|N|O|1998-01-05|1998-03-20|1998-
198|560609|60610|1|33|01-10|TAKE BACK RETURN|TRUCK|carefully caref|
198|152287|77289|2|20|26785.60|0.03|0.00|N|O|1998-01-15|1998-03-31|1998-
198|152287|77289|2|20|01-25|DELIVER IN PERSON|FOB|carefully final 
198|152287|77289|2|20|escapades a|
224|1899665|74720|3|41|68247.37|0.07|0.04|A|F|1994-09-01|1994-09-15|1994
224|1899665|74720|3|41|-09-02|TAKE BACK RETURN|SHIP|after the furiou|


When we try to either sort on a presorted column or do a multi-column sort 
while trying to retain the sort order on the source table,

Source table "lineitem" has 600 million rows. 

We don't see equal distribution of data to the reducers. Out of 100 reducers, 
99 complete in less than 40 seconds. The last reducer is doing the bulk of the 
work processing nearly 570 million rows. 

So, let us understand what is going wrong here ..

on a table having 600 million records with orderkey column sorted, i created 
temp table with 10% sampling.  

insert overwrite table sampTempTbl (select * from lineitem tablesample (10 
percent) t);

select min(l_orderkey), max(l_orderkey) from sampTempTbl ;
12306309,    142321700

where as on the source table, the orderkey range (select min(l_orderkey), 
max(l_orderkey) from lineitem)  is 1 and 600000000  

So naturally bulk of the records will be directed towards single reducer. 

One way to work around this problem is to increase the 
hive.optimize.sampling.orderby.number to a larger value (as close as the # rows 
in the input source table). But then we will have to provide higher heap 
(hive-env.sh) for hive, otherwise it will fail while creating the Sampling 
Data. With larger data volume, it is not practical to sample the entire data 
set. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to