If it turns out that you actually DO need a total order over a large data set, you can adapt the procedure documented here:
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

For a better sampling query pattern, check out the "Sampling Query for Range Partitioning" slide in this presentation:

http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

The row_sequence UDF it references is available in HIVE-1304.

What's missing (besides having Hive do all of this automatically)? Well, instead of writing to HiveHFileOutputFormat, you'd be writing your results to a normal Hive table, so at the end of the day you would need to figure out how to sequence the result files correctly. (The HBase bulk load script does this by opening each file and peeking at the header to get the key range.)

JVS

________________________________________
From: Zheng Shao [[email protected]]
Sent: Wednesday, May 12, 2010 10:32 AM
To: [email protected]
Subject: Re: why hive ignore my setting about reduce task number?

Do you need to get all the records in order? In most of our use cases, users are only interested in the top 100 or so. If you do LIMIT 100 together with ORDER BY, it will be much faster.

Sent from my iPhone

On May 12, 2010, at 1:54 PM, [email protected]<mailto:[email protected]> wrote:

Thanks, Ted. If I have very big data to sort, a single reduce task will have performance problems. Does Hive have some way to optimize this? I have observed that the reduce task is very slow in my job.
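The two approaches discussed in this thread can be sketched in HiveQL. This is a minimal illustration, not code from the thread; the table and column names (`web_logs`, `ts`) and the reducer count are hypothetical, and a true total order additionally requires range-based distribution using sampled key boundaries, as the slides referenced above describe:

```sql
-- Zheng's suggestion: if only the top N rows matter, ORDER BY with LIMIT
-- is much cheaper than a full ORDER BY, because each map task keeps only
-- its local top N before the single final reducer merges them.
SELECT * FROM web_logs ORDER BY ts DESC LIMIT 100;

-- For a large sort across many reducers: ORDER BY forces one reducer in
-- Hive of this era (which is why the reduce-task setting gets ignored),
-- but SORT BY sorts within each reducer and honors the setting.
SET mapred.reduce.tasks = 32;
SELECT * FROM web_logs
DISTRIBUTE BY ts   -- hash distribution: each output file is sorted, but
                   -- file key ranges overlap; for a total order, replace
                   -- this with a bucketing expression derived from sampled
                   -- range boundaries (the sampling-query pattern above)
SORT BY ts;
```

With range-based distribution the result files hold sorted, non-overlapping key ranges, leaving only the file-sequencing problem JVS describes.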
