If it turns out that you actually DO need a total order over a large data set, you can adapt the procedure documented here:
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

For a better sampling query pattern, check out the "Sampling Query for Range Partitioning" slide in this presentation:

http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

The row_sequence UDF it references is available in HIVE-1304.

What's missing (besides having Hive do all of this automatically)? Well, instead of writing to HiveHFileOutputFormat, you'd be writing your results to a normal Hive table, so at the end of the day you would need to figure out how to sequence the result files correctly. (The HBase bulk load script does this by opening each file and peeking at the header to get the key range.)

JVS

________________________________________
From: Zheng Shao [[email protected]]
Sent: Wednesday, May 12, 2010 10:32 AM
To: [email protected]
Subject: Re: why hive ignore my setting about reduce task number?

Do you need to get all the records in order? In most of our use cases, users are only interested in the top 100 or so. If you do LIMIT 100 together with ORDER BY, it will be much faster.

Sent from my iPhone

On May 12, 2010, at 1:54 PM, [email protected]<mailto:[email protected]> wrote:

Thanks, Ted. If I have very big data to sort, a single reduce task will have performance problems. Does Hive have some way to optimize this? I have observed that the reduce task is very slow in my job.
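The two approaches discussed in this thread can be sketched in HiveQL. This is a minimal illustration, not code from the thread; the table and column names (`web_logs`, `ts`) and the reducer count are hypothetical, and a true total order additionally requires range-based distribution using sampled key boundaries, as the slides referenced above describe:

```sql
-- Zheng's suggestion: if only the top N rows matter, ORDER BY with LIMIT
-- is much cheaper than a full ORDER BY, because each map task keeps only
-- its local top N before the single final reducer merges them.
SELECT * FROM web_logs ORDER BY ts DESC LIMIT 100;

-- For a large sort across many reducers: ORDER BY forces one reducer in
-- Hive of this era (which is why the reduce-task setting gets ignored),
-- but SORT BY sorts within each reducer and honors the setting.
SET mapred.reduce.tasks = 32;
SELECT * FROM web_logs
DISTRIBUTE BY ts   -- hash distribution: each output file is sorted, but
                   -- file key ranges overlap; for a total order, replace
                   -- this with a bucketing expression derived from sampled
                   -- range boundaries (the sampling-query pattern above)
SORT BY ts;
```

With range-based distribution the result files hold sorted, non-overlapping key ranges, leaving only the file-sequencing problem JVS describes.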
