On Sat, Jun 12, 2010 at 10:50 AM, Edward Capriolo <[email protected]> wrote:
>
> On Fri, Jun 11, 2010 at 7:09 PM, Ashish Thusoo <[email protected]> wrote:
>
>> +1 to that. That should help provided you are running hadoop 0.20 ..
>>
>> Ashish
>>
>> ------------------------------
>> *From:* wd [mailto:[email protected]]
>> *Sent:* Thursday, June 10, 2010 11:36 PM
>> *To:* [email protected]
>> *Subject:* Re: Dealing with large number of partitions
>>
>> Try
>> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>> before you query; this may help.
>>
>> 2010/6/11 Sammy Yu <[email protected]>
>>
>>> Hi,
>>> I am having an issue with a large number of partitions (4000, each
>>> being very small: <10k files). Any queries that I do which involve these
>>> partitions take an extremely long time to complete (10+ hours). I was
>>> wondering if there was any easy way in Hive, without having to merge the
>>> files, to improve its performance. I can see the map reduce jobs are
>>> taking a long time due to the fact that there are so many separate raw
>>> data files that need to be read. I saw that HIVE-1332 dealt with using
>>> HAR files for partitioning. Could this perhaps help performance rather
>>> than hurt it, given that the queries will be using all the partitions in
>>> the har file?
>>>
>>> Thanks,
>>> Sammy
>>
>
> Unfortunately, I have/had the same issue. When you have that many
> partitions, the query planning phase, NOT the query execution phase, takes
> a long time. This is identical to a problem I had; connection pooling has
> been added to trunk:
> http://osdir.com/ml/hive-dev-hadoop-apache/2010-05/msg00023.html
> Without the connection pooling, queries against tables with too many
> partitions will eventually fail.
>
> The only thing you can do to speed up the query planning phase is beef up
> your metastore: follow whatever tuning advice is out there for your
> backend (derby/mysql).
>
> Other than that, try to eliminate over-partitioning if possible. You really
> do not want small partitions, just like you do not want small files.
>

So a simple way to tell whether the slowdown is in the planning phase or the
execution phase is to explain the query. If the explain takes a long time,
try the connection pooling and updating your metastore. But if you run the
query and reach the map/reduce phase quickly, i.e. you soon see a line like

    2010-06-12 11:07:52,603 Stage-1 map = 0%, reduce = 0%

then the metastore/backend is not the slowdown.
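
To make that check concrete, here is a minimal sketch of the two steps in HiveQL. The table name page_views and the partition column dt are hypothetical placeholders, so substitute your own. The set line is simply the suggestion from earlier in this thread, which reportedly helps on hadoop 0.20. I have not timed this on the original poster's data.

```sql
-- Hypothetical table (page_views) and partition column (dt), for illustration only.

-- Optional: combine many small files into fewer map inputs, as suggested
-- earlier in the thread (said to help on hadoop 0.20).
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Step 1: plan only. No map/reduce job is launched here, so if this
-- statement alone takes a long time, the time is going into query planning
-- (metastore lookups for all the partitions).
EXPLAIN
SELECT COUNT(1) FROM page_views WHERE dt >= '2010-06-01';

-- Step 2: run the same query. If the "Stage-1 map = 0%, reduce = 0%" line
-- shows up quickly, planning is fine and the time is going into execution.
SELECT COUNT(1) FROM page_views WHERE dt >= '2010-06-01';
```

If step 1 is the slow one, look at the metastore and connection pooling. If step 1 is fast and step 2 is slow, the small files are the more likely culprit.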
