Great, thanks for the help, guys. The combination of both suggestions really helped.
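
For the archives, a minimal sketch of how the two suggestions quoted in this thread fit together in a Hive session. The table name `page_views`, the partition column `dt`, and the date literal are hypothetical placeholders, not taken from the thread; only the `hive.input.format` setting is quoted from it.

```sql
-- Suggestion 1 (wd, with Ashish's caveat that it needs Hadoop 0.20):
-- combine many small files into fewer input splits for this session.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Suggestion 2 (Edward): EXPLAIN only plans the query, it does not run it.
-- If this statement alone is slow, the time is going into planning and
-- metastore lookups rather than into the map/reduce execution.
-- page_views and dt are placeholder names for illustration only.
EXPLAIN
SELECT COUNT(*)
FROM page_views
WHERE dt >= '2010-06-01';
```

If the `EXPLAIN` returns quickly but the real query is still slow, the small-file overhead in the execution phase is the more likely culprit, which is the case the first setting targets.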

On Sat, Jun 12, 2010 at 8:12 AM, Edward Capriolo <[email protected]> wrote:

>
> On Sat, Jun 12, 2010 at 10:50 AM, Edward Capriolo <[email protected]> wrote:
>
>>
>> On Fri, Jun 11, 2010 at 7:09 PM, Ashish Thusoo <[email protected]> wrote:
>>
>>> +1 to that. That should help provided you are running hadoop 0.20 ..
>>>
>>> Ashish
>>>
>>> ------------------------------
>>> *From:* wd [mailto:[email protected]]
>>> *Sent:* Thursday, June 10, 2010 11:36 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Dealing with large number of partitions
>>>
>>> Try setting
>>> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>>> before you query; this may help.
>>>
>>> 2010/6/11 Sammy Yu <[email protected]>
>>>
>>>> Hi,
>>>> I am having an issue with a large number of partitions (4000, each
>>>> being very small, <10k files). Any queries that I do which involve
>>>> these partitions take an extremely long time to complete (10+ hours).
>>>> I was wondering if there was any easy way in hive, without having to
>>>> merge the files, to improve its performance. I can see the map reduce
>>>> jobs are taking a long time due to the fact that there are so many
>>>> separate raw data files that need to be read. I saw that HIVE-1332
>>>> dealt with using HAR files for partitioning. Could this perhaps help
>>>> performance rather than hurt it, given that the queries will be using
>>>> all the partitions in the har file?
>>>>
>>>> Thanks,
>>>> Sammy
>>>>
>>>
>> Unfortunately, I have/had the same issue. When you have that many
>> partitions, the query planning phase, NOT the query execution phase,
>> takes a long time. This is identical to a problem I had; connection
>> pooling has been added to trunk:
>> http://osdir.com/ml/hive-dev-hadoop-apache/2010-05/msg00023.html.
>> Without the connection pooling, queries against tables with too many
>> partitions will eventually fail.
>>
>> The only thing you can do to speed up the query planning phase is beef
>> up your metastore: follow whatever tuning for your backend (derby/mysql)
>> is out there.
>>
>> Other than that, try to eliminate over-partitioning if possible. You
>> really do not want small partitions, just like you do not want small
>> files.
>>
>
> So a simple way to tell if the slowdown is in the planning phase or the
> execution phase is to explain the query. If the explain takes a long
> time, try the connection pooling and updating your metastore.
>
> But if you run the query and reach the map/reduce phase quickly:
> 2010-06-12 11:07:52,603 Stage-1 map = 0%, reduce = 0%
>
> then the metastore/backend is not the slowdown.
>

--
Chief Architect, BrightEdge
email: [email protected] | mobile: 650.539.4867 | fax: 650.521.9678
address: 1850 Gateway Dr Suite 400, San Mateo, CA 94404
