Great, thanks for the help guys, the combination of both suggestions really
helped.

On Sat, Jun 12, 2010 at 8:12 AM, Edward Capriolo <[email protected]>wrote:

>
>
> On Sat, Jun 12, 2010 at 10:50 AM, Edward Capriolo 
> <[email protected]>wrote:
>
>>
>> On Fri, Jun 11, 2010 at 7:09 PM, Ashish Thusoo <[email protected]>wrote:
>>
>>> +1 to that. That should help, provided you are running Hadoop 0.20.
>>>
>>> Ashish
>>>
>>>  ------------------------------
>>> *From:* wd [mailto:[email protected]]
>>> *Sent:* Thursday, June 10, 2010 11:36 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Dealing with large number of partitions
>>>
>>> Try setting
>>> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>>> before you run the query; this may help.
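>>>
>>> A minimal sketch of how that might look in a Hive CLI session. The table
>>> and partition column names (page_views, dt) are hypothetical placeholders
>>> and do not come from this thread:
>>>
>>> ```sql
>>> -- Ask Hive to combine many small files into fewer input splits.
>>> SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>>>
>>> -- Hypothetical query that touches many small partitions.
>>> SELECT COUNT(1) FROM page_views WHERE dt >= '2010-01-01';
>>> ```
>>>
>>> The setting only applies to the current session, so it has to be issued
>>> before the query each time unless it is added to the site configuration.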
>>>
>>>
>>>
>>> 2010/6/11 Sammy Yu <[email protected]>
>>>
>>>> Hi,
>>>>    I am having an issue with a large number of partitions (4000, each
>>>> being very small, <10k files).  Any queries that involve these
>>>> partitions take an extremely long time to complete (10+ hours).  I was
>>>> wondering if there is an easy way in Hive to improve performance without
>>>> having to merge the files.  I can see the map reduce jobs are taking a
>>>> long time because there are so many separate raw data files
>>>> that need to be read.  I saw that HIVE-1332 dealt with using HAR files for
>>>> partitioning.  Could this perhaps help performance rather than hurt it,
>>>> given that the queries will be using all the partitions in the har file?
>>>>
>>>> Thanks,
>>>> Sammy
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>> Unfortunately, I have/had the same issue. When you have that many
>> partitions, the query planning phase, NOT the query execution phase, takes
>> a long time. This is identical to a problem I had. Connection pooling has
>> been added to trunk:
>> http://osdir.com/ml/hive-dev-hadoop-apache/2010-05/msg00023.html
>> Without the connection pooling, queries against tables with too many
>> partitions will eventually fail.
>>
>> The only thing you can do to speed up the query planning phase is beef up
>> your metastore: follow whatever tuning advice is out there for your
>> backend (Derby/MySQL).
>>
>> Other than that, try to eliminate over-partitioning if possible. You really
>> do not want small partitions, just like you do not want small files.
>>
>
>
> So a simple way to tell whether the slowdown is in the planning phase or the
> execution phase is to EXPLAIN the query. If the explain takes a long time,
> try the connection pooling and updating your metastore.
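>
> A minimal sketch of that check, assuming a hypothetical table page_views
> partitioned by dt (neither name is from this thread):
>
> ```sql
> -- EXPLAIN compiles and plans the query but does not launch a map/reduce job.
> -- If this statement alone is slow, suspect the planning phase / metastore.
> EXPLAIN
> SELECT COUNT(1) FROM page_views WHERE dt >= '2010-01-01';
> ```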
>
> But if you run the query and reach the map/reduce phase quickly, e.g.:
> 2010-06-12 11:07:52,603 Stage-1 map = 0%,  reduce = 0%
>
> then the metastore/backend is not the slowdown.
>
>
>
>
>


-- 
Chief Architect, BrightEdge
email: [email protected]   |   mobile: 650.539.4867  |   fax: 650.521.9678
 |  address: 1850 Gateway Dr Suite 400, San Mateo, CA 94404
