Re: Dealing with large number of partitions

Edward Capriolo Sat, 12 Jun 2010 08:12:52 -0700

On Sat, Jun 12, 2010 at 10:50 AM, Edward Capriolo <[email protected]> wrote:


>
> On Fri, Jun 11, 2010 at 7:09 PM, Ashish Thusoo <[email protected]> wrote:
>
>>  +1 to that. That should help provided you are running hadoop 0.20 ..
>>
>> Ashish
>>
>>  ------------------------------
>> *From:* wd [mailto:[email protected]]
>> *Sent:* Thursday, June 10, 2010 11:36 PM
>> *To:* [email protected]
>> *Subject:* Re: Dealing with large number of partitions
>>
>> Try running
>> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>> before your query; this may help.
>>
>>
>>
>> 2010/6/11 Sammy Yu <[email protected]>
>>
>>> Hi,
>>>    I am having an issue with a large number of partitions (4000, each
>>> being very small, <10k files).  Any queries that involve these
>>> partitions take an extremely long time to complete (10+ hours).  I was
>>> wondering if there is any easy way in Hive, without having to merge the
>>> files, to improve performance.  I can see the map reduce jobs are taking a
>>> long time because there are so many separate raw data files
>>> that need to be read.  I saw that HIVE-1332 dealt with using HAR files for
>>> partitioning.  Could this perhaps help performance rather than hurt it,
>>> given that the queries will be using all the partitions in the HAR file?
>>>
>>> Thanks,
>>> Sammy
>>>
>>>
>>>
>>>
>>>
>>>
>>
> Unfortunately, I have/had the same issue. When you have that many
> partitions, the query planning phase, NOT the query execution phase, takes a
> long time. I had an identical problem; connection pooling has been
> added to trunk:
> http://osdir.com/ml/hive-dev-hadoop-apache/2010-05/msg00023.html
> Without the connection pooling, queries against tables with too many
> partitions will eventually fail.
>
> The only thing you can do to speed up the query planning phase is beef up
> your metastore: follow whatever tuning advice is out there for your
> backend (derby/mysql).
>
> Other than that, try to eliminate over-partitioning if possible. You really
> do not want small partitions, just like you do not want small files.
>


A simple way to tell whether the slowdown is in the planning phase or the
execution phase is to EXPLAIN the query. If the explain takes a long time,
try the connection pooling and updating your metastore.

But if you run the query and reach the map/reduce phase quickly, e.g.

2010-06-12 11:07:52,603 Stage-1 map = 0%,  reduce = 0%

then the metastore/backend is not the slowdown.
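
If you want to script that check, here is a rough sketch in Python. It
assumes the `hive` CLI is on your PATH and uses its `-e` option to run
EXPLAIN on a query, then times it. The helper names and the 60-second
threshold are made up for illustration; pick whatever cutoff makes sense
for your cluster.

```python
import subprocess
import time


def explain_argv(query, hive_bin="hive"):
    """Build the argv that runs EXPLAIN on `query` through `hive -e`."""
    # Strip surrounding whitespace and a trailing semicolon so the
    # statement can be prefixed with EXPLAIN.
    return [hive_bin, "-e", "EXPLAIN " + query.strip().rstrip(";")]


def timed_run(argv):
    """Run a command, discard its stdout, return elapsed seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.monotonic() - start


def likely_bottleneck(explain_seconds, threshold_seconds=60.0):
    """Guess where the time goes.

    The 60 s default is an arbitrary assumption, not a Hive constant.
    """
    if explain_seconds > threshold_seconds:
        return "planning (metastore)"
    return "execution (map/reduce)"


# Example usage (needs a working Hive install; `my_table` is hypothetical):
#   elapsed = timed_run(explain_argv("SELECT count(1) FROM my_table"))
#   print(likely_bottleneck(elapsed))
```

A slow EXPLAIN points at the metastore side (connection pooling, backend
tuning); a fast one means the time is going into the map/reduce jobs
themselves.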
