On Fri, Jun 11, 2010 at 7:09 PM, Ashish Thusoo <[email protected]> wrote:

>  +1 to that. That should help, provided you are running Hadoop 0.20.
>
> Ashish
>
>  ------------------------------
> From: wd [mailto:[email protected]]
> Sent: Thursday, June 10, 2010 11:36 PM
> To: [email protected]
> Subject: Re: Dealing with large number of partitions
>
> Try set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> before you run your query; this may help.
>
>
>
> 2010/6/11 Sammy Yu <[email protected]>
>
>> Hi,
>>    I am having an issue with a large number of partitions (about 4000, each
>> consisting of very small files, <10k).  Any queries that involve these
>> partitions take an extremely long time to complete (10+ hours).  I was
>> wondering if there is any easy way in Hive to improve performance without
>> having to merge the files.  I can see the MapReduce jobs are taking a
>> long time because there are so many separate raw data files
>> that need to be read.  I saw that HIVE-1332 dealt with using HAR files for
>> partitioning.  Could this perhaps help performance rather than hurt it,
>> given that the queries will be using all the partitions in the HAR file?
>>
>> Thanks,
>> Sammy
>>
>
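For what it is worth, wd's suggestion above can be paired with split-size
settings so that the many small files get combined into far fewer map tasks.
A minimal sketch, assuming Hadoop 0.20; the table name is made up and the
byte values are only illustrative starting points, not from this thread:

  set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  -- illustrative split sizes in bytes; tune for your cluster
  set mapred.max.split.size=256000000;
  set mapred.min.split.size.per.node=128000000;
  set mapred.min.split.size.per.rack=128000000;
  SELECT count(1) FROM your_table WHERE dt >= '2010-06-01';
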
Unfortunately, I have had the same issue. When you have that many partitions,
it is the query planning phase, not the query execution phase, that takes a
long time. I ran into an identical problem; connection pooling has since been
added to trunk (http://osdir.com/ml/hive-dev-hadoop-apache/2010-05/msg00023.html).
Without the connection pooling, queries against tables with too many
partitions will eventually fail.

The only thing you can do to speed up the query planning phase is beef up
your metastore: follow whatever tuning advice is out there for your backend
(Derby/MySQL).
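
Roughly, both the backend and the connection pooling mentioned above are
configured in hive-site.xml. A sketch, assuming a MySQL metastore; the host,
database name, and credentials are placeholders, and the pooling property is
the DataNucleus setting as I remember it, so check the hive-default.xml that
ships with your version:

  <!-- point the metastore at MySQL instead of the embedded Derby -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
  <!-- assumed name of the DataNucleus knob that enables connection pooling -->
  <property>
    <name>datanucleus.connectionPoolingType</name>
    <value>DBCP</value>
  </property>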

Other than that, try to eliminate over-partitioning if possible. You really
do not want small partitions, just as you do not want small files.
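
If you can change the layout, consolidating small partitions into coarser
ones is usually the biggest win. Here is a rough sketch; the table and column
names (logs_hourly, logs_daily, line, dt) are hypothetical, and it assumes an
hourly-partitioned source table being folded into a daily-partitioned one:

  -- hypothetical coarser-grained target table
  CREATE TABLE logs_daily (line STRING)
  PARTITIONED BY (dt STRING);

  -- fold one day's worth of small hourly partitions into a single
  -- daily partition, so queries touch far fewer partitions and files
  INSERT OVERWRITE TABLE logs_daily PARTITION (dt='2010-06-10')
  SELECT line
  FROM logs_hourly
  WHERE dt = '2010-06-10';

Each day then becomes one partition instead of 24, and the INSERT OVERWRITE
typically also rewrites the many small input files into far fewer, larger ones.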
