Hi,

I regenerated the entire base data and added the following configuration
changes to hive-site.xml and mapred-site.xml. The same query now runs with
multiple mappers and reducers. I am still not sure how to use the ORC file
stripe size to increase the number of mappers.
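
To reason about the stripe size question, here is a rough Python sketch. It
is not Hive's actual split logic. It assumes that a split cannot usefully be
smaller than one ORC stripe and that stripes on disk are about the configured
size. The function name estimate_mappers is hypothetical. The 7 GB file size
and the 256000000 max split size come from this thread; the stripe sizes are
assumed values for illustration.

```python
import math


def estimate_mappers(file_bytes, max_split_bytes, stripe_bytes):
    """Rough upper estimate of mappers doing useful work on one ORC file.

    Assumption: a split is effectively at least one stripe, so the
    effective split size is the larger of the two settings.
    """
    effective_split = max(max_split_bytes, stripe_bytes)
    return math.ceil(file_bytes / effective_split)


file_bytes = 7 * 1024**3          # ~7 GB fact table (from this thread)
max_split_bytes = 256_000_000     # mapred.max.split.size (from this thread)

# Assumed stripe sizes, for illustration only.
for stripe_mb in (256, 64):
    stripe_bytes = stripe_mb * 1024**2
    print(stripe_mb, estimate_mappers(file_bytes, max_split_bytes, stripe_bytes))
```

Under these assumptions, a smaller stripe only helps until the max split size
becomes the limit. After that, lowering mapred.max.split.size is what would
raise the mapper count.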

Is there any other performance optimization that I could apply? Please
advise.

=======================================
mapred-site.xml and hive-site.xml changes:
---------------------------------------------------------------
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024M</value>
  <description>setting memory for child jobs</description>
</property>
=======================================

=======================================
hive run-time configuration:
---------------------------------------
"set mapred.reduce.tasks=4"
"set hive.auto.convert.join=true"
=======================================


Thanks and Regards,
Gourav Sengupta



On Thu, Oct 10, 2013 at 9:16 AM, Gourav Sengupta <gourav.had...@gmail.com> wrote:

> Hi,
>
> The entire table of 34 million records is in a single ORC file, which is
> around 7 GB in size. The other ORC file is a dimension table of less than
> 40 MB, once again in a single ORC file.
>
> I do not remember setting the ORC file stripe size anywhere.
>
> The problem I am facing is that the query triggers only a single
> mapper, even though the cluster has three nodes. Unlike other posts here,
> I need more mappers.
>
> The other properties you asked about, taken from the job XML file, are:
>
> <property><name>mapred.min.split.size.per.node</name><value>1</value></property>
> and
>
> <property><name>mapred.max.split.size</name><value>256000000</value></property>
>
> I am sure that there is no issue with the Hadoop configuration, as some
> other queries get more than 24 mappers.
>
> Please accept my sincere regards for your kind help and insights.
>
>
> Thanks,
> Gourav Sengupta
>
>
>
> On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com> wrote:
>
>> What is your ORC file stripe size? How many ORC files are there in each
>> of the tables? It could be possible that ORC compressed the file so much
>> that the file size is less than the HDFS block size. Can you please report
>> the file size of the two ORC files?
>>
>> Another possibility is that there are many small files. In that case, by
>> default Hive uses CombineHiveInputFormat, which combines many small files
>> into a single split. Hence you will see fewer mappers. If you
>> are expecting one mapper per HDFS file, then try disabling
>> CombineHiveInputFormat by "set
>> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;". Another
>> way to control the number of mappers is by adjusting the min and max split
>> size.
>>
>> Thanks
>> Prasanth Jayachandran
>>
>> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>> > What is the size of the table (in GB)?
>> >
>> > What max and min split sizes have you provided?
>> >
>> >
>> > On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <gourav.had...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I am trying to run a join using two tables stored in ORC file format.
>> >>
>> >> The first table has 34 million records and the second has around
>> >> 300,000 records.
>> >>
>> >> Setting "set hive.auto.convert.join=true" makes the entire query run
>> >> via a single mapper.
>> >> With "set hive.auto.convert.join=false" there are two mappers: the
>> >> first one reads the second table, and then the entire large table
>> >> goes through the second mapper.
>> >>
>> >> Is there something that I am doing wrong? There are currently three
>> >> nodes in the Hadoop cluster, and I was expecting at least 6 mappers
>> >> to be used.
>> >>
>> >> Thanks and Regards,
>> >> Gourav
>> >>
>> >
>> >
>> >
>> > --
>> > Nitin Pawar
>>
>>
>>
>
>
