The smaller the stripe size, the smaller the HDFS block size will be, and the more mappers you will get. By default ORC chooses an HDFS block size of 2 times the stripe size.
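As a rough sketch of that arithmetic (a hypothetical helper, not Hive's actual split planner, and assuming one mapper per HDFS block with CombineHiveInputFormat out of the picture):

```python
import math

def expected_mappers(file_size_bytes, stripe_size_bytes):
    """Rough estimate only: ORC writes an HDFS block size of 2x the
    stripe size by default, and each block then becomes one split,
    i.e. one mapper (ignoring CombineHiveInputFormat)."""
    block_size = 2 * stripe_size_bytes
    return math.ceil(file_size_bytes / block_size)

# A 7 GB fact table with a 256 MB stripe size (512 MB blocks)
print(expected_mappers(7 * 1024**3, 256 * 1024**2))  # -> 14
```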
Thanks
Prasanth Jayachandran

On Oct 11, 2013, at 1:42 AM, Gourav Sengupta <gourav.had...@gmail.com> wrote:

> Hi,
>
> I regenerated the entire base data, and added the following configuration
> changes to hive-site.xml and mapred-site.xml, and now there are multiple
> mappers and reducers running for the same query. I am still not quite sure
> how to go about using the ORC file stripe size to increase the number of
> mappers.
>
> Is there any other performance optimization that I could have done? Please
> do advise.
>
> =======================================
> mapred-site.xml and hive-site.xml changes:
> ---------------------------------------------------------------
> <property>
>   <name>mapred.map.child.java.opts</name>
>   <value>-Xmx1024M</value>
> </property>
> <property>
>   <name>mapred.reduce.child.java.opts</name>
>   <value>-Xmx1024M</value>
> </property>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx1024M</value>
>   <description>setting memory for child jobs</description>
> </property>
> =======================================
>
> =======================================
> hive run-time configuration:
> ---------------------------------------
> "set mapred.reduce.tasks=4"
> "set hive.auto.convert.join=true"
> =======================================
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Thu, Oct 10, 2013 at 9:16 AM, Gourav Sengupta <gourav.had...@gmail.com> wrote:
>
>> Hi,
>>
>> The entire table of 34 million records is in a single ORC file, and it is
>> around 7 GB in size. The other ORC file is a dimension table with less
>> than 40 MB of records, once again in a single ORC file.
>>
>> I do not remember setting the ORC file stripe size anywhere.
>>
>> The problem that I am facing is that the query triggers only a single
>> mapper, though the cluster has three nodes. Unlike other posts here, I
>> need more mappers.
>>
>> The other relevant properties, taken from the job XML file, are:
>>
>> <property><name>mapred.min.split.size.per.node</name><value>1</value></property>
>> and
>> <property><name>mapred.max.split.size</name><value>256000000</value></property>
>>
>> I am sure that there is no issue with the Hadoop configuration, as with
>> some other queries I am getting more than 24 mappers.
>>
>> Please accept my sincere regards for your kind help and insights.
>>
>> Thanks,
>> Gourav Sengupta
>>
>> On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
>>
>>> What is your ORC file stripe size? How many ORC files are there in each
>>> of the tables? It is possible that ORC compressed the file so much that
>>> the file size is less than the HDFS block size. Can you please report
>>> the file sizes of the two ORC files?
>>>
>>> Another possibility is that there are many small files. In that case, by
>>> default, Hive uses CombineHiveInputFormat, which combines many small
>>> files into a single large split. Hence you will see fewer mappers. If
>>> you are expecting one mapper per HDFS file, then try disabling
>>> CombineHiveInputFormat with "set
>>> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;".
>>> Another way to control the number of mappers is by adjusting the min
>>> and max split sizes.
>>>
>>> Thanks
>>> Prasanth Jayachandran
>>>
>>> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>
>>>> What is the size of the table (in GBs)?
>>>>
>>>> What max and min split sizes have you provided?
>>>>
>>>> On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <gourav.had...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to run a join using two tables stored in ORC file format.
>>>>>
>>>>> The first table has 34 million records and the second has around
>>>>> 300,000 records.
>>>>>
>>>>> Setting "set hive.auto.convert.join=true" makes the entire query run
>>>>> via a single mapper.
>>>>> If I set "set hive.auto.convert.join=false", then there are two
>>>>> mappers: the first one reads the second table, and then the entire
>>>>> large table goes through the second mapper.
>>>>>
>>>>> Is there something that I am doing wrong? There are three nodes in the
>>>>> Hadoop cluster currently, and I was expecting that at least 6 mappers
>>>>> would have been used.
>>>>>
>>>>> Thanks and Regards,
>>>>> Gourav
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>> --
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity
>>> to which it is addressed and may contain information that is
>>> confidential, privileged and exempt from disclosure under applicable
>>> law. If the reader of this message is not the intended recipient, you
>>> are hereby notified that any printing, copying, dissemination,
>>> distribution, disclosure or forwarding of this communication is strictly
>>> prohibited. If you have received this communication in error, please
>>> contact the sender immediately and delete it from your system. Thank You.
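For what it is worth, the numbers quoted in this thread suggest the 7 GB fact table should yield many splits once CombineHiveInputFormat is out of the way. A back-of-the-envelope check, assuming splits are cut purely by mapred.max.split.size (a simplification; the real planner also respects block and stripe boundaries):

```python
import math

# Values taken from the job XML quoted in the thread.
max_split_size = 256_000_000      # mapred.max.split.size, in bytes
fact_table_size = 7 * 1024**3     # the ~7 GB single ORC file

# Rough upper bound on input splits (hence map tasks) when
# splitting by size alone with HiveInputFormat.
splits = math.ceil(fact_table_size / max_split_size)
print(splits)  # -> 30
```

So a single mapper on this file points at split combining or an oversized effective block size, not at the file being too small to split.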