Hi,

I regenerated the entire base data and added the following configuration changes to hive-site.xml and mapred-site.xml. Multiple mappers and reducers now run for the same query. I am still not sure how to use the ORC file stripe size to increase the number of mappers.
Is there any other performance optimization that I could have done? Please do advise.

=======================================
mapred-site.xml and hive-site.xml changes:
---------------------------------------------------------------
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024M</value>
  <description>setting memory for child jobs</description>
</property>
=======================================

=======================================
hive run-time configuration:
---------------------------------------
"set mapred.reduce.tasks=4"
"set hive.auto.convert.join=true"
=======================================

Thanks and Regards,
Gourav Sengupta

On Thu, Oct 10, 2013 at 9:16 AM, Gourav Sengupta <gourav.had...@gmail.com> wrote:

> Hi,
>
> The entire table of 34 million records is in a single ORC file, and it is
> around 7 GB in size. The other ORC file is a dimension table with less than
> 40 MB of records, once again in a single ORC file.
>
> I do not remember setting the ORC file stripe size anywhere.
>
> The problem I am facing is that the query triggers only a single mapper
> even though the cluster has three nodes. Unlike other posts here, I need
> more mappers.
>
> The other properties asked about are listed below, taken from the job XML file:
>
> <property><name>mapred.min.split.size.per.node</name><value>1</value></property>
> and
> <property><name>mapred.max.split.size</name><value>256000000</value></property>
>
> I am sure that there is no issue with the Hadoop configuration, as with some
> other queries I am getting more than 24 mappers.
>
> Please accept my sincere regards for your kind help and insights.
>
> Thanks,
> Gourav Sengupta
>
> On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
>
>> What is your ORC file stripe size?
>> How many ORC files are there in each of the tables? It could be possible
>> that ORC compressed the file so much that the file size is less than the
>> HDFS block size. Can you please report the file size of the two ORC files?
>>
>> Another possibility is that there are many small files. In that case, by
>> default Hive uses CombineHiveInputFormat, which combines many small files
>> into a single large split. Hence you will see fewer mappers. If you are
>> expecting one mapper per HDFS file, then try disabling
>> CombineHiveInputFormat with
>> "set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;".
>> Another way to control the number of mappers is by adjusting the min and
>> max split size.
>>
>> Thanks
>> Prasanth Jayachandran
>>
>> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>> > What's the size of the table (in GB)?
>> >
>> > What max and min split sizes have you provided?
>> >
>> > On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <gourav.had...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I am trying to run a join using two tables stored in ORC file format.
>> >>
>> >> The first table has 34 million records and the second has around
>> >> 300,000 records.
>> >>
>> >> Setting "set hive.auto.convert.join=true" makes the entire query run
>> >> via a single mapper.
>> >> If I set "set hive.auto.convert.join=false", then there are two
>> >> mappers: the first one reads the second table, and then the entire
>> >> large table goes through the second mapper.
>> >>
>> >> Is there something that I am doing wrong? There are three nodes in
>> >> the Hadoop cluster currently, and I was expecting that at least 6
>> >> mappers would be used.
>> >>
>> >> Thanks and Regards,
>> >> Gourav
>> >
>> > --
>> > Nitin Pawar
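
For anyone trying the suggestions from this thread, here is a rough sketch of what the input-format, split-size, and stripe-size changes might look like in a Hive script. The table names fact_orc and fact_orc_small_stripes are hypothetical, and the byte values are illustrative rather than tuned recommendations. Option 3 is an assumption that goes beyond what the thread confirms: it presumes a Hive version whose ORC writer honors the orc.stripe.size table property, and whether smaller stripes actually yield more mappers depends on how that version computes ORC splits.

```sql
-- Option 1 (from Prasanth's reply): use the non-combining input format,
-- so that small files are not merged into one combined split.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Option 2 (from Prasanth's reply): adjust the min and max split size.
-- 134217728 bytes = 128 MB (illustrative; the job in this thread
-- reported mapred.max.split.size=256000000).
set mapred.min.split.size=1;
set mapred.max.split.size=134217728;

-- Option 3 (assumption, not confirmed in the thread): rewrite the large
-- table with a smaller ORC stripe size so the file has more stripe
-- boundaries. 67108864 bytes = 64 MB. Both table names are hypothetical.
CREATE TABLE fact_orc_small_stripes
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="67108864")
AS SELECT * FROM fact_orc;
```

As a sanity check on the numbers: if splits were cut purely by size, a file of about 7 GB with a 128 MB cap would give on the order of 7 x 1024 / 128 = 56 splits. The real count can be lower, so it is worth comparing the mapper count reported for the job after each change, one change at a time.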