Re: SequenceFile split question
Hi Mohit If you are using a standalone client application, there is just one instance of it running, so you'd be writing the sequence file to one HDFS block at a time. Once it reaches the HDFS block size, writing continues to the next block; in the meantime the first block is replicated. If you run the same job distributed as MapReduce, you'd be writing to n files at a time, where n is the number of tasks in your MapReduce job. AFAIK the datanode where the blocks are placed is determined by Hadoop; it is not controlled by the end-user application. But if you are triggering the standalone job on a particular datanode and it has space, one replica would be stored on that node. The same applies to MR tasks as well. Regards Bejoy.K.S On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I have a client program that creates a SequenceFile, which essentially merges small files into a big file. I was wondering how the sequence file splits the data across nodes. When I start, the sequence file is empty. Does it get split when it reaches the dfs.block size? If so, does it mean that I am always writing to just one node at a given point in time? If I start a new client writing a new sequence file, is there a way to select a different data node?
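For reference, a minimal sketch of the standalone merger being discussed, using the Hadoop 1.x SequenceFile API; the class name, output path and readLocalFile helper are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/user/mohit/merged.seq"); // hypothetical output path
    // A single writer, so the file fills one HDFS block at a time, as described above.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (String name : args) { // each argument: one small local file to merge
        byte[] data = readLocalFile(name);
        writer.append(new Text(name), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }

  // Hypothetical helper: slurp a local file into a byte array.
  private static byte[] readLocalFile(String name) throws java.io.IOException {
    java.io.File f = new java.io.File(name);
    byte[] buf = new byte[(int) f.length()];
    java.io.DataInputStream in =
        new java.io.DataInputStream(new java.io.FileInputStream(f));
    try { in.readFully(buf); } finally { in.close(); }
    return buf;
  }
}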
slaves could not connect on 9000 and 9001 ports of master
Hi all, we made a pilot cluster on 3 machines and tested some aspects of Hadoop. Now, trying to set up Hadoop on 32 nodes, we hit the problem below: org.apache.hadoop.ipc.Client: Retrying connect to server: master/*.*.*.*:9000. Already tried 0 time(s). The same happens for 9001, even though we opened these ports on the master. We use NAT to set up our Linux network. Let me know your ideas, Thanks, Masoud
Re: Matrix Multiplication using Hadoop
Hi Shailesh, Please check the implementation: with proper non-zero numbers in the input, is it giving you the right output? A while back I took the code from the link you mentioned, but for some reason it was not working. I have since written my own implementation of matrix multiplication on Hadoop. Regards, Naveen On Thu, Mar 15, 2012 at 12:40 AM, Shailesh shailesh.shai...@gmail.com wrote: Hello, My question is posted in the link below: http://stackoverflow.com/q/9708427/1269809?sem=2 Any help or feedback would be very helpful. Regards, Shailesh
Re: Using a combiner
Another important note: combiner runs can stack. Let's say Prashant is right that the default spill count that triggers the combiner is 3, and that we have a mapper that generates 9 spills. These spills will trigger 3 combiner runs, which meets the threshold again, and so we get *another* combiner run on the outputs of the first round of combiners. The upshot is that you *must* make the input and output keys and values of a combiner the same class, since the outputs of one combiner may well be fed into the inputs of another. hth On 03/14/2012 06:32 PM, Prashant Kommireddi wrote: It is a function of the number of spills on the map side, and I believe the default is 3. So for every 3 times data is spilled, the combiner is run. This number is configurable. Sent from my iPhone On Mar 14, 2012, at 3:26 PM, Gayatri Rao rgayat...@gmail.com wrote: Hi all, I have a quick query on using a combiner in an MR job. Is it true that the framework decides whether or not the combiner gets called? Can anyone please give more information on how this is done. Thanks, Gayatri
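To illustrate the type constraint above: a sum-style combiner (sketched here for a hypothetical word-count job) keeps the input and output key/value classes identical, so it remains correct however many times the framework stacks it:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// In: (Text, IntWritable); out: (Text, IntWritable). Because the types match,
// the output of one combiner run is a legal input to the next.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get(); // partial sums combine associatively, so stacking is safe
    }
    ctx.write(key, new IntWritable(sum));
  }
}

// Wired in via job.setCombinerClass(SumCombiner.class), where job is your Job instance.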
Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?
Greetings All!!! I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes, of which 5 are used for a fully distributed cluster, 1 for a pseudo-distributed setup, and 1 as a management node. Fully distributed cluster: HDFS, MapReduce and HBase. Pseudo-distributed mode: all components. I had read that we can install Pig, Hive and Sqoop on the client node, with no need to install them in the cluster. What is the client node actually? Can I use my management node as a client? What is the best practice for installing Pig, Hive and Sqoop? For the fully distributed cluster, do we need to install Pig, Hive and Sqoop on each node? MySQL is needed for Hive as a metastore, and Sqoop can import a MySQL database to HDFS, Hive or Pig, so can we make use of MySQL DBs residing on another node? -- Thanks Regards Manu S SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855 Skype: manuspkd www.opensourcetalk.co.in
Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?
Hi Manu Please find my responses inline.

> I had read that we can install Pig, Hive and Sqoop on the client node, with no need to install them in the cluster. What is the client node actually? Can I use my management node as a client?

On larger clusters there is a separate node, outside the Hadoop cluster, from which user programs are triggered. This is the node referred to as the client node / edge node. For your cluster, the management node and the client node can be the same.

> What is the best practice for installing Pig, Hive and Sqoop?

On a client node.

> For the fully distributed cluster, do we need to install Pig, Hive and Sqoop on each node?

No, they can be on a client node or on any one of the nodes.

> MySQL is needed for Hive as a metastore, and Sqoop can import a MySQL database to HDFS, Hive or Pig, so can we make use of MySQL DBs residing on another node?

Regarding the first part: Sqoop import serves a different purpose, getting data from an RDBMS into HDFS. The metastore, by contrast, is used by Hive when framing the MapReduce jobs for your Hive query, so Sqoop can't help you there. I recommend keeping Hive's metastore DB on the same node where Hive is installed, since executing Hive queries requires frequent metadata lookups, especially when your tables have a large number of partitions. Regards Bejoy.K.S

On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com wrote: [...]
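For reference, pointing Hive at a MySQL-backed metastore comes down to the JDBC connection properties in hive-site.xml; a sketch with a hypothetical host, database and credentials (the MySQL JDBC driver jar must also be on Hive's classpath):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>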
Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?
Thanks a lot Bejoy, that makes sense :) Suppose I have a MySQL database on some other node (not in the Hadoop cluster); can I import its tables into my HDFS using Sqoop? On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com wrote: [...] -- Thanks Regards Manu S SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855 Skype: manuspkd www.opensourcetalk.co.in
Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?
On 03/15/2012 09:22 AM, Manu S wrote: Thanks a lot Bejoy, that makes sense :) Suppose I have a MySQL database on some other node (not in the Hadoop cluster); can I import its tables into my HDFS using Sqoop? Yes, this is the main purpose of Sqoop. On the Cloudera site you have the complete documentation for it: Sqoop User Guide http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html Sqoop installation https://ccp.cloudera.com/display/CDHDOC/Sqoop+Installation Sqoop for MySQL http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_mysql Sqoop site on GitHub http://github.com/cloudera/sqoop Cloudera blog posts related to Sqoop http://www.cloudera.com/blog/category/sqoop/ Best wishes On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com wrote: [...] -- Marcos Luis Ortíz Valmaseda Sr. Software Engineer (UCI) http://marcosluis2186.posterous.com http://postgresql.uci.cu/blog/38
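As a concrete illustration of such an import, a minimal Sqoop 1 invocation might look like the following; the hostname, database, table and target directory are all hypothetical:

$ sqoop import \
    --connect jdbc:mysql://dbhost.example.com/salesdb \
    --username sqoopuser -P \
    --table customers \
    --target-dir /user/manu/customers

-P prompts for the password; Sqoop then runs a MapReduce job that writes the table's rows into files under the given HDFS directory.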
Re: SequenceFile split question
Thanks! That helps. I am reading small XML files from an external file system and then writing them to the SequenceFile. I made it a standalone client, thinking that MapReduce may not be the best way to do this type of writing. My understanding was that MapReduce is best suited for processing data already within HDFS. Is MapReduce also one of the options I should consider? On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks bejoy.had...@gmail.com wrote: [...]
Re: SequenceFile split question
Hi Mohit You are right. If your smaller XML files are in HDFS, then MR would be the best approach to combine them into a sequence file; it'd do the job in parallel. Regards Bejoy.K.S On Thu, Mar 15, 2012 at 8:17 PM, Mohit Anchlia mohitanch...@gmail.com wrote: [...]
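A sketch of that MR approach, assuming the small files have first been copied into HDFS and that a custom whole-file input format is available (WholeFileInputFormat below is hypothetical: one record per file, key = path, value = file bytes; the stock input formats split by line, so a custom one is needed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MergeToSequenceFile {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "merge-small-xml");
    job.setJarByClass(MergeToSequenceFile.class);
    job.setInputFormatClass(WholeFileInputFormat.class); // hypothetical custom format
    // Map-only: the default identity mapper passes (path, bytes) straight through,
    // and each mapper writes its own sequence file in parallel.
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}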
Issue when starting services on CDH3
I have CDH3 installed in standalone mode, and I have installed all the Hadoop components. When I start the services (namenode, secondary namenode, jobtracker, tasktracker) from /usr/lib/hadoop/ with ./bin/start-all.sh, they start gracefully. But when I start the same services from /etc/init.d/hadoop-0.20-*, I am unable to start them. Why? I also want to start Hue, which is in init.d, and I couldn't start that either. I suspect an authentication issue, because all the services in init.d are under the root user and root group. Please suggest; I am stuck here. I tried Hive and it seems to be running fine. Thanks Manish. Sent from my BlackBerry, pls excuse typo
Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?
Thanks a lot all :-) On Mar 15, 2012 7:03 PM, Marcos Ortiz mlor...@uci.cu wrote: [...]
Re: Issue when starting services on CDH3
Dear Manish, which daemons are not starting? On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com wrote: [...]
Re: Issue when starting services on CDH3
Manu, None of the services is coming up, including the namenode, secondary namenode, tasktracker and jobtracker. Sent from my BlackBerry, pls excuse typo -----Original Message----- From: Manu S manupk...@gmail.com Date: Thu, 15 Mar 2012 21:31:34 To: common-user@hadoop.apache.org; manishbh...@rocketmail.com Subject: Re: Issue when starting services on CDH3 [...]
Re: Issue when starting services on CDH3
Did you check the service status? Is it like "dead, but pid file exists"? Did you check the ownership and permissions of dfs.name.dir, dfs.data.dir, mapred.local.dir, etc.? The order for starting the daemons is: 1. namenode 2. datanode 3. jobtracker 4. tasktracker. Did you format the namenode before starting? On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote: [...]
Re: Issue when starting services on CDH3
Are you running the init.d scripts as root, and what is the order of the services you want to start? Sent from my iPhone On Mar 15, 2012, at 11:22 AM, Manish Bhoge manishbh...@rocketmail.com wrote: Yes, I understand the order, and I formatted the namenode before starting the services. I suspect there may be an ownership or access issue, but I am not able to nail the issue down exactly. I also have a question: why are there two routes to start the services? When we have the start-all.sh script, why do we need to go to init.d to start the services? Thank you, Manish Sent from my BlackBerry, pls excuse typo -----Original Message----- From: Manu S manupk...@gmail.com Date: Thu, 15 Mar 2012 21:43:26 To: common-user@hadoop.apache.org; manishbh...@rocketmail.com Reply-To: common-user@hadoop.apache.org Subject: Re: Issue when starting services on CDH3 [...]
Re: Capacity Scheduler APIs
Does anybody have an answer to this question? Harshad On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote: Hi all, are there any Capacity Scheduler APIs that I can use? E.g. adding/removing queues, tuning properties on the fly, and so on. Any help is appreciated. Thanks Harshad
Mapper Only Job, Without Input or Output Path
Hi, I have a use case - I have files lying on the local disk of every node in my cluster. I want to write a mapper-only MapReduce job that reads the files off the local disk on every machine, applies some transformation, and writes to HDFS. Specifically: 1. The job shouldn't have any input/output paths, and should use null key-value pairs. 2. Mapper only. 3. I want to be able to control the number of mappers, depending on the size of my cluster. What's the best way to do this? I would appreciate any example code. Deepak
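One common pattern for this (a sketch under assumptions, not a definitive recipe: the class names and the example.num.mappers property are made up) is a map-only job with a NullOutputFormat and a custom InputFormat that fabricates N empty splits, so N mappers start without reading any HDFS input:

import java.io.DataInput;
import java.io.DataOutput;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NSplitsInputFormat extends InputFormat<NullWritable, NullWritable> {

  // A split that carries no data; it exists only to schedule one mapper.
  public static class EmptySplit extends InputSplit implements Writable {
    public long getLength() { return 0; }
    public String[] getLocations() { return new String[0]; }
    public void write(DataOutput out) {}
    public void readFields(DataInput in) {}
  }

  @Override
  public List<InputSplit> getSplits(JobContext ctx) {
    // "example.num.mappers" is a made-up property name; set it in your job conf.
    int n = ctx.getConfiguration().getInt("example.num.mappers", 1);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < n; i++) splits.add(new EmptySplit());
    return splits;
  }

  @Override
  public RecordReader<NullWritable, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext ctx) {
    // Hands the mapper exactly one (null, null) record, so map() runs once per task.
    return new RecordReader<NullWritable, NullWritable>() {
      private boolean done = false;
      public void initialize(InputSplit s, TaskAttemptContext c) {}
      public boolean nextKeyValue() { boolean first = !done; done = true; return first; }
      public NullWritable getCurrentKey() { return NullWritable.get(); }
      public NullWritable getCurrentValue() { return NullWritable.get(); }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() {}
    };
  }
}

// Wiring (job is your Job instance):
//   job.setInputFormatClass(NSplitsInputFormat.class);
//   job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.NullOutputFormat.class);
//   job.setNumReduceTasks(0);

Note that this controls only how many mappers run, not where they run; Hadoop does not guarantee one mapper per node.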
Re: Issue when starting services on CDH3
Because for large clusters we have to run the namenode on a single node and the datanodes on the other nodes: we start the namenode and jobtracker on the master node and a datanode and tasktracker on each slave node. For more clarity, check the service status after starting. Verify these (owner:group and permissions):

dfs.name.dir      hdfs:hadoop    drwx------
dfs.data.dir      hdfs:hadoop    drwx------
mapred.local.dir  mapred:hadoop  drwxr-xr-x

Please follow each step in this link: https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster On Mar 15, 2012 9:52 PM, Manish Bhoge manishbh...@rocketmail.com wrote: [...]
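As a concrete sketch of those checks on a CDH3 package install (the data directory paths below are examples; substitute whatever your dfs.name.dir, dfs.data.dir and mapred.local.dir are actually set to):

# status of an individual init.d service ("dead, but pid file exists" shows up here)
sudo service hadoop-0.20-namenode status
# the daemon log usually names the offending directory
less /var/log/hadoop-0.20/*namenode*.log
# fix ownership and permissions on the storage directories (example paths)
sudo chown -R hdfs:hadoop /data/1/dfs/nn /data/1/dfs/dn
sudo chmod 700 /data/1/dfs/nn /data/1/dfs/dn
sudo chown -R mapred:hadoop /data/1/mapred/local
sudo chmod 755 /data/1/mapred/local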
Re: Capacity Scheduler APIs
Hi Harshad, have you looked into the file conf/capacity-scheduler.xml? You can assign and change parameters like the capacity of each queue, reclaim time, and job priorities. Is that what you're looking for? Shailesh On Thu, Mar 15, 2012 at 12:57 PM, hdev ml hde...@gmail.com wrote: [...]
Re: Capacity Scheduler APIs
Thanks for the email, Shailesh. I am looking for a Java API to manage queues. I have already defined queues in capacity-scheduler.xml and everything works fine. But my question is: can the same thing be done without restarting the cluster or the namenode? The only option I see is a Java API, hence the question. Please let me know. Harshad On Thu, Mar 15, 2012 at 10:33 AM, Shailesh shailesh.shai...@gmail.com wrote: [...]
Re: Capacity Scheduler APIs
Hi Harshad, Have you looked into the CapacitySchedulerConf.java class? http://www.java2s.com/Open-Source/Java/Database-DBMS/hadoop-0.20.1/org/apache/hadoop/mapred/CapacitySchedulerConf.java.htm I don't know whether it can be done without restarting the cluster or the namenode. On Thu, Mar 15, 2012 at 2:03 PM, hdev ml hde...@gmail.com wrote: [...]
Re: Issue when starting services on CDH3
Guys, can you please take this up on the CDH-related mailing lists? On Thu, Mar 15, 2012 at 10:01 AM, Manu S manupk...@gmail.com wrote: [...]
Re: Capacity Scheduler APIs
To refresh your queues, you may do, as your MR admin user: $ hadoop mradmin -refreshQueues I am not sure if this covers CS config refreshes, but let us know if it does. The above command is present in Apache Hadoop 1.x. On Fri, Mar 16, 2012 at 12:08 AM, Shailesh shailesh.shai...@gmail.com wrote: [...] -- Harsh J
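As an example of that workflow: in the Hadoop 1.x capacity scheduler, a queue's share is set in conf/capacity-scheduler.xml (the queue name and value below are hypothetical), and the change is then pushed to the running JobTracker:

<property>
  <name>mapred.capacity-scheduler.queue.etl.capacity</name>
  <value>40</value>
  <!-- percent of cluster slots for the (hypothetical) 'etl' queue -->
</property>

$ hadoop mradmin -refreshQueues   # apply the edit without a restart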
Re: Issue when starting services on CDH3
To add to Suresh's guideline, since he may have missed providing a link: you can visit the CDH users community at https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user On Fri, Mar 16, 2012 at 12:13 AM, Suresh Srinivas sur...@hortonworks.com wrote: [...] -- Harsh J
YARN applications not running
Hello all, When submitting an HBase export job to YARN, I see it appearing on the web UI but for some reason the job never starts; it constantly stays at 0% complete. I am using hadoop 0.23 and hbase 0.92 (CDH4 beta 1). I see the NodeManagers connecting to the ResourceManager:

2012-03-15 19:36:10,585 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: torb1pnb001.dataraker.net:46696 Node Transitioned from NEW to RUNNING
2012-03-15 19:36:16,633 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved torb1pnb002.dataraker.net to /default-rack
2012-03-15 19:36:16,633 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node torb1pnb002.dataraker.net(cmPort: 35665 httpPort: ) registered with capability: 1000, assigned nodeId torb1pnb002.dataraker.net:35665
2012-03-15 19:36:16,634 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: torb1pnb002.dataraker.net:35665 Node Transitioned from NEW to RUNNING
[ etc... ]

and the job being submitted to the ResourceManager:

2012-03-15 19:40:29,248 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1
2012-03-15 19:40:31,323 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1331840162147_0001 State change from NEW to SUBMITTED
2012-03-15 19:40:31,323 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1331840162147_0001_01
2012-03-15 19:40:31,323 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1331840162147_0001_01 State change from NEW to SUBMITTED
2012-03-15 19:40:31,327 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1 submitted by user hdfs with application_id [..snip..]
2012-03-15 19:40:31,329 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs IP=10.192.16.64 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1331840162147_0001
2012-03-15 19:40:31,333 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Application Submission: application_1331840162147_0001 from hdfs, currently active: 1
2012-03-15 19:40:31,336 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1331840162147_0001_01 State change from SUBMITTED to SCHEDULED
2012-03-15 19:40:31,336 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1331840162147_0001 State change from SUBMITTED to ACCEPTED

but after the NodeManager starts, the log never indicates any requests from the ResourceManager:

2012-03-15 19:36:16,604 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connected to ResourceManager at torb1pna001:8025
2012-03-15 19:36:16,645 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as torb1pnb002.dataraker.net:35665 with total resource of memory: 1000
2012-03-15 19:36:16,645 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started.
2012-03-15 19:36:16,646 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
[ end of log ]

I am seeing strange errors in Zookeeper when the job is submitted:

2012-03-15 16:58:00,216 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection from /127.0.0.1:33262
2012-03-15 16:58:00,219 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client attempting to establish new session at /127.0.0.1:33262
2012-03-15 16:58:00,229 - INFO [CommitProcessor:0:ZooKeeperServer@604] - Established session 0x35d53d539f0071 with negotiated timeout 4 for client /127.0.0.1:33262
2012-03-15 16:58:48,884 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x35d53d539f0071, likely client has closed socket
 at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
 at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
 at java.lang.Thread.run(Thread.java:662)
2012-03-15 16:58:48,885 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /127.0.0.1:33262 which had sessionid 0x35d53d539f0071
2012-03-15 17:02:59,968 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection from /127.0.0.1:59652
2012-03-15 17:02:59,971 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client attempting to
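One likely cause worth checking, going by the "registered with capability: 1000" lines above: in 0.23 the MapReduce ApplicationMaster asks for more memory by default (1536 MB) than any of these NodeManagers is offering (1000 MB), so no container can be allocated and the app sits at SCHEDULED/0%. A hedged sketch of the relevant settings (values are examples, not recommendations):

<!-- yarn-site.xml: raise what each NodeManager can offer -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>

<!-- or, in mapred-site.xml, shrink the AM's request instead -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>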
Re: EOFException
On 03/15/2012 03:06 PM, Mohit Anchlia wrote: When I start a job to read data from HDFS I start getting these errors. Does anyone know what this means and how to resolve it?

2012-03-15 10:41:31,402 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.204:50010 java.io.EOFException
2012-03-15 10:41:31,402 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-6402969611996946639_11837
2012-03-15 10:41:31,403 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.204:50010
2012-03-15 10:41:31,406 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.198:50010 java.io.EOFException
2012-03-15 10:41:31,406 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5442664108986165368_11838
2012-03-15 10:41:31,407 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.197:50010 java.io.EOFException
2012-03-15 10:41:31,407 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-3373089616877234160_11838
2012-03-15 10:41:31,407 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.198:50010
2012-03-15 10:41:31,409 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.197:50010
2012-03-15 10:41:31,410 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.204:50010 java.io.EOFException
2012-03-15 10:41:31,410 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_4481292025401332278_11838
2012-03-15 10:41:31,411 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.204:50010
2012-03-15 10:41:31,412 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.200:50010 java.io.EOFException
2012-03-15 10:41:31,412 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5326771177080888701_11838
2012-03-15 10:41:31,413 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.200:50010
2012-03-15 10:41:31,414 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.197:50010 java.io.EOFException
2012-03-15 10:41:31,414 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-8073750683705518772_11839
2012-03-15 10:41:31,415 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.197:50010
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.199:50010 java.io.EOFException
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.198:50010 java.io.EOFException
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_441003866688859169_11838
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-466858474055876377_11839
2012-03-15 10:41:31,417 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.198:50010
2012-03-15 10:41:31,417 [Thread-5] WARN org.apache.hadoop.hdfs.DFSClient -

Try shutting down and restarting hbase.
Re: EOFException
This is actually just a Hadoop job over HDFS. I am assuming you also know why this is erroring out? On Thu, Mar 15, 2012 at 1:02 PM, Gopal absoft...@gmail.com wrote: [...] Try shutting down and restarting hbase.
Re: Capacity Scheduler APIs
Thanks Shailesh/Harsh, I will try the hadoop command first and then the internal code. Thanks again. Harshad. On Thu, Mar 15, 2012 at 12:06 PM, Harsh J ha...@cloudera.com wrote: [...]
Suggestion for InputSplit and InputFormat - Split every line.
Hi, I have this use case - I need to spawn as many mappers as there are lines in a file in HDFS. The file isn't big (only 10-50 lines); each line represents the path of another data source that the mappers will work on. So each mapper will read one line (the map() method will need to be called only once) and work on that data source. What's the best way to construct the InputSplit, InputFormat and RecordReader to achieve this? I would appreciate any example code :) Best, Deepak
Re: Suggestion for InputSplit and InputFormat - Split every line.
Have a look at the NLineInputFormat class in Hadoop. It is built to split the input on the basis of the number of lines. On Thu, Mar 15, 2012 at 6:13 PM, Deepak Nettem deepaknet...@gmail.com wrote: [...] -- Thanks Regards, Anil Gupta
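A sketch of the wiring with the newer org.apache.hadoop.mapreduce API (available as NLineInputFormat in 0.21+/CDH3; on a plain 0.20 release the old-API equivalent is org.apache.hadoop.mapred.lib.NLineInputFormat with the mapred.line.input.format.linespermap property). The input path and SourceMapper are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class OneLinePerMapper {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "one-line-per-mapper");
    job.setJarByClass(OneLinePerMapper.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1); // one mapper per line of the file
    NLineInputFormat.addInputPath(job, new Path("/user/deepak/sources.txt")); // hypothetical
    // SourceMapper is hypothetical: its map(offset, line) is called exactly once,
    // with 'line' holding the path of the data source to process.
    job.setMapperClass(SourceMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}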