Re: SequenceFile split question

2012-03-15 Thread Bejoy Ks
Hi Mohit
  If you are using a standalone client application to do this, there is
definitely just one instance of it running, so you'd be writing the sequence
file to one HDFS block at a time. Once it reaches the HDFS block size, writing
continues on the next block; in the meantime the first block is replicated. If
you are doing the same job distributed as MapReduce, you'd be writing to n
files at a time, where n is the number of tasks in your MapReduce job.
 AFAIK the data node where the blocks are placed is determined by Hadoop;
it is not controlled by the end-user application. But if you are triggering
the standalone job on a particular data node and it has space, one replica
would be stored on that same node. The same applies to MR tasks as well.

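For reference, here is a minimal sketch of the kind of standalone writer being
discussed, against the 0.20/CDH3-era SequenceFile API (the output path and the
appended record are made-up placeholders, not anything from this thread). The
point to notice is that the client only ever appends to one open file; HDFS
allocates and replicates blocks underneath it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/user/mohit/merged.seq");   // hypothetical output path

    // One writer, one open file: HDFS moves on to a new block (and replicates
    // the finished one) as the data crosses the block size; the client never
    // chooses data nodes itself.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // key = original file name, value = file contents (both placeholders)
      writer.append(new Text("small-file-001.xml"), new Text("<doc>...</doc>"));
    } finally {
      writer.close();
    }
  }
}
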
Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I have a client program that creates a SequenceFile, which essentially merges
 small files into a big file. I was wondering how the sequence file splits
 the data across nodes. When I start, the sequence file is empty. Does it
 get split when it reaches the dfs.block size? If so, does it mean that
 I am always writing to just one node at a given point in time?

 If I start a new client writing a new sequence file then is there a way to
 select a different data node?



slaves could not connect on 9000 and 9001 ports of master

2012-03-15 Thread Masoud

Hi all,

We made a pilot cluster on 3 machines and tested some aspects of Hadoop.
Now, while trying to set up Hadoop on 32 nodes, we hit the problem below:

org.apache.hadoop.ipc.Client: Retrying connect to server: 
master/*.*.*.*:9000. Already tried 0 time(s).


The same happens even for 9001, although we opened these ports on the master.
We use NAT to set up our Linux network.

Let me know your ideas,

Thanks,
Masoud


Re: Matrix Multiplication using Hadoop

2012-03-15 Thread Naveen Mahale
Hi Shailesh,

Please check the implementation: with proper numbers (other than zeros) in
the input, is it giving you the right output? A long time back I had taken the
code from the link you mentioned, but for some reason it was not working. I
have since written my own implementation of matrix multiplication on Hadoop.

Regards,
Naveen

On Thu, Mar 15, 2012 at 12:40 AM, Shailesh shailesh.shai...@gmail.comwrote:

 Hello,

 My question is posted in the link below:

 http://stackoverflow.com/q/9708427/1269809?sem=2

 Any help or feedback would be very helpful.

 Regards,
 Shailesh



Re: Using a combiner

2012-03-15 Thread John Armstrong

Another important note: the combiner runs can stack.

Let's say Prashant is right that the default spill number that triggers 
the combiner is 3, and that we have a mapper that generates 9 spills. 
These spills will generate 3 combiner runs, which meets the threshold 
again, and so we get *another* combiner run on the outputs of the first 
round of combiners.


The upshot is that you *must* make the input and output keys and values
of a Combiner the same classes, since the outputs of one combiner may well
be fed in as the inputs of another.

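As a concrete illustration of that constraint, here is a minimal sketch in the
style of the stock word-count example (class names are illustrative; the spill
threshold Prashant mentions is, as far as I recall, the
min.num.spills.for.combine property in 1.x). Reusing the reducer as the
combiner is only legal because its input and output types are identical, so
stacked combiner passes stay type-correct.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerTypesExample {
  // Used both as combiner and reducer: it consumes and produces
  // <Text, IntWritable>, so the output of one combiner pass is a legal
  // input to the next pass (or to the reducer).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      ctx.write(key, result);
    }
  }

  public static Job buildJob(Configuration conf) throws IOException {
    Job job = new Job(conf, "combiner-types-example");
    job.setJarByClass(CombinerTypesExample.class);
    // Mapper and input/output paths omitted; the mapper must emit
    // <Text, IntWritable> pairs for this wiring to work at runtime.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job;
  }
}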

hth


On 03/14/2012 06:32 PM, Prashant Kommireddi wrote:

It is a function of the number of spills on the map side, and I believe
the default is 3. So for every 3 spills of data the combiner is
run. This number is configurable.

Sent from my iPhone

On Mar 14, 2012, at 3:26 PM, Gayatri Raorgayat...@gmail.com  wrote:


Hi all,

I have a quick query on using a combiner in an MR job. Is it true that the
framework decides whether or not the combiner gets called?
Can anyone please give more information on how this is done?

Thanks,
Gayatri






Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?

2012-03-15 Thread Manu S
Greetings All !!!

I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes, of which 5
are used for a fully distributed cluster, 1 for pseudo-distributed mode & 1 as
a management node.

Fully distributed cluster: HDFS, MapReduce & HBase cluster
Pseudo-distributed mode: all components

I had read that we can install Pig, Hive & Sqoop on the client node, with no
need to install them in the cluster. What is the client node actually? Can I
use my management node as a client?

What is the best practice to install Pig, Hive & Sqoop?
For the fully distributed cluster, do we need to install Pig, Hive & Sqoop
on each node?

MySQL is needed for Hive as a metastore, and Sqoop can import a MySQL database
into HDFS, Hive or Pig, so can we make use of a MySQL DB residing on
another node?

-- 
Thanks & Regards

Manu S
SI Engineer - OpenSource & HPC
Wipro Infotech
Mob: +91 8861302855  Skype: manuspkd
www.opensourcetalk.co.in


Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?

2012-03-15 Thread Bejoy Ks
Hi Manu
  Please find my responses inline

I had read that we can install Pig, Hive & Sqoop on the client node, with no
need to install them in the cluster. What is the client node actually? Can I
use my management node as a client?

On larger clusters there is a separate node that sits outside the Hadoop
cluster, and these tools live there, so user programs are triggered from that
node. This is the node referred to as the client node / edge node etc. For
your cluster, the management node and the client node can be the same.

What is the best practice to install Pig, Hive & Sqoop?

On a client node.

For the fully distributed cluster, do we need to install Pig, Hive & Sqoop
on each node?

No, they can be on a client node or on any one of the nodes.

MySQL is needed for Hive as a metastore, and Sqoop can import a MySQL database
into HDFS, Hive or Pig, so can we make use of a MySQL DB residing on
another node?

Regarding your first point, Sqoop import serves a different purpose: getting
data from an RDBMS into HDFS. The metastore, on the other hand, is used by
Hive in framing the MapReduce jobs corresponding to your Hive query, so Sqoop
can't help you much there. I'd recommend keeping Hive's metastore DB on the
same node where Hive is installed, since executing Hive queries requires a
lot of metadata lookups, especially when your tables have a large number of
partitions.

Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com wrote:

 Greetings All !!!

 I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes, in which 5
 are used for a fully distributed cluster, 1 for pseudo-distributed  1 as
 management-node.

 Fully distributed cluster: HDFS, Mapreduce  Hbase cluster
 Pseudo distributed mode: All

 I had read about we can install Pig, hive  Sqoop on the client node, no
 need to install it in cluster. What is the client node actually? Can I use
 my management-node as a client?

 What is the best practice to install Pig, Hive,  Sqoop?
 For the fully distributed cluster do we need to install Pig, Hive,  Sqoop
 in each nodes?

 Mysql is needed for Hive as a metastore and sqoop can import mysql database
 to HDFS or hive or pig, so can we make use of mysql DB's residing on
 another node?

 --
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in



Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?

2012-03-15 Thread Manu S
Thanks a lot Bejoy, that makes sense :)

Suppose I have a MySQL database on some other node (not in the Hadoop cluster),
can I import its tables into my HDFS using Sqoop?


On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Manu
  Please find my responses inline

 I had read about we can install Pig, hive  Sqoop on the client node, no
 need to install it in cluster. What is the client node actually? Can I use
 my management-node as a client?

 On larger clusters we have different node that is out of hadoop cluster and
 these stay in there. So user programs would be triggered from this node.
 This is the node refereed to as client node/ edge node etc . For your
 cluster management node and client node can be the same

 What is the best practice to install Pig, Hive,  Sqoop?

 On a client node

 For the fully distributed cluster do we need to install Pig, Hive,  Sqoop
 in each nodes?

 No, can be on a client node or on any of the nodes

 Mysql is needed for Hive as a metastore and sqoop can import mysql
 database
 to HDFS or hive or pig, so can we make use of mysql DB's residing on
 another node?
 Regarding your first point, SQOOP import is for different purpose, to get
 data from RDBNS into hdfs. But the meta stores is used by hive  in framing
 the map reduce jobs corresponding to your hive query. Here SQOOP can't help
 you much
 Recommend to have the metastore db of hive on the same node where hive is
 installed as for execution hive queries there is meta data look up required
 much especially when your table has large number of partitions and all.

 Regards
 Bejoy.K.S

 On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com wrote:

  Greetings All !!!
 
  I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes, in
 which 5
  are used for a fully distributed cluster, 1 for pseudo-distributed  1 as
  management-node.
 
  Fully distributed cluster: HDFS, Mapreduce  Hbase cluster
  Pseudo distributed mode: All
 
  I had read about we can install Pig, hive  Sqoop on the client node, no
  need to install it in cluster. What is the client node actually? Can I
 use
  my management-node as a client?
 
  What is the best practice to install Pig, Hive,  Sqoop?
  For the fully distributed cluster do we need to install Pig, Hive, 
 Sqoop
  in each nodes?
 
  Mysql is needed for Hive as a metastore and sqoop can import mysql
 database
  to HDFS or hive or pig, so can we make use of mysql DB's residing on
  another node?
 
  --
  Thanks  Regards
  
  Manu S
  SI Engineer - OpenSource  HPC
  Wipro Infotech
  Mob: +91 8861302855Skype: manuspkd
  www.opensourcetalk.co.in
 




-- 
Thanks & Regards

Manu S
SI Engineer - OpenSource & HPC
Wipro Infotech
Mob: +91 8861302855  Skype: manuspkd
www.opensourcetalk.co.in


Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?

2012-03-15 Thread Marcos Ortiz



On 03/15/2012 09:22 AM, Manu S wrote:

Thanks a lot Bijoy, that makes sense :)

Suppose if I have Mysql database in some other node(not in hadoop 
cluster), can I import the tables using sqoop to my HDFS?

Yes, this is the main purpose of Sqoop.
On the Cloudera site, you have the complete documentation for it:

Sqoop User Guide
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

Sqoop installation
https://ccp.cloudera.com/display/CDHDOC/Sqoop+Installation

Sqoop for MySQL
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_mysql

Sqoop site on GitHub
http://github.com/cloudera/sqoop

Cloudera blog related post to Sqoop
http://www.cloudera.com/blog/category/sqoop/
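
Purely as a hedged sketch (host, database, table and directory names below are
made-up placeholders, and the usual route is simply the sqoop import command
line described in the guides above): assuming the Sqoop 1.x Java entry point
bundled with CDH3 (com.cloudera.sqoop.Sqoop and its runTool method), such an
import can also be driven from a small Java driver like this:

import com.cloudera.sqoop.Sqoop;   // Sqoop 1.x entry point as packaged in CDH3

public class MySqlImportExample {
  public static void main(String[] args) {
    // Same arguments as the "sqoop import" command line; all values are
    // illustrative placeholders, not details from this thread.
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/mydb",
        "--username", "dbuser",
        "--password", "dbpass",
        "--table", "mytable",
        "--target-dir", "/user/manu/mytable"
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}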


Best wishes




On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com 
mailto:bejoy.had...@gmail.com wrote:


Hi Manu
 Please find my responses inline

I had read about we can install Pig, hive  Sqoop on the client
node, no
need to install it in cluster. What is the client node actually?
Can I use
my management-node as a client?

On larger clusters we have different node that is out of hadoop
cluster and
these stay in there. So user programs would be triggered from this
node.
This is the node refereed to as client node/ edge node etc . For your
cluster management node and client node can be the same

What is the best practice to install Pig, Hive,  Sqoop?

On a client node

For the fully distributed cluster do we need to install Pig,
Hive,  Sqoop
in each nodes?

No, can be on a client node or on any of the nodes

Mysql is needed for Hive as a metastore and sqoop can import
mysql database
to HDFS or hive or pig, so can we make use of mysql DB's residing on
another node?
Regarding your first point, SQOOP import is for different purpose,
to get
data from RDBNS into hdfs. But the meta stores is used by hive  in
framing
the map reduce jobs corresponding to your hive query. Here SQOOP
can't help
you much
Recommend to have the metastore db of hive on the same node where
hive is
installed as for execution hive queries there is meta data look up
required
much especially when your table has large number of partitions and
all.

Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com
mailto:manupk...@gmail.com wrote:

 Greetings All !!!

 I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes,
in which 5
 are used for a fully distributed cluster, 1 for
pseudo-distributed  1 as
 management-node.

 Fully distributed cluster: HDFS, Mapreduce  Hbase cluster
 Pseudo distributed mode: All

 I had read about we can install Pig, hive  Sqoop on the client
node, no
 need to install it in cluster. What is the client node actually?
Can I use
 my management-node as a client?

 What is the best practice to install Pig, Hive,  Sqoop?
 For the fully distributed cluster do we need to install Pig,
Hive,  Sqoop
 in each nodes?

 Mysql is needed for Hive as a metastore and sqoop can import
mysql database
 to HDFS or hive or pig, so can we make use of mysql DB's residing on
 another node?

 --
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in http://www.opensourcetalk.co.in





--
Thanks  Regards

Manu S
SI Engineer - OpenSource  HPC
Wipro Infotech
Mob: +91 8861302855Skype: manuspkd
www.opensourcetalk.co.in http://www.opensourcetalk.co.in





--
Marcos Luis Ortíz Valmaseda
 Sr. Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://postgresql.uci.cu/blog/38




Re: SequenceFile split question

2012-03-15 Thread Mohit Anchlia
Thanks! That helps. I am reading small XML files from an external file system
and then writing them to the SequenceFile. I made it a standalone client,
thinking that MapReduce may not be the best way to do this type of writing. My
understanding was that MapReduce is best suited for processing data that is
already within HDFS. Is MapReduce also one of the options I should consider?

On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Mohit
  If you are using a stand alone client application to do the same
 definitely there is just one instance of the same running and you'd be
 writing the sequence file to one hdfs block at a time. Once it reaches hdfs
 block size the writing continues to next block, in the mean time the first
 block is replicated. If you are doing the same job distributed as map
 reduce you'd be writing to to n files at a time when n is the number of
 tasks in your map reduce job.
 AFAIK the data node where the blocks have to be placed is determined
 by hadoop it is not controlled by end user application. But if you are
 triggering the stand alone job on a particular data node and if it has
 space one replica would be stored in the same. Same applies in case of MR
 tasks as well.

 Regards
 Bejoy.K.S

 On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I have a client program that creates sequencefile, which essentially
 merges
  small files into a big file. I was wondering how is sequence file
 splitting
  the data accross nodes. When I start the sequence file is empty. Does it
  get split when it reaches the dfs.block size? If so then does it mean
 that
  I am always writing to just one node at a given point in time?
 
  If I start a new client writing a new sequence file then is there a way
 to
  select a different data node?
 



Re: SequenceFile split question

2012-03-15 Thread Bejoy Ks
Hi Mohit
 You are right. If your smaller XML files are in HDFS then MR would be
the best approach to combine them into a sequence file. It'd do the job
in parallel; a rough sketch of the idea is below.

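This is only a sketch of that approach, assuming the small files already sit
under an HDFS input directory (paths are made up, and for whole small files
you would swap in a whole-file input format; the default line-oriented input
is used here only to keep the driver self-contained). With zero reducers, each
map task writes its own sequence file, so the conversion runs in parallel:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ToSequenceFileJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "to-seqfile");
    job.setJarByClass(ToSequenceFileJob.class);

    // Map-only job: the default (identity) mapper passes the
    // (byte offset, line) records from the default text input straight
    // through, and SequenceFileOutputFormat writes them out as sequence
    // files, one per map task.
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path("/user/mohit/small-xml"));     // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/user/mohit/seqfile-out")); // hypothetical

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
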
Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 8:17 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Thanks! that helps. I am reading small xml files from external file system
 and then writing to the SequenceFile. I made it stand alone client thinking
 that mapreduce may not be the best way to do this type of writing. My
 understanding was that map reduce is best suited for processing data within
 HDFS. Is map reduce also one of the options I should consider?

 On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Hi Mohit
   If you are using a stand alone client application to do the same
  definitely there is just one instance of the same running and you'd be
  writing the sequence file to one hdfs block at a time. Once it reaches
 hdfs
  block size the writing continues to next block, in the mean time the
 first
  block is replicated. If you are doing the same job distributed as map
  reduce you'd be writing to to n files at a time when n is the number of
  tasks in your map reduce job.
  AFAIK the data node where the blocks have to be placed is determined
  by hadoop it is not controlled by end user application. But if you are
  triggering the stand alone job on a particular data node and if it has
  space one replica would be stored in the same. Same applies in case of MR
  tasks as well.
 
  Regards
  Bejoy.K.S
 
  On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   I have a client program that creates sequencefile, which essentially
  merges
   small files into a big file. I was wondering how is sequence file
  splitting
   the data accross nodes. When I start the sequence file is empty. Does
 it
   get split when it reaches the dfs.block size? If so then does it mean
  that
   I am always writing to just one node at a given point in time?
  
   If I start a new client writing a new sequence file then is there a way
  to
   select a different data node?
  
 



Issue when starting services on CDH3

2012-03-15 Thread Manish Bhoge
I have CDH3 installed in standalone mode and have installed all the Hadoop
components. When I start the services (namenode, secondary namenode, job
tracker, task tracker) from /usr/lib/hadoop/ with ./bin/start-all.sh, they
start gracefully. But when I start the same services from
/etc/init.d/hadoop-0.20-*, I am unable to start them. Why? I also want to
start Hue, which is in init.d, and I couldn't start that either. I suspect an
authentication issue here, because all the services in init.d run under the
root user and root group. Please suggest something, I am stuck here. I tried
Hive and it seems to be running fine.
Thanks
Manish.
Sent from my BlackBerry, pls excuse typo



Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?

2012-03-15 Thread Manu S
Thanks a lot all :-)
On Mar 15, 2012 7:03 PM, Marcos Ortiz mlor...@uci.cu wrote:



 On 03/15/2012 09:22 AM, Manu S wrote:

 Thanks a lot Bijoy, that makes sense :)

 Suppose if I have Mysql database in some other node(not in hadoop
cluster), can I import the tables using sqoop to my HDFS?

 Yes, this is the main purpose of Sqoop
 On the Cloudera site, you have the completed documentation for it

 Sqoop User Guide
 http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

 Sqoop installation
 https://ccp.cloudera.com/display/CDHDOC/Sqoop+Installation

 Sqoop for MySQL
 http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_mysql

 Sqoop site on GitHub
 http://github.com/cloudera/sqoop

 Cloudera blog related post to Sqoop
 http://www.cloudera.com/blog/category/sqoop/


 Best wishes




 On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Manu
  Please find my responses inline

 I had read about we can install Pig, hive  Sqoop on the client node,
no
 need to install it in cluster. What is the client node actually? Can I
use
 my management-node as a client?

 On larger clusters we have different node that is out of hadoop cluster
and
 these stay in there. So user programs would be triggered from this node.
 This is the node refereed to as client node/ edge node etc . For your
 cluster management node and client node can be the same

 What is the best practice to install Pig, Hive,  Sqoop?

 On a client node

 For the fully distributed cluster do we need to install Pig, Hive, 
Sqoop
 in each nodes?

 No, can be on a client node or on any of the nodes

 Mysql is needed for Hive as a metastore and sqoop can import mysql
database
 to HDFS or hive or pig, so can we make use of mysql DB's residing on
 another node?
 Regarding your first point, SQOOP import is for different purpose, to
get
 data from RDBNS into hdfs. But the meta stores is used by hive  in
framing
 the map reduce jobs corresponding to your hive query. Here SQOOP can't
help
 you much
 Recommend to have the metastore db of hive on the same node where hive
is
 installed as for execution hive queries there is meta data look up
required
 much especially when your table has large number of partitions and all.

 Regards
 Bejoy.K.S

 On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com wrote:

  Greetings All !!!
 
  I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes, in
which 5
  are used for a fully distributed cluster, 1 for pseudo-distributed 
1 as
  management-node.
 
  Fully distributed cluster: HDFS, Mapreduce  Hbase cluster
  Pseudo distributed mode: All
 
  I had read about we can install Pig, hive  Sqoop on the client node,
no
  need to install it in cluster. What is the client node actually? Can
I use
  my management-node as a client?
 
  What is the best practice to install Pig, Hive,  Sqoop?
  For the fully distributed cluster do we need to install Pig, Hive, 
Sqoop
  in each nodes?
 
  Mysql is needed for Hive as a metastore and sqoop can import mysql
database
  to HDFS or hive or pig, so can we make use of mysql DB's residing on
  another node?
 
  --
  Thanks  Regards
  
  Manu S
  SI Engineer - OpenSource  HPC
  Wipro Infotech
  Mob: +91 8861302855Skype: manuspkd
  www.opensourcetalk.co.in
 




 --
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in




 --
 Marcos Luis Ortíz Valmaseda
  Sr. Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://postgresql.uci.cu/blog/38





Re: Issue when starting services on CDH3

2012-03-15 Thread Manu S
Dear manish
Which daemons are not starting?

On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com wrote:

 I have CDH3 installed in standalone mode. I have install all hadoop
components. Now when I start services (namenode,secondary namenode,job
tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
./bin/start-all.sh. But when start the same servises from
/etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to start
Hue also which is in init.d that also I couldn't start. Here I suspect
authentication issue. Because all the services in init.d are under root
user and root group. Please suggest I am stuck here. I tried hive and it
seems it running fine.
 Thanks
 Manish.
 Sent from my BlackBerry, pls excuse typo



Re: Issue when starting services on CDH3

2012-03-15 Thread Manish Bhoge
Manu,
None of the services are coming up, including the namenode, secondary
namenode, tasktracker and jobtracker.

Sent from my BlackBerry, pls excuse typo

-Original Message-
From: Manu S manupk...@gmail.com
Date: Thu, 15 Mar 2012 21:31:34 
To: common-user@hadoop.apache.org; manishbh...@rocketmail.com
Subject: Re: Issue when starting services on CDH3

Dear manish
Which daemons are not starting?

On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com wrote:

 I have CDH3 installed in standalone mode. I have install all hadoop
components. Now when I start services (namenode,secondary namenode,job
tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
./bin/start-all.sh. But when start the same servises from
/etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to start
Hue also which is in init.d that also I couldn't start. Here I suspect
authentication issue. Because all the services in init.d are under root
user and root group. Please suggest I am stuck here. I tried hive and it
seems it running fine.
 Thanks
 Manish.
 Sent from my BlackBerry, pls excuse typo




Re: Issue when starting services on CDH3

2012-03-15 Thread Manu S
Did you check the service status?
Is it like dead, but the pid exists?

Did you check the ownership and permissions for dfs.name.dir, dfs.data.dir,
mapred.local.dir etc.?

The order for starting the daemons is:
1 namenode
2 datanode
3 jobtracker
4 tasktracker

Did you format the namenode before starting?
On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote:

 Dear manish
 Which daemons are not starting?

 On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com
 wrote:
 
  I have CDH3 installed in standalone mode. I have install all hadoop
 components. Now when I start services (namenode,secondary namenode,job
 tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
 ./bin/start-all.sh. But when start the same servises from
 /etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to start
 Hue also which is in init.d that also I couldn't start. Here I suspect
 authentication issue. Because all the services in init.d are under root
 user and root group. Please suggest I am stuck here. I tried hive and it
 seems it running fine.
  Thanks
  Manish.
  Sent from my BlackBerry, pls excuse typo
 



Re: Issue when starting services on CDH3

2012-03-15 Thread Michael Segel
Are you running the init.d scripts as root, and what is the order of the
services you want to start?


Sent from my iPhone

On Mar 15, 2012, at 11:22 AM, Manish Bhoge manishbh...@rocketmail.com wrote:

 Yes, I understand the order and I formatted the namenode before starting
 services. As I suspected, there may be an ownership and access issue; I am not
 able to nail down the issue exactly. I also have a question: why are there 2
 routes to start the services? When we have the start-all.sh script, why do we
 need to go to init.d to start the services??
 
 
 Thank you,
 Manish
 Sent from my BlackBerry, pls excuse typo
 
 -Original Message-
 From: Manu S manupk...@gmail.com
 Date: Thu, 15 Mar 2012 21:43:26 
 To: common-user@hadoop.apache.org; manishbh...@rocketmail.com
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: Issue when starting services on CDH3
 
 Did you check the service status?
 Is it like dead, but pid exist?
 
 Did you check the ownership and permissions for the
 dfs.name.dir,dfs.data.dir,mapped.local.dir etc ?
 
 The order for starting daemons are like this:
 1 namenode
 2 datanode
 3 jobtracker
 4 tasktracker
 
 Did you format the namenode before starting?
 On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote:
 
 Dear manish
 Which daemons are not starting?
 
 On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com
 wrote:
 
 I have CDH3 installed in standalone mode. I have install all hadoop
 components. Now when I start services (namenode,secondary namenode,job
 tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
 ./bin/start-all.sh. But when start the same servises from
 /etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to start
 Hue also which is in init.d that also I couldn't start. Here I suspect
 authentication issue. Because all the services in init.d are under root
 user and root group. Please suggest I am stuck here. I tried hive and it
 seems it running fine.
 Thanks
 Manish.
 Sent from my BlackBerry, pls excuse typo
 
 
 


Re: Capacity Scheduler APIs

2012-03-15 Thread hdev ml
Does anybody have an answer to this question?

Harshad

On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote:

 Hi all,

 are there any capacity scheduler apis that I can use?

 e.g. adding, removing queues, tuning properties on the fly and so on.

 Any help is appreciated.

 Thanks

 Harshad



Mapper Only Job, Without Input or Output Path

2012-03-15 Thread Deepak Nettem
Hi,

I have a use case: I have files lying on the local disk of every node in my
cluster. I want to write a mapper-only MapReduce job that reads the files off
the local disk on every machine, applies some transformation and writes
to HDFS.

Specifically,

1. The job shouldn't have any input/output paths, and should use null key-value pairs.
2. Mapper Only
3. I want to be able to control the number of Mappers, depending on the
size of my cluster.

What's the best way to do this? I would appreciate any example code.

Deepak


Re: Issue when starting services on CDH3

2012-03-15 Thread Manu S
Because on large clusters we have to run the namenode on a single node and
the datanodes on the other nodes, so we can start the namenode and jobtracker
on the master node and the datanode and tasktracker on the slave nodes.

For more clarity, you can check the service status after starting.

Verify these:
dfs.name.dir hdfs:hadoop drwx------
dfs.data.dir hdfs:hadoop drwx------

mapred.local.dir mapred:hadoop drwxr-xr-x

Please follow each step in this link:
https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster
 On Mar 15, 2012 9:52 PM, Manish Bhoge manishbh...@rocketmail.com wrote:

 Ys, I understand the order and I formatted namenode before starting
 services. As I suspect there may be ownership and an access issue. Not able
 to nail down issue exactly. I also have question why there are 2 routes to
 start services. When we have start-all.sh script then why need to go to
 init.d to start services??


 Thank you,
 Manish
 Sent from my BlackBerry, pls excuse typo

 -Original Message-
 From: Manu S manupk...@gmail.com
 Date: Thu, 15 Mar 2012 21:43:26
 To: common-user@hadoop.apache.org; manishbh...@rocketmail.com
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: Issue when starting services on CDH3

 Did you check the service status?
 Is it like dead, but pid exist?

 Did you check the ownership and permissions for the
 dfs.name.dir,dfs.data.dir,mapped.local.dir etc ?

 The order for starting daemons are like this:
 1 namenode
 2 datanode
 3 jobtracker
 4 tasktracker

 Did you format the namenode before starting?
 On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote:

  Dear manish
  Which daemons are not starting?
 
  On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com
  wrote:
  
   I have CDH3 installed in standalone mode. I have install all hadoop
  components. Now when I start services (namenode,secondary namenode,job
  tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
  ./bin/start-all.sh. But when start the same servises from
  /etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to
 start
  Hue also which is in init.d that also I couldn't start. Here I suspect
  authentication issue. Because all the services in init.d are under root
  user and root group. Please suggest I am stuck here. I tried hive and it
  seems it running fine.
   Thanks
   Manish.
   Sent from my BlackBerry, pls excuse typo
  
 




Re: Capacity Scheduler APIs

2012-03-15 Thread Shailesh
Hi Harshad,
have you looked into the file conf/capacity-scheduler.xml? you can assign
and change parameters like capacity of each queue, reclaim time and job
priorities. Is that what you're looking for?

Shailesh

On Thu, Mar 15, 2012 at 12:57 PM, hdev ml hde...@gmail.com wrote:

 Does anybody have an answer to this question?

 Harshad

 On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote:

  Hi all,
 
  are there any capacity scheduler apis that I can use?
 
  e.g. adding, removing queues, tuning properties on the fly and so on.
 
  Any help is appreciated.
 
  Thanks
 
  Harshad
 



Re: Capacity Scheduler APIs

2012-03-15 Thread hdev ml
Thanks for the email Shailesh.

I am looking for some Java API to manage queues.

I have already defined queues in the capacity-scheduler.xml and everything
works fine.

But my question is, can the same thing be done without restarting the
cluster or namenode? The only option I see is Java API, hence the question.

Please let me know.

Harshad

On Thu, Mar 15, 2012 at 10:33 AM, Shailesh shailesh.shai...@gmail.comwrote:

 Hi Harshad,
 have you looked into the file conf/capacity-scheduler.xml? you can assign
 and change parameters like capacity of each queue, reclaim time and job
 priorities. Is that what you're looking for?

 Shailesh

 On Thu, Mar 15, 2012 at 12:57 PM, hdev ml hde...@gmail.com wrote:

  Does anybody have an answer to this question?
 
  Harshad
 
  On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote:
 
   Hi all,
  
   are there any capacity scheduler apis that I can use?
  
   e.g. adding, removing queues, tuning properties on the fly and so on.
  
   Any help is appreciated.
  
   Thanks
  
   Harshad
  
 



Re: Capacity Scheduler APIs

2012-03-15 Thread Shailesh
Hi Harshad,

Have you looked into CapacitySchedulerConf.java class?
http://www.java2s.com/Open-Source/Java/Database-DBMS/hadoop-0.20.1/org/apache/hadoop/mapred/CapacitySchedulerConf.java.htm

I don't know whether it can be done without restarting the cluster or
namenode.

On Thu, Mar 15, 2012 at 2:03 PM, hdev ml hde...@gmail.com wrote:

 Thanks for the email Shailesh.

 I am looking for some Java API to manage queues.

 I have already defined queues in the capacity-scheduler.xml and everything
 works fine.

 But my question is, can the same thing be done without restarting the
 cluster or namenode? The only option I see is Java API, hence the question.

 Please let me know.

 Harshad

 On Thu, Mar 15, 2012 at 10:33 AM, Shailesh shailesh.shai...@gmail.com
 wrote:

  Hi Harshad,
  have you looked into the file conf/capacity-scheduler.xml? you can assign
  and change parameters like capacity of each queue, reclaim time and job
  priorities. Is that what you're looking for?
 
  Shailesh
 
  On Thu, Mar 15, 2012 at 12:57 PM, hdev ml hde...@gmail.com wrote:
 
   Does anybody have an answer to this question?
  
   Harshad
  
   On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote:
  
Hi all,
   
are there any capacity scheduler apis that I can use?
   
e.g. adding, removing queues, tuning properties on the fly and so on.
   
Any help is appreciated.
   
Thanks
   
Harshad
   
  
 



Re: Issue when starting services on CDH3

2012-03-15 Thread Suresh Srinivas
Guys, can you please take this up in CDH related mailing lists.

On Thu, Mar 15, 2012 at 10:01 AM, Manu S manupk...@gmail.com wrote:

 Because for large clusters we have to run namenode in a single node,
 datanode in another nodes
 So we can start namenode and jobtracker in master node and datanode n
 tasktracker in slave nodes

 For getting more clarity You can check the service status after starting

 Verify these:
 dfs.name.dir hdfs:hadoop drwx--
 dfs.data.dir hdfs:hadoop drwx--

 mapred.local.dir mapred:hadoop drwxr-xr-x

 Please follow each steps in this link
 https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster
  On Mar 15, 2012 9:52 PM, Manish Bhoge manishbh...@rocketmail.com
 wrote:

  Ys, I understand the order and I formatted namenode before starting
  services. As I suspect there may be ownership and an access issue. Not
 able
  to nail down issue exactly. I also have question why there are 2 routes
 to
  start services. When we have start-all.sh script then why need to go to
  init.d to start services??
 
 
  Thank you,
  Manish
  Sent from my BlackBerry, pls excuse typo
 
  -Original Message-
  From: Manu S manupk...@gmail.com
  Date: Thu, 15 Mar 2012 21:43:26
  To: common-user@hadoop.apache.org; manishbh...@rocketmail.com
  Reply-To: common-user@hadoop.apache.org
  Subject: Re: Issue when starting services on CDH3
 
  Did you check the service status?
  Is it like dead, but pid exist?
 
  Did you check the ownership and permissions for the
  dfs.name.dir,dfs.data.dir,mapped.local.dir etc ?
 
  The order for starting daemons are like this:
  1 namenode
  2 datanode
  3 jobtracker
  4 tasktracker
 
  Did you format the namenode before starting?
  On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote:
 
   Dear manish
   Which daemons are not starting?
  
   On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com
   wrote:
   
I have CDH3 installed in standalone mode. I have install all hadoop
   components. Now when I start services (namenode,secondary namenode,job
   tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
   ./bin/start-all.sh. But when start the same servises from
   /etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to
  start
   Hue also which is in init.d that also I couldn't start. Here I suspect
   authentication issue. Because all the services in init.d are under root
   user and root group. Please suggest I am stuck here. I tried hive and
 it
   seems it running fine.
Thanks
Manish.
Sent from my BlackBerry, pls excuse typo
   
  
 
 



Re: Capacity Scheduler APIs

2012-03-15 Thread Harsh J
To refresh your queues, you may do, as your MR admin user:

$ hadoop mradmin -refreshQueues

Am not sure if this covers CS config refreshes, but let us know if it does.
The above command is present in Apache Hadoop 1.x.

On Fri, Mar 16, 2012 at 12:08 AM, Shailesh shailesh.shai...@gmail.comwrote:

 Hi Harshad,

 Have you looked into CapacitySchedulerConf.java class?

 http://www.java2s.com/Open-Source/Java/Database-DBMS/hadoop-0.20.1/org/apache/hadoop/mapred/CapacitySchedulerConf.java.htm

 I don't know whether it can be done without restarting the cluster or
 namenode.

 On Thu, Mar 15, 2012 at 2:03 PM, hdev ml hde...@gmail.com wrote:

  Thanks for the email Shailesh.
 
  I am looking for some Java API to manage queues.
 
  I have already defined queues in the capacity-scheduler.xml and
 everything
  works fine.
 
  But my question is, can the same thing be done without restarting the
  cluster or namenode? The only option I see is Java API, hence the
 question.
 
  Please let me know.
 
  Harshad
 
  On Thu, Mar 15, 2012 at 10:33 AM, Shailesh shailesh.shai...@gmail.com
  wrote:
 
   Hi Harshad,
   have you looked into the file conf/capacity-scheduler.xml? you can
 assign
   and change parameters like capacity of each queue, reclaim time and job
   priorities. Is that what you're looking for?
  
   Shailesh
  
   On Thu, Mar 15, 2012 at 12:57 PM, hdev ml hde...@gmail.com wrote:
  
Does anybody have an answer to this question?
   
Harshad
   
On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote:
   
 Hi all,

 are there any capacity scheduler apis that I can use?

 e.g. adding, removing queues, tuning properties on the fly and so
 on.

 Any help is appreciated.

 Thanks

 Harshad

   
  
 




-- 
Harsh J


Re: Issue when starting services on CDH3

2012-03-15 Thread Harsh J
To add to Suresh's guideline since he may have missed providing a link, you
can visit the CDH users community at
https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user

On Fri, Mar 16, 2012 at 12:13 AM, Suresh Srinivas sur...@hortonworks.comwrote:

 Guys, can you please take this up in CDH related mailing lists.

 On Thu, Mar 15, 2012 at 10:01 AM, Manu S manupk...@gmail.com wrote:

  Because for large clusters we have to run namenode in a single node,
  datanode in another nodes
  So we can start namenode and jobtracker in master node and datanode n
  tasktracker in slave nodes
 
  For getting more clarity You can check the service status after starting
 
  Verify these:
  dfs.name.dir hdfs:hadoop drwx--
  dfs.data.dir hdfs:hadoop drwx--
 
  mapred.local.dir mapred:hadoop drwxr-xr-x
 
  Please follow each steps in this link
  https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster
   On Mar 15, 2012 9:52 PM, Manish Bhoge manishbh...@rocketmail.com
  wrote:
 
   Ys, I understand the order and I formatted namenode before starting
   services. As I suspect there may be ownership and an access issue. Not
  able
   to nail down issue exactly. I also have question why there are 2 routes
  to
   start services. When we have start-all.sh script then why need to go to
   init.d to start services??
  
  
   Thank you,
   Manish
   Sent from my BlackBerry, pls excuse typo
  
   -Original Message-
   From: Manu S manupk...@gmail.com
   Date: Thu, 15 Mar 2012 21:43:26
   To: common-user@hadoop.apache.org; manishbh...@rocketmail.com
   Reply-To: common-user@hadoop.apache.org
   Subject: Re: Issue when starting services on CDH3
  
   Did you check the service status?
   Is it like dead, but pid exist?
  
   Did you check the ownership and permissions for the
   dfs.name.dir,dfs.data.dir,mapped.local.dir etc ?
  
   The order for starting daemons are like this:
   1 namenode
   2 datanode
   3 jobtracker
   4 tasktracker
  
   Did you format the namenode before starting?
   On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote:
  
Dear manish
Which daemons are not starting?
   
On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com
wrote:

 I have CDH3 installed in standalone mode. I have install all hadoop
components. Now when I start services (namenode,secondary
 namenode,job
tracker,task tracker) I can start gracefully from /usr/lib/hadoop/
./bin/start-all.sh. But when start the same servises from
/etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to
   start
Hue also which is in init.d that also I couldn't start. Here I
 suspect
authentication issue. Because all the services in init.d are under
 root
user and root group. Please suggest I am stuck here. I tried hive and
  it
seems it running fine.
 Thanks
 Manish.
 Sent from my BlackBerry, pls excuse typo

   
  
  
 




-- 
Harsh J


YARN applications not running

2012-03-15 Thread Peter Naudus

Hello all,

When submitting an Hbase export job to YARN, I see it appearing on the web  
UI but for some reason, the job never starts; it constantly stays at 0%  
complete. I am using hadoop 0.23 and hbase 0.92 ( CDH4 beta 1 )


I see the NodeManagers connecting to the ResourceManager:

2012-03-15 19:36:10,585 INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:  
torb1pnb001.dataraker.net:46696 Node Transitioned from NEW to RUNNING
2012-03-15 19:36:16,633 INFO org.apache.hadoop.yarn.util.RackResolver:  
Resolved torb1pnb002.dataraker.net to /default-rack
2012-03-15 19:36:16,633 INFO  
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:  
NodeManager from node torb1pnb002.dataraker.net(cmPort: 35665 httpPort:  
) registered with capability: 1000, assigned nodeId  
torb1pnb002.dataraker.net:35665
2012-03-15 19:36:16,634 INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:  
torb1pnb002.dataraker.net:35665 Node Transitioned from NEW to RUNNING

[ etc... ]

and the job being submitted to the ResourceManager:

2012-03-15 19:40:29,248 INFO  
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated  
new applicationId: 1
2012-03-15 19:40:31,323 INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:  
application_1331840162147_0001 State change from NEW to SUBMITTED
2012-03-15 19:40:31,323 INFO  
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:  
Registering appattempt_1331840162147_0001_01
2012-03-15 19:40:31,323 INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:  
appattempt_1331840162147_0001_01 State change from NEW to SUBMITTED
2012-03-15 19:40:31,327 INFO  
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application  
with id 1 submitted by user hdfs with application_id [..snip..]
2012-03-15 19:40:31,329 INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs  
IP=10.192.16.64 OPERATION=Submit Application Request 
TARGET=ClientRMService  RESULT=SUCCESS   
APPID=application_1331840162147_0001
2012-03-15 19:40:31,333 INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler:  
Application Submission: application_1331840162147_0001 from hdfs,  
currently active: 1
2012-03-15 19:40:31,336 INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:  
appattempt_1331840162147_0001_01 State change from SUBMITTED to  
SCHEDULED
2012-03-15 19:40:31,336 INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:  
application_1331840162147_0001 State change from SUBMITTED to ACCEPTED


but after the NodeManager starts, the log never indicates any requests  
from the ResourceManager


2012-03-15 19:36:16,604 INFO  
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connected  
to ResourceManager at torb1pna001:8025
2012-03-15 19:36:16,645 INFO  
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:  
Registered with ResourceManager as torb1pnb002.dataraker.net:35665 with  
total resource of memory: 1000
2012-03-15 19:36:16,645 INFO  
org.apache.hadoop.yarn.service.AbstractService:  
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is  
started.
2012-03-15 19:36:16,646 INFO  
org.apache.hadoop.yarn.service.AbstractService:  
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.

[ end of log ]

I am seeing strange errors in Zookeeper when the job is submitted:

2012-03-15 16:58:00,216 - INFO   
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -  
Accepted socket connection from /127.0.0.1:33262
2012-03-15 16:58:00,219 - INFO   
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client  
attempting to establish new session at /127.0.0.1:33262
2012-03-15 16:58:00,229 - INFO  [CommitProcessor:0:ZooKeeperServer@604] -  
Established session 0x35d53d539f0071 with negotiated timeout 4 for  
client /127.0.0.1:33262
2012-03-15 16:58:48,884 - WARN   
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end  
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid  
0x35d53d539f0071, likely client has closed socket
at  
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at  
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)

at java.lang.Thread.run(Thread.java:662)
2012-03-15 16:58:48,885 - INFO   
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed  
socket connection for client /127.0.0.1:33262 which had sessionid  
0x35d53d539f0071
2012-03-15 17:02:59,968 - INFO   
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -  
Accepted socket connection from /127.0.0.1:59652
2012-03-15 17:02:59,971 - INFO   
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client  
attempting to 

Re: EOFException

2012-03-15 Thread Gopal

On 03/15/2012 03:06 PM, Mohit Anchlia wrote:

When I start a job to read data from HDFS I start getting these errors.
Does anyone know what this means and how to resolve it?

2012-03-15 10:41:31,402 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.204:50010java.io.EOFException
2012-03-15 10:41:31,402 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_-6402969611996946639_11837
2012-03-15 10:41:31,403 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.204:50010
2012-03-15 10:41:31,406 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.198:50010java.io.EOFException
2012-03-15 10:41:31,406 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_-5442664108986165368_11838
2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.197:50010java.io.EOFException
2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_-3373089616877234160_11838
2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.198:50010
2012-03-15 10:41:31,409 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.197:50010
2012-03-15 10:41:31,410 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.204:50010java.io.EOFException
2012-03-15 10:41:31,410 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_4481292025401332278_11838
2012-03-15 10:41:31,411 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.204:50010
2012-03-15 10:41:31,412 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.200:50010java.io.EOFException
2012-03-15 10:41:31,412 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_-5326771177080888701_11838
2012-03-15 10:41:31,413 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.200:50010
2012-03-15 10:41:31,414 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.197:50010java.io.EOFException
2012-03-15 10:41:31,414 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_-8073750683705518772_11839
2012-03-15 10:41:31,415 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.197:50010
2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.199:50010java.io.EOFException
2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 164.28.62.198:50010java.io.EOFException
2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_441003866688859169_11838
2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Abandoning block blk_-466858474055876377_11839
2012-03-15 10:41:31,417 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
Excluding datanode 164.28.62.198:50010
2012-03-15 10:41:31,417 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient -
   

Try shutting down and  restarting hbase.


Re: EOFException

2012-03-15 Thread Mohit Anchlia
This is actually just a Hadoop job over HDFS. I am assuming you also know why
this is erroring out?

On Thu, Mar 15, 2012 at 1:02 PM, Gopal absoft...@gmail.com wrote:

  On 03/15/2012 03:06 PM, Mohit Anchlia wrote:

 When I start a job to read data from HDFS I start getting these errors.
 Does anyone know what this means and how to resolve it?

 2012-03-15 10:41:31,402 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.204:50010java.io.**
 EOFException
 2012-03-15 10:41:31,402 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_-6402969611996946639_11837
 2012-03-15 10:41:31,403 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.204:50010
 2012-03-15 10:41:31,406 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.198:50010java.io.**
 EOFException
 2012-03-15 10:41:31,406 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_-5442664108986165368_11838
 2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.197:50010java.io.**
 EOFException
 2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_-3373089616877234160_11838
 2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.198:50010
 2012-03-15 10:41:31,409 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.197:50010
 2012-03-15 10:41:31,410 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.204:50010java.io.**
 EOFException
 2012-03-15 10:41:31,410 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_4481292025401332278_11838
 2012-03-15 10:41:31,411 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.204:50010
 2012-03-15 10:41:31,412 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.200:50010java.io.**
 EOFException
 2012-03-15 10:41:31,412 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_-5326771177080888701_11838
 2012-03-15 10:41:31,413 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.200:50010
 2012-03-15 10:41:31,414 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.197:50010java.io.**
 EOFException
 2012-03-15 10:41:31,414 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_-8073750683705518772_11839
 2012-03-15 10:41:31,415 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.197:50010
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.199:50010java.io.**
 EOFException
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Exception in createBlockOutputStream 164.28.62.198:50010java.io.**
 EOFException
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_441003866688859169_11838
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Abandoning block blk_-466858474055876377_11839
 2012-03-15 10:41:31,417 [Thread-5] INFO  org.apache.hadoop.hdfs.**DFSClient
 -
 Excluding datanode 164.28.62.198:50010
 2012-03-15 10:41:31,417 [Thread-5] WARN  org.apache.hadoop.hdfs.**DFSClient
 -


 Try shutting down and  restarting hbase.



Re: Capacity Scheduler APIs

2012-03-15 Thread hdev ml
Thanks Shailesh/Harsh,

I will try the hadoop command first and then the internal code.

Thanks again.

Harshad.

On Thu, Mar 15, 2012 at 12:06 PM, Harsh J ha...@cloudera.com wrote:

 To refresh your queues, you may do, as your MR admin user:

 $ hadoop mradmin -refreshQueues

 Am not sure if this covers CS config refreshes, but let us know if it does.
 The above command is present in Apache Hadoop 1.x.

 On Fri, Mar 16, 2012 at 12:08 AM, Shailesh shailesh.shai...@gmail.com
 wrote:

  Hi Harshad,
 
  Have you looked into CapacitySchedulerConf.java class?
 
 
 http://www.java2s.com/Open-Source/Java/Database-DBMS/hadoop-0.20.1/org/apache/hadoop/mapred/CapacitySchedulerConf.java.htm
 
  I don't know whether it can be done without restarting the cluster or
  namenode.
 
  On Thu, Mar 15, 2012 at 2:03 PM, hdev ml hde...@gmail.com wrote:
 
   Thanks for the email Shailesh.
  
   I am looking for some Java API to manage queues.
  
   I have already defined queues in the capacity-scheduler.xml and
  everything
   works fine.
  
   But my question is, can the same thing be done without restarting the
   cluster or namenode? The only option I see is Java API, hence the
  question.
  
   Please let me know.
  
   Harshad
  
   On Thu, Mar 15, 2012 at 10:33 AM, Shailesh shailesh.shai...@gmail.com
   wrote:
  
Hi Harshad,
have you looked into the file conf/capacity-scheduler.xml? you can
  assign
and change parameters like capacity of each queue, reclaim time and
 job
priorities. Is that what you're looking for?
   
Shailesh
   
On Thu, Mar 15, 2012 at 12:57 PM, hdev ml hde...@gmail.com wrote:
   
 Does anybody have an answer to this question?

 Harshad

 On Wed, Mar 14, 2012 at 1:51 PM, hdev ml hde...@gmail.com wrote:

  Hi all,
 
  are there any capacity scheduler apis that I can use?
 
  e.g. adding, removing queues, tuning properties on the fly and so
  on.
 
  Any help is appreciated.
 
  Thanks
 
  Harshad
 

   
  
 



 --
 Harsh J



Suggestion for InputSplit and InputFormat - Split every line.

2012-03-15 Thread Deepak Nettem
Hi,

I have this use case - I need to spawn as many mappers as the number of
lines in a file in HDFS. This file isn't big (only 10-50 lines). Actually
each line represents the path of another data source that the Mappers will
work on. So each mapper will read 1 line, (the map() method will need to be
called only once), and work on the data source.

What's the best way to construct InputSplit, InputFormat and RecordReader
to achieve this? I would appreciate any example code :)

Best,
Deepak


Re: Suggestion for InputSplit and InputFormat - Split every line.

2012-03-15 Thread anil gupta
Have a look at the NLineInputFormat class in Hadoop. It is built to split the
input on the basis of the number of lines; a rough sketch is below.

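A rough sketch of the wiring, using the old (org.apache.hadoop.mapred) API
that ships with 0.20/CDH3, where the lines-per-split setting is the
mapred.line.input.format.linespermap property (the paths and the body of the
map method are made-up placeholders). With one line per split, each map() call
receives exactly one line, i.e. one data-source path:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneLinePerMapperJob {
  // Each map() call gets one line of the listing file; the line is the path
  // of the external data source this mapper should fetch and transform.
  public static class PathMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      String dataSourcePath = line.toString().trim();
      // ... open and process the data source here (omitted) ...
      out.collect(new Text(dataSourcePath), NullWritable.get());
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(OneLinePerMapperJob.class);
    conf.setJobName("one-line-per-mapper");
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1); // 1 line per map task
    conf.setMapperClass(PathMapper.class);
    conf.setNumReduceTasks(0);                              // map-only job
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(conf, new Path("/user/deepak/source-list.txt")); // hypothetical
    FileOutputFormat.setOutputPath(conf, new Path("/user/deepak/out"));            // hypothetical
    JobClient.runJob(conf);
  }
}
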
On Thu, Mar 15, 2012 at 6:13 PM, Deepak Nettem deepaknet...@gmail.comwrote:

 Hi,

 I have this use case - I need to spawn as many mappers as the number of
 lines in a file in HDFS. This file isn't big (only 10-50 lines). Actually
 each line represents the path of another data source that the Mappers will
 work on. So each mapper will read 1 line, (the map() method will need to be
 called only once), and work on the data source.

 What's the best way to construct InputSplit, InputFormat and RecordReader
 to achieve this? I would appreciate any example code :)

 Best,
 Deepak




-- 
Thanks & Regards,
Anil Gupta