Re: Getting job progress in java application

2012-04-30 Thread Ondřej Klimpera

Thanks a lot, I checked the docs and the submitJob() method did the job.

Two more questions, please :)

[1] My app is running on Hadoop 0.20.203. If I upgrade the libraries to 
1.0.x, will the old API still work, or is it necessary to rewrite the map() and 
reduce() functions for the new API?


[2] Does the new API support MultipleOutputs?

Thanks again.



On 04/30/2012 12:32 AM, Bill Graham wrote:

Take a look at the JobClient API. You can use that to get the current
progress of a running job.
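Roughly something like this (an untested sketch; check the 0.20 JobClient and
RunningJob javadocs for the exact signatures, and MyJob is just a placeholder
for your own job class):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// inside a method that throws IOException and InterruptedException
JobConf conf = new JobConf(MyJob.class);        // your existing job setup
JobClient client = new JobClient(conf);
RunningJob running = client.submitJob(conf);    // returns immediately, unlike runJob()

while (!running.isComplete()) {
    // mapProgress()/reduceProgress() return a float between 0.0 and 1.0
    System.out.printf("map %.0f%%  reduce %.0f%%%n",
        running.mapProgress() * 100, running.reduceProgress() * 100);
    Thread.sleep(5000);
}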

On Sunday, April 29, 2012, Ondřej Klimpera wrote:


Hello, I'd like to ask what the preferred way is to get a running job's
progress from the Java application that submitted it.

I'm using Hadoop 0.20.203 and tried the job.end.notification.url property,
which works well, but as the property name says, it only sends job-end
notifications.

What I need is to get updates on map() and reduce() progress.

Please advise how to do this.

Thanks.
Ondrej Klimpera






Can't construct instance of class org.apache.hadoop.conf.Configuration

2012-04-30 Thread Ryan Cole
Hello,

I'm trying to run an application, written in C++, that uses libhdfs. I have
compiled the code and get an error when I attempt to run the application.
The error that I am getting is as follows: Can't construct instance of
class org.apache.hadoop.conf.Configuration.

Initially, I was receiving an error saying that CLASSPATH was not set. That
was easy, so I set CLASSPATH to include the following three directories, in
this order:


   1. $HADOOP_HOME
   2. $HADOOP_HOME/lib
   3. $HADOOP_HOME/conf

The CLASSPATH not set error went away, and now I receive the error about
the Configuration class. I'm assuming that I do not have something on the
path that I need to, but everything I have read says to simply include
these three directories.

Does anybody have any idea what I might be missing? Full exception pasted
below.

Thanks,
Ryan

Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/hadoop/conf/Configuration
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.conf.Configuration
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 Can't construct instance of class org.apache.hadoop.conf.Configuration
 node: /home/ryan/.node-gyp/0.7.8/src/node_object_wrap.h:61: void
 node::ObjectWrap::Wrap(v8::Handle<v8::Object>): Assertion
 `handle_.IsEmpty()' failed.
 Aborted (core dumped)


Re: Can't construct instance of class org.apache.hadoop.conf.Configuration

2012-04-30 Thread Brock Noland
Hi,

I would try this:

export CLASSPATH=$(hadoop classpath)

Brock

On Mon, Apr 30, 2012 at 10:15 AM, Ryan Cole r...@rycole.com wrote:
 Hello,

 I'm trying to run an application, written in C++, that uses libhdfs. I have
 compiled the code and get an error when I attempt to run the application.
 The error that I am getting is as follows: Can't construct instance of
 class org.apache.hadoop.conf.Configuration.

 Initially, I was receiving an error saying that CLASSPATH was not set. That
 was easy, so I set CLASSPATH to include the following three directories, in
 this order:


   1. $HADOOP_HOME
   2. $HADOOP_HOME/lib
   3. $HADOOP_HOME/conf

 The CLASSPATH not set error went away, and now I receive the error about
 the Configuration class. I'm assuming that I do not have something on the
 path that I need to, but everything I have read says to simply include
 these three directories.

 Does anybody have any idea what I might be missing? Full exception pasted
 below.

 Thanks,
 Ryan

 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/hadoop/conf/Configuration
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.conf.Configuration
         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 Can't construct instance of class org.apache.hadoop.conf.Configuration
 node: /home/ryan/.node-gyp/0.7.8/src/node_object_wrap.h:61: void
 node::ObjectWrap::Wrap(v8::Handle<v8::Object>): Assertion
 `handle_.IsEmpty()' failed.
 Aborted (core dumped)



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/


Re: Can't construct instance of class org.apache.hadoop.conf.Configuration

2012-04-30 Thread Ryan Cole
Brock,

Ah, thanks. I did not realize that you could do that. That sets the correct
CLASSPATH! Thanks.

Ryan

On Mon, Apr 30, 2012 at 10:22 AM, Brock Noland br...@cloudera.com wrote:

 Hi,

 I would try this:

 export CLASSPATH=$(hadoop classpath)

 Brock

 On Mon, Apr 30, 2012 at 10:15 AM, Ryan Cole r...@rycole.com wrote:
  Hello,
 
  I'm trying to run an application, written in C++, that uses libhdfs. I
 have
  compiled the code and get an error when I attempt to run the application.
  The error that I am getting is as follows: Can't construct instance of
  class org.apache.hadoop.conf.Configuration.
 
  Initially, I was receiving an error saying that CLASSPATH was not set.
 That
  was easy, so I set CLASSPATH to include the following three directories,
 in
  this order:
 
 
1. $HADOOP_HOME
2. $HADOOP_HOME/lib
3. $HADOOP_HOME/conf
 
  The CLASSPATH not set error went away, and now I receive the error about
  the Configuration class. I'm assuming that I do not have something on the
  path that I need to, but everything I have read says to simply include
  these three directories.
 
  Does anybody have any idea what I might be missing? Full exception pasted
  below.
 
  Thanks,
  Ryan
 
  Exception in thread "main" java.lang.NoClassDefFoundError:
  org/apache/hadoop/conf/Configuration
  Caused by: java.lang.ClassNotFoundException:
  org.apache.hadoop.conf.Configuration
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
  Can't construct instance of class org.apache.hadoop.conf.Configuration
  node: /home/ryan/.node-gyp/0.7.8/src/node_object_wrap.h:61: void
  node::ObjectWrap::Wrap(v8::Handle<v8::Object>): Assertion
  `handle_.IsEmpty()' failed.
  Aborted (core dumped)



 --
 Apache MRUnit - Unit testing MapReduce -
 http://incubator.apache.org/mrunit/



Re: KMeans clustering on Hadoop infrastructure

2012-04-30 Thread Robert Evans
You are likely going to get more help from talking to the Mahout mailing list.

https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists,+IRC+and+Archives
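
That said, the usual pattern for your two questions is roughly the following
(an untested sketch; the vector classes are from org.apache.mahout.math and the
paths are just examples). Read each x,y line, wrap it in a DenseVector, and
append it to a SequenceFile as Text/VectorWritable pairs:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path txtIn = new Path("/user/lukas/points.txt");             // example input
Path seqOut = new Path("/user/lukas/points-seq/part-00000"); // example output

SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, seqOut, Text.class, VectorWritable.class);
BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(txtIn)));
VectorWritable value = new VectorWritable();
String line;
long key = 0;
while ((line = reader.readLine()) != null) {
    String[] parts = line.split(",");                        // one "x,y" pair per line
    double[] point = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
        point[i] = Double.parseDouble(parts[i]);
    }
    value.set(new DenseVector(point));
    writer.append(new Text(Long.toString(key++)), value);    // key content is arbitrary
}
reader.close();
writer.close();

As far as I recall, KMeansDriver can then be pointed at the directory holding
that sequence file as its input path, the same way the chapter 7.3 example is
pointed at its synthetic data. But again, the Mahout list is the right place to
confirm.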

--Bobby Evans

On 4/28/12 7:45 AM, Lukáš Kryške lu...@hotmail.cz wrote:






Hello,

I am successfully running the K-Means clustering sample from the 'Mahout In Action' 
book (example in Chapter 7.3) in my Hadoop environment. Now I need to extend the 
program to take the vectors from a file located in my HDFS. I need to process 
clustering of millions or billions of vectors which are represented by 
comma-separated values in a .txt file in HDFS. The data are stored in this pattern:

x1,y1
x2,y2
...
xn,yn

As I understood from the book, I need to transform my .txt file with vectors 
into Hadoop's SequenceFile first. How do I do that most efficiently? And how do 
I tell KMeansDriver that the input path contains a SequenceFile with vectors?

Thanks for help.

Best Regards,
Lukas Kryske




Re: Node-wide Combiner

2012-04-30 Thread Robert Evans
Do you mean that when multiple map tasks run on the same node, there would be a 
combiner that runs across all of their output?  There is nothing for that 
right now.  It seems like it could be somewhat difficult to get right given the 
current architecture.

--Bobby Evans


On 4/27/12 11:13 PM, Superymk superymk...@hotmail.com wrote:

Hi all,

I am a newbie to Hadoop and I like the system. I have one question: is
there a node-wide combiner or something similar in Hadoop? I think it
could further reduce the number of intermediate results. Any hints?

Thanks a lot!

Superymk



Re: EMR Hadoop

2012-04-30 Thread Arun C Murthy

On Apr 30, 2012, at 10:27 AM, Jay Vyas wrote:

 Hi guys :
 
 1) Does anybody know if there is a VM out there which runs EMR hadoop ?  I
 would like to have a
 local vm for dev purposes that mirrored the EMR hadoop instances.
 
 2) How does EMR's hadoop differ from apache hadoop and Cloudera's hadoop ?

EMR runs Apache Hadoop 0.20.205 which is very, very close to Apache Hadoop 
1.0.x.

Arun

 
 -- 
 Jay Vyas
 MMSB/UCHC

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




EOFException

2012-04-30 Thread Keith Thompson
I have been running several MapReduce jobs on some input text files. They
were working fine earlier and then I suddenly started getting EOFException
every time. Even the jobs that ran fine before (on the exact same input
files) aren't running now. I am a bit perplexed as to what is causing this
error. Here is the error:

12/04/30 12:55:55 INFO mapred.JobClient: Task Id :
attempt_201202240659_6328_m_01_1, Status : FAILED
java.lang.RuntimeException: java.io.EOFException
at
org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:128)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:967)
at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:30)
at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:83)
at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1253)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at com.xerox.twitter.bin.UserTime.readFields(UserTime.java:31)
at
org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:122)

Since the compare function seems to be involved, here is my custom key
class. Note: I did not include year in the key because all keys have the
same year.

public class UserTime implements WritableComparable<UserTime> {

    int id, month, day, year, hour, min, sec;

    public UserTime() {
    }

    public UserTime(int u, int mon, int d, int y, int h, int m, int s) {
        id = u;
        month = mon;
        day = d;
        year = y;
        hour = h;
        min = m;
        sec = s;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        month = in.readInt();
        day = in.readInt();
        year = in.readInt();
        hour = in.readInt();
        min = in.readInt();
        sec = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.write(id);
        out.write(month);
        out.write(day);
        out.write(year);
        out.write(hour);
        out.write(min);
        out.write(sec);
    }

    @Override
    public int compareTo(UserTime that) {
        if (compareUser(that) == 0)
            return (compareTime(that));
        else if (compareUser(that) == 1)
            return 1;
        else
            return -1;
    }

    private int compareUser(UserTime that) {
        if (id > that.id)
            return 1;
        else if (id == that.id)
            return 0;
        else
            return -1;
    }

    // assumes all are from the same year
    private int compareTime(UserTime that) {
        if (month > that.month ||
            (month == that.month && day > that.day) ||
            (month == that.month && day == that.day && hour > that.hour) ||
            (month == that.month && day == that.day && hour == that.hour && min > that.min) ||
            (month == that.month && day == that.day && hour == that.hour && min == that.min && sec > that.sec))
            return 1;
        else if (month == that.month && day == that.day && hour == that.hour && min == that.min && sec == that.sec)
            return 0;
        else
            return -1;
    }

    public String toString() {
        String h, m, s;
        if (hour < 10)
            h = "0" + hour;
        else
            h = Integer.toString(hour);
        if (min < 10)
            m = "0" + min;
        else
            m = Integer.toString(hour);
        if (sec < 10)
            s = "0" + min;
        else
            s = Integer.toString(hour);
        return (id + "\t" + month + "/" + day + "/" + year + "\t" + h + ":" + m + ":" + s);
    }
}

Thanks for any help.

Regards,
Keith


Weird error starting up pseudo-dist cluster.

2012-04-30 Thread Keith Wiley
Here's an error I've never seen before.  I rebooted my machine sometime last 
week, so obviously when I tried to run a hadoop job this morning, the first 
thing I was quickly reminded of was that the pseudo-distributed cluster wasn't 
running.  I started it up only to watch the job tracker appear in the browser 
briefly and then go away (typical error complaining that the port was closed, 
as if the jobtracker is gone).  The namenode, interestingly, never came up 
during this time.  I tried stopping and starting it all a few times, but to no 
avail.

I inspected the logs and saw this:

java.io.IOException: Missing directory /tmp/hadoop-keithw/dfs/name

Sure enough, it isn't there.  I'm not familiar with this directory, so I can't 
say whether it was ever there before, but presumably it was.

Now, I assume I could get around this by formatting a new namenode, but then I 
would have to copy my data back into HDFS from scratch.

So, two questions:

(1) Any idea what the heck is going on here, how this happened, what it means?

(2) Is there any way to recover without starting over from scratch?

Thanks.


Keith Wiley   kwi...@keithwiley.com   keithwiley.com   music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




Re: Weird error starting up pseudo-dist cluster.

2012-04-30 Thread Andrzej Bialecki

On 30/04/2012 19:48, Keith Wiley wrote:

Here's an error I've never seen before.  I rebooted my machine sometime last week, so 
obviously when I tried to run a hadoop job this morning, the first thing I was quickly 
reminded of was that the pseudo-distributed cluster wasn't running.  I started it up only 
to watch the job tracker appear in the browser briefly and then go away (typical error 
complaining that the port was closed, as if the jobtracker is gone).  The namenode, 
interestingly, never came up during this time.  I tried stopping and starting 
all a few times but to no avail.

I inspected the logs and saw this:

java.io.IOException: Missing directory /tmp/hadoop-keithw/dfs/name

Sure enough, it isn't there.  I'm not familiar with this directory, so I can't 
say whether it was ever there before, but presumably it was.

Now, I assume I could get around this by formatting a new namenode, but then I 
would have to copy my data back into HDFS from scratch.

So, two questions:

(1) Any idea what the heck is going on here, how this happened, what it means?


The default hdfs config puts the namenode data in /tmp. This may be ok 
for casual testing, but in all other situations it's the worst location 
imaginable - for example, linux cleans this directory on reboot, and I 
think that's what happened here. Your HDFS data is gone to a better world...





(2) Is there any way to recover without starting over from scratch?


Regretfully, no. The lesson is: don't put precious files in /tmp.
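
For a pseudo-distributed setup, point the storage directories somewhere
permanent instead, e.g. in conf/hdfs-site.xml (the paths below are only an
example for a home-directory layout; setting hadoop.tmp.dir in core-site.xml
also works, since the dfs.* defaults are derived from it):

<property>
  <name>dfs.name.dir</name>
  <value>/home/keithw/hadoop-data/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/keithw/hadoop-data/dfs/data</value>
</property>

After changing dfs.name.dir you will still need to format the namenode once and
reload your data.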


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Weird error starting up pseudo-dist cluster.

2012-04-30 Thread Keith Wiley
On Apr 30, 2012, at 11:10 , Andrzej Bialecki wrote:

 On 30/04/2012 19:48, Keith Wiley wrote:
 
 (1) Any idea what the heck is going on here, how this happened, what it 
 means?
 
 The default hdfs config puts the namenode data in /tmp. This may be ok for 
 casual testing, but in all other situations it's the worst location 
 imaginable - for example, linux cleans this directory on reboot, and I think 
 that's what happened here. Your HDFS data is gone to a better world...
 
 
 (2) Is there any way to recover without starting over from scratch?
 
 Regretfully, no. The lesson is: don't put precious files in /tmp.


Ah, okay. So, when setting up a single-machine, pseudo-distributed cluster, 
what is a better way to do it?  Where would one put the temp directories in 
order to gain improved robustness of the Hadoop system?  Is this the sort of 
thing to put in a home directory?  I never really conceptualized it that way; I 
always thought HDFS and Hadoop in general were sort of system-level concepts.  
This is a single-user machine and I have full root/admin control over it, so 
it's not a permissions issue; I'm just asking, at a philosophical level, how to 
set up a pseudo-dist cluster in the most effective way.

Thanks.



Keith Wiley   kwi...@keithwiley.com   keithwiley.com   music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




reducer not seeing external jars

2012-04-30 Thread Keith Wiley
I'm trying to use -libjars to load an external jar along with the job jar, but 
the reducer still fails with a ClassNotFoundException against a class from the 
external jar (JFreeChart).  I'm not really sure how to approach this.  It 
either works or it doesn't...and so far it doesn't.

Can I make the mapper or reducer dump the class path so I can see what it 
thinks it has access to?
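Something like this in the task's configure()/setup() is what I have in mind
(untested; output would land in the task's stderr log):

System.err.println("java.class.path = " + System.getProperty("java.class.path"));
ClassLoader cl = Thread.currentThread().getContextClassLoader();
if (cl instanceof java.net.URLClassLoader) {
    // -libjars entries should show up here if they made it to the task
    for (java.net.URL url : ((java.net.URLClassLoader) cl).getURLs()) {
        System.err.println("task classloader sees: " + url);
    }
}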

Aside from exploring the issue, like investigating the classpath, etc., why 
might -libjars not work as expected in the first place?

Thanks.


Keith Wiley   kwi...@keithwiley.com   keithwiley.com   music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
   --  Keith Wiley




Compressing map only output

2012-04-30 Thread Mohit Anchlia
Is there a way, for map-only jobs, to compress the map output that gets
stored on HDFS as part-m-* files? In Pig I used:

set output.compression.enabled true;

set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

Would these work for plain MapReduce jobs as well?


Re: Compressing map only output

2012-04-30 Thread Prashant Kommireddi
Yes. These are hadoop properties - using set is just a way for Pig to set
those properties in your job conf.
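
For a plain MapReduce job the equivalent, using the new-API helpers, is roughly
(untested sketch; MyJob is a placeholder, and SnappyCodec assumes your Hadoop
build ships the Snappy native libraries):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Job job = new Job(conf, "map-only with compressed output");
job.setJarByClass(MyJob.class);                      // placeholder job class
job.setNumReduceTasks(0);                            // map-only: part-m-* output files
FileOutputFormat.setCompressOutput(job, true);       // mapred.output.compress
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);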


On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Is there a way to compress map only jobs to compress map output that gets
 stored on hdfs as part-m-* files? In pig I used :

 Would these work form plain map reduce jobs as well?


 set output.compression.enabled true;

 set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;



Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
Thanks! When I tried to search for this property I couldn't find it. Is
there a page that has a complete list of properties and their usage?

On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 Yes. These are hadoop properties - using set is just a way for Pig to set
 those properties in your job conf.


 On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Is there a way to compress map only jobs to compress map output that gets
  stored on hdfs as part-m-* files? In pig I used :
 
  Would these work form plain map reduce jobs as well?
 
 
  set output.compression.enabled true;
 
  set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
 



adding or restarting a data node in a hadoop cluster

2012-04-30 Thread sumadhur
 
I am on Hadoop 0.20.

To add a data node to a cluster, if we do not use the include/exclude/slaves 
files, do we need to do anything other than configuring hdfs-site.xml to 
point to the name node and mapred-site.xml to point to the job tracker?

For example, should the job tracker and name node always be restarted?

On a related note, if we restart a data node (that has some blocks on it) and 
the data node now has a new IP address, should we restart the namenode/job 
tracker for HDFS and MapReduce to function correctly?
Would the blocks on the restarted data node be detected, or would HDFS think 
that these blocks were lost and start replicating them?
 
Thanks,
Sumadhur

Re: adding or restarting a data node in a hadoop cluster

2012-04-30 Thread Harsh J
Sumadhur,

(Inline)

On Tue, May 1, 2012 at 8:28 AM, sumadhur sumadhur_i...@yahoo.com wrote:

 I am on hadoop 0.20.

 To add a data node to a cluster, if we do not use the include/exclude/slaves 
 files, do we need to  do anything other than configuring the hdfs-site.xml to 
 point to name node and the mapred-site.xml to point to job tracker?

 For example, should the job tracker and name node be restarted always?

Just booting up the DN service with the right config and a configured
network for proper communication should suffice.

In case you're using rack-awareness, ensure you update the
rack-awareness script for your new node and refresh the NN before you
start your DN.

A restart isn't required for adding new nodes to the cluster.

 On a related note, if we restart a data node(that has some blocks on it) and 
 the data node now has new IP address, Should we restart namenode/job tracker 
 for hdfs and map-reduce to function correctly?
 Would the blocks on the restarted data node be detected or would hdfs think 
 that these blocks were lost and start replicating them?

Stopping, changing the IP/Hostname cleanly and restarting the DN back
up should not cause any block movement.

-- 
Harsh J


RE: adding or restarting a data node in a hadoop cluster

2012-04-30 Thread Amith D K
Hi sumadhur,

As you mentioned, configuring the NN and JT IPs would be enough.

I am not able to understand how, on a DN restart, its IP gets changed.


From: sumadhur [sumadhur_i...@yahoo.com]
Sent: Tuesday, May 01, 2012 10:58 AM
To: common-user@hadoop.apache.org
Subject: adding or restarting a data node in a hadoop cluster

I am on hadoop 0.20.

To add a data node to a cluster, if we do not use the include/exclude/slaves 
files, do we need to  do anything other than configuring the hdfs-site.xml to 
point to name node and the mapred-site.xml to point to job tracker?

For example, should the job tracker and name node be restarted always?

On a related note, if we restart a data node(that has some blocks on it) and 
the data node now has new IP address, Should we restart namenode/job tracker 
for hdfs and map-reduce to function correctly?
Would the blocks on the restarted data node be detected or would hdfs think 
that these blocks were lost and start replicating them?

Thanks,
Sumadhur


Re: Compressing map only output

2012-04-30 Thread Harsh J
Hey Mohit,

Most of what you need to know for jobs is available at
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

A more complete, mostly unseparated list of config params is also
available at: http://hadoop.apache.org/common/docs/current/mapred-default.html
(core-default.html, hdfs-default.html)

On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Thanks! When I tried to search for this property I couldn't find it. Is
 there a page that has complete list of properties and it's usage?

 On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi 
 prash1...@gmail.comwrote:

 Yes. These are hadoop properties - using set is just a way for Pig to set
 those properties in your job conf.


 On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Is there a way to compress map only jobs to compress map output that gets
  stored on hdfs as part-m-* files? In pig I used :
 
  Would these work form plain map reduce jobs as well?
 
 
  set output.compression.enabled true;
 
  set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
 




-- 
Harsh J


Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
Thanks a lot for the link!

On Mon, Apr 30, 2012 at 8:22 PM, Harsh J ha...@cloudera.com wrote:

 Hey Mohit,

 Most of what you need to know for jobs is available at
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

 A more complete, mostly unseparated list of config params are also
 available at:
 http://hadoop.apache.org/common/docs/current/mapred-default.html
 (core-default.html, hdfs-default.html)

 On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  Thanks! When I tried to search for this property I couldn't find it. Is
  there a page that has complete list of properties and it's usage?
 
  On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi 
 prash1...@gmail.comwrote:
 
  Yes. These are hadoop properties - using set is just a way for Pig to
 set
  those properties in your job conf.
 
 
  On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   Is there a way to compress map only jobs to compress map output that
 gets
   stored on hdfs as part-m-* files? In pig I used :
  
   Would these work form plain map reduce jobs as well?
  
  
   set output.compression.enabled true;
  
   set output.compression.codec
 org.apache.hadoop.io.compress.SnappyCodec;
  
 



 --
 Harsh J



Re: adding or restarting a data node in a hadoop cluster

2012-04-30 Thread Anil Gupta
@Amith: if the DN is getting its IP from DHCP, then the IP address might change 
after a reboot. 
Dynamic IPs in the cluster are not a good choice, IMO.

Best Regards,
Anil

On Apr 30, 2012, at 8:22 PM, Amith D K amit...@huawei.com wrote:

 Hi sumadhur,
 
 As u mentioned configureg the NN and JT ip would be enough.
 
 I am not able to understand how on DN restart its IP get changed?
 
 
 From: sumadhur [sumadhur_i...@yahoo.com]
 Sent: Tuesday, May 01, 2012 10:58 AM
 To: common-user@hadoop.apache.org
 Subject: adding or restarting a data node in a hadoop cluster
 
 I am on hadoop 0.20.
 
 To add a data node to a cluster, if we do not use the include/exclude/slaves 
 files, do we need to  do anything other than configuring the hdfs-site.xml to 
 point to name node and the mapred-site.xml to point to job tracker?
 
 For example, should the job tracker and name node be restarted always?
 
 On a related note, if we restart a data node(that has some blocks on it) and 
 the data node now has new IP address, Should we restart namenode/job tracker 
 for hdfs and map-reduce to function correctly?
 Would the blocks on the restarted data node be detected or would hdfs think 
 that these blocks were lost and start replicating them?
 
 Thanks,
 Sumadhur