Re: Getting job progress in Java application
Thanks a lot, I checked the docs and the submitJob() method did the job. Two more questions, please:

[1] My app is running on Hadoop 0.20.203. If I upgrade the libraries to 1.0.x, will the old API still work, or is it necessary to rewrite the map() and reduce() functions against the new API?
[2] Does the new API support MultipleOutputs?

Thanks again.

On 04/30/2012 12:32 AM, Bill Graham wrote:
> Take a look at the JobClient API. You can use that to get the current progress of a running job.

On Sunday, April 29, 2012, Ondřej Klimpera wrote:
> Hello, I'd like to ask what the preferred way is of getting a running job's progress from the Java application that executed it. I'm using Hadoop 0.20.203 and tried the job.end.notification.url property, which works well, but as the property name says, it only sends job-end notifications. What I need is to get updates on map() and reduce() progress. Please advise how to do this.
> Thanks.
> Ondrej Klimpera
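For reference, polling a submitted job's progress with the old "mapred" JobClient API discussed above can look roughly like the sketch below; the JobConf setup, class name, and polling interval are illustrative, not from the thread:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class ProgressPoller {
        public static void poll(JobConf conf) throws Exception {
            JobClient client = new JobClient(conf);
            RunningJob job = client.submitJob(conf);   // returns immediately, does not block
            while (!job.isComplete()) {
                // mapProgress()/reduceProgress() return a fraction between 0.0 and 1.0
                System.out.printf("map %.0f%%  reduce %.0f%%%n",
                        job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5000);
            }
            System.out.println("Job " + (job.isSuccessful() ? "succeeded" : "failed"));
        }
    }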
Can't construct instance of class org.apache.hadoop.conf.Configuration
Hello,

I'm trying to run an application, written in C++, that uses libhdfs. I have compiled the code, but I get an error when I attempt to run the application: Can't construct instance of class org.apache.hadoop.conf.Configuration.

Initially, I was receiving an error saying that CLASSPATH was not set. That was easy, so I set CLASSPATH to include the following three directories, in this order:

1. $HADOOP_HOME
2. $HADOOP_HOME/lib
3. $HADOOP_HOME/conf

The CLASSPATH-not-set error went away, and now I receive the error about the Configuration class. I'm assuming that I do not have something on the path that I need, but everything I have read says to simply include these three directories. Does anybody have any idea what I might be missing? The full exception is pasted below.

Thanks,
Ryan

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
Can't construct instance of class org.apache.hadoop.conf.Configuration
node: /home/ryan/.node-gyp/0.7.8/src/node_object_wrap.h:61: void node::ObjectWrap::Wrap(v8::Handle<v8::Object>): Assertion `handle_.IsEmpty()' failed.
Aborted (core dumped)
Re: Can't construct instance of class org.apache.hadoop.conf.Configuration
Hi,

I would try this:

export CLASSPATH=$(hadoop classpath)

Brock

On Mon, Apr 30, 2012 at 10:15 AM, Ryan Cole <r...@rycole.com> wrote:
> Hello, I'm trying to run an application, written in C++, that uses libhdfs. [rest of the original message and exception quoted above, snipped]

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
Re: Can't construct instance of class org.apache.hadoop.conf.Configuration
Brock,

Ah, thanks. I did not realize that you could do that. That sets the correct CLASSPATH!

Thanks,
Ryan

On Mon, Apr 30, 2012 at 10:22 AM, Brock Noland <br...@cloudera.com> wrote:
> Hi, I would try this:
> export CLASSPATH=$(hadoop classpath)
> [rest of the quoted thread snipped]
Re: KMeans clustering on Hadoop infrastructure
You are likely going to get more help from talking to the Mahout mailing list:
https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists,+IRC+and+Archives

--Bobby Evans

On 4/28/12 7:45 AM, Lukáš Kryške <lu...@hotmail.cz> wrote:
> Hello,
> I am successfully running the K-Means clustering sample from the 'Mahout in Action' book (example in Chapter 7.3) in my Hadoop environment. Now I need to extend the program to take the vectors from a file located in my HDFS. I need to process clustering of millions or billions of vectors which are represented by comma-separated values in a .txt file in HDFS. The data are stored in this pattern:
> x1,y1
> x2,y2
> ...
> xn,yn
> As I understood from the book, I need to transform my .txt file with vectors into Hadoop's SequenceFile first - how do I do that most efficiently? And how do I tell the KMeansDriver that the input path contains a SequenceFile with vectors?
> Thanks for help.
> Best Regards,
> Lukas Kryske
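For the conversion step asked about above, a common pattern (sketched here with the Mahout 0.x classes used in 'Mahout in Action'; the paths and key naming are illustrative, not from the thread) is to read each comma-separated line, wrap it in a Vector, and append it to a SequenceFile of Text/VectorWritable pairs:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.VectorWritable;

    public class TextToSeq {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path in = new Path(args[0]);    // e.g. the points.txt file in HDFS
            Path out = new Path(args[1]);   // e.g. a points.seq file for KMeans input

            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(in)));
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, VectorWritable.class);
            try {
                String line;
                int n = 0;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(",");
                    double[] values = new double[parts.length];
                    for (int i = 0; i < parts.length; i++) {
                        values[i] = Double.parseDouble(parts[i]);
                    }
                    // The key is arbitrary; KMeans reads only the VectorWritable values.
                    writer.append(new Text("v" + n++),
                                  new VectorWritable(new DenseVector(values)));
                }
            } finally {
                writer.close();
                reader.close();
            }
        }
    }

KMeansDriver can then be pointed at the directory holding the resulting SequenceFile as its input path; for millions or billions of vectors the same conversion is usually run as a map-only MapReduce job rather than a single-threaded loop like this.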
Re: Node-wide Combiner
Do you mean that when multiple map tasks run on the same node, there would be a combiner that runs across all of their output? There is nothing for that right now. It seems like it could be somewhat difficult to get right given the current architecture.

--Bobby Evans

On 4/27/12 11:13 PM, Superymk <superymk...@hotmail.com> wrote:
> Hi all,
> I am a newbie in Hadoop and I like the system. I have one question: Is there a node-wide combiner or something similar in Hadoop? I think it could further reduce the number of intermediate results. Any hint?
> Thanks a lot!
> Superymk
Re: EMR Hadoop
On Apr 30, 2012, at 10:27 AM, Jay Vyas wrote:
> Hi guys:
> 1) Does anybody know if there is a VM out there which runs EMR Hadoop? I would like to have a local VM for dev purposes that mirrors the EMR Hadoop instances.
> 2) How does EMR's Hadoop differ from Apache Hadoop and Cloudera's Hadoop?
>
> --
> Jay Vyas
> MMSB/UCHC

EMR runs Apache Hadoop 0.20.205, which is very, very close to Apache Hadoop 1.0.x.

Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
EOFException
I have been running several MapReduce jobs on some input text files. They were working fine earlier, and then I suddenly started getting an EOFException every time. Even the jobs that ran fine before (on the exact same input files) aren't running now. I am a bit perplexed as to what is causing this error. Here is the error:

12/04/30 12:55:55 INFO mapred.JobClient: Task Id : attempt_201202240659_6328_m_01_1, Status : FAILED
java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:128)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:967)
    at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:30)
    at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:83)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1253)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at com.xerox.twitter.bin.UserTime.readFields(UserTime.java:31)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:122)

Since the compare function seems to be involved, here is my custom key class. Note: I did not include year in the key because all keys have the same year.
public class UserTime implements WritableComparable<UserTime> {

    int id, month, day, year, hour, min, sec;

    public UserTime() {
    }

    public UserTime(int u, int mon, int d, int y, int h, int m, int s) {
        id = u;
        month = mon;
        day = d;
        year = y;
        hour = h;
        min = m;
        sec = s;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        month = in.readInt();
        day = in.readInt();
        year = in.readInt();
        hour = in.readInt();
        min = in.readInt();
        sec = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.write(id);
        out.write(month);
        out.write(day);
        out.write(year);
        out.write(hour);
        out.write(min);
        out.write(sec);
    }

    @Override
    public int compareTo(UserTime that) {
        if (compareUser(that) == 0)
            return (compareTime(that));
        else if (compareUser(that) == 1)
            return 1;
        else
            return -1;
    }

    private int compareUser(UserTime that) {
        if (id > that.id)
            return 1;
        else if (id == that.id)
            return 0;
        else
            return -1;
    }

    // assumes all are from the same year
    private int compareTime(UserTime that) {
        if (month > that.month
                || (month == that.month && day > that.day)
                || (month == that.month && day == that.day && hour > that.hour)
                || (month == that.month && day == that.day && hour == that.hour && min > that.min)
                || (month == that.month && day == that.day && hour == that.hour && min == that.min && sec > that.sec))
            return 1;
        else if (month == that.month && day == that.day && hour == that.hour && min == that.min && sec == that.sec)
            return 0;
        else
            return -1;
    }

    public String toString() {
        String h, m, s;
        if (hour < 10) h = "0" + hour; else h = Integer.toString(hour);
        if (min < 10) m = "0" + min; else m = Integer.toString(hour);
        if (sec < 10) s = "0" + min; else s = Integer.toString(hour);
        return (id + "\t" + month + "/" + day + "/" + year + "\t" + h + ":" + m + ":" + s);
    }
}

Thanks for any help.

Regards,
Keith
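A detail worth checking in the class above: java.io.DataOutput.write(int) writes only the low-order byte of its argument, while readFields() reads each field with readInt(), which expects four bytes. Deserialization can therefore run off the end of a record and throw EOFException during the sort. If that is indeed the cause here, write() would need writeInt() for every field, roughly:

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);     // writeInt() emits 4 bytes, matching readInt() in readFields()
        out.writeInt(month);
        out.writeInt(day);
        out.writeInt(year);
        out.writeInt(hour);
        out.writeInt(min);
        out.writeInt(sec);
    }

The matching readFields() could then stay as it is.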
Weird error starting up pseudo-dist cluster.
Here's an error I've never seen before. I rebooted my machine sometime last week, so obviously when I tried to run a Hadoop job this morning, the first thing I was quickly reminded of was that the pseudo-distributed cluster wasn't running. I started it up, only to watch the job tracker appear in the browser briefly and then go away (the typical error complaining that the port was closed, as if the jobtracker is gone). The namenode, interestingly, never came up during this time. I tried stopping and starting everything a few times, but to no avail. I inspected the logs and saw this:

java.io.IOException: Missing directory /tmp/hadoop-keithw/dfs/name

Sure enough, it isn't there. I'm not familiar with this directory, so I can't say whether it was ever there before, but presumably it was. Now, I assume I could get around this by formatting a new namenode, but then I would have to copy my data back into HDFS from scratch. So, two questions:

(1) Any idea what the heck is going on here, how this happened, what it means?
(2) Is there any way to recover without starting over from scratch?

Thanks.

Keith Wiley
kwi...@keithwiley.com
keithwiley.com / music.keithwiley.com

"And what if we picked the wrong religion? Every week, we're just making God madder and madder!"
-- Homer Simpson
Re: Weird error starting up pseudo-dist cluster.
On 30/04/2012 19:48, Keith Wiley wrote:
> Here's an error I've never seen before. [description of the jobtracker and namenode failing to start and of the missing /tmp/hadoop-keithw/dfs/name directory, snipped]
>
> (1) Any idea what the heck is going on here, how this happened, what it means?

The default HDFS config puts the namenode data in /tmp. This may be OK for casual testing, but in all other situations it's the worst location imaginable - for example, Linux cleans this directory on reboot, and I think that's what happened here. Your HDFS data is gone to a better world...

> (2) Is there any way to recover without starting over from scratch?

Regretfully, no. The lesson is: don't put precious files in /tmp.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Weird error starting up pseudo-dist cluster.
On Apr 30, 2012, at 11:10, Andrzej Bialecki wrote:
> The default HDFS config puts the namenode data in /tmp. [...] The lesson is: don't put precious files in /tmp.

Ah, okay. So, when setting up a single machine as just a pseudo-dist cluster, what is a better way to do it? Where would one put the temp directories in order to gain improved robustness of the Hadoop system? Is this the sort of thing to put in a home directory? I never really conceptualized it that way; I always thought HDFS and Hadoop in general were sort of system-level concepts. This is a single-user machine and I have full root/admin control over it, so it's not a permissions issue; I'm just asking at a philosophical level how to set up a pseudo-dist cluster in the most effective way.

Thanks.

Keith Wiley
kwi...@keithwiley.com
keithwiley.com / music.keithwiley.com

"I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me."
-- Abe (Grandpa) Simpson
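One common arrangement (a minimal sketch, assuming 0.20/1.x property names; the paths under the home directory are only examples, not from the thread) is to point hadoop.tmp.dir, or the more specific dfs.name.dir and dfs.data.dir, at a location that survives reboots:

    <!-- core-site.xml: keep Hadoop's working data out of /tmp (example path) -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/keithw/hadoop-data</value>
    </property>

    <!-- hdfs-site.xml: or set the HDFS directories explicitly (example paths) -->
    <property>
      <name>dfs.name.dir</name>
      <value>/home/keithw/hadoop-data/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/keithw/hadoop-data/dfs/data</value>
    </property>

After moving these directories the namenode has to be formatted again (hadoop namenode -format), which is unavoidable here anyway since the old metadata is already gone.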
reducer not seeing external jars
I'm trying to use -libjars to load an external jar along with the job jar, but the reducer still fails with a ClassNotFoundException against a class from the external jar (JFreeChart). I'm not really sure how to approach this. It either works or it doesn't... and so far it doesn't. Can I make the mapper or reducer dump the classpath so I can see what it thinks it has access to? Aside from exploring the issue, like investigating the classpath, etc., why might -libjars not work as expected in the first place?

Thanks.

Keith Wiley
kwi...@keithwiley.com
keithwiley.com / music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered."
-- Keith Wiley
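One frequent cause, offered here only as a guess: -libjars is honored only when the job driver parses the generic options, for example by going through ToolRunner. A minimal driver sketch along those lines (the class and job names are illustrative):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ChartJobDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already reflects whatever -libjars/-D options ToolRunner parsed
            Job job = new Job(getConf(), "chart-job");
            job.setJarByClass(ChartJobDriver.class);
            // ... set mapper/reducer classes, key/value types, input/output paths here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner strips -libjars and friends before handing args to run()
            System.exit(ToolRunner.run(new ChartJobDriver(), args));
        }
    }

If the driver already does this, printing System.getProperty("java.class.path") from the mapper or reducer setup is one way to dump the task-side classpath and see what actually shipped to the nodes.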
Compressing map only output
Is there a way, for map-only jobs, to compress the map output that gets stored on HDFS as part-m-* files? In Pig I used:

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

Would these work for plain MapReduce jobs as well?
Re: Compressing map only output
Yes. These are Hadoop properties - using set is just a way for Pig to set those properties in your job conf.

On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> Is there a way, for map-only jobs, to compress the map output that gets stored on HDFS as part-m-* files? [rest of the original message quoted above, snipped]
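For a plain MapReduce job, the usual way to get the same effect is through FileOutputFormat; a sketch with the new org.apache.hadoop.mapreduce API, assuming a Hadoop build with Snappy support (as the Pig snippet above implies):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedMapOnly {
        public static Job configure(Configuration conf) throws Exception {
            Job job = new Job(conf, "map-only-compressed");
            job.setNumReduceTasks(0);                       // map-only: output goes straight to part-m-* files
            FileOutputFormat.setCompressOutput(job, true);  // same intent as output.compression.enabled
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            // ... mapper class, key/value types, input/output paths set elsewhere ...
            return job;
        }
    }

With zero reduce tasks the compressed output lands directly in the part-m-* files.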
Re: Compressing map only output
Thanks! When I tried to search for these properties I couldn't find them. Is there a page that has a complete list of properties and their usage?

On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
> Yes. These are Hadoop properties - using set is just a way for Pig to set those properties in your job conf.
> [earlier messages snipped]
adding or restarting a data node in a hadoop cluster
I am on hadoop 0.20. To add a data node to a cluster, if we do not use the include/exclude/slaves files, do we need to do anything other than configuring the hdfs-site.xml to point to the name node and the mapred-site.xml to point to the job tracker? For example, should the job tracker and name node always be restarted?

On a related note, if we restart a data node (that has some blocks on it) and the data node now has a new IP address, should we restart the namenode/job tracker for HDFS and MapReduce to function correctly? Would the blocks on the restarted data node be detected, or would HDFS think that these blocks were lost and start replicating them?

Thanks,
Sumadhur
Re: adding or restarting a data node in a hadoop cluster
Sumadhur,

(Inline)

On Tue, May 1, 2012 at 8:28 AM, sumadhur <sumadhur_i...@yahoo.com> wrote:
> I am on hadoop 0.20. To add a data node to a cluster, if we do not use the include/exclude/slaves files, do we need to do anything other than configuring the hdfs-site.xml to point to the name node and the mapred-site.xml to point to the job tracker? For example, should the job tracker and name node always be restarted?

Just booting up the DN service with the right config and a configured network for proper communication should suffice. In case you're using rack-awareness, ensure you update the rack-awareness script for your new node and refresh the NN before you start your DN. A restart isn't required for adding new nodes to the cluster.

> On a related note, if we restart a data node (that has some blocks on it) and the data node now has a new IP address, should we restart the namenode/job tracker for HDFS and MapReduce to function correctly? Would the blocks on the restarted data node be detected, or would HDFS think that these blocks were lost and start replicating them?

Stopping, changing the IP/hostname cleanly, and restarting the DN back up should not cause any block movement.

--
Harsh J
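For completeness, the "point at the name node and job tracker" part of the original question usually comes down to two properties on the new node (0.20-era names; the host names and ports below are placeholders, not from the thread):

    <!-- core-site.xml -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:9001</value>
    </property>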
RE: adding or restarting a data node in a hadoop cluster
Hi Sumadhur,

As you mentioned, configuring the NN and JT addresses would be enough. I am not able to understand how, on a DN restart, its IP would get changed?

From: sumadhur [sumadhur_i...@yahoo.com]
Sent: Tuesday, May 01, 2012 10:58 AM
To: common-user@hadoop.apache.org
Subject: adding or restarting a data node in a hadoop cluster

> I am on hadoop 0.20. To add a data node to a cluster, if we do not use the include/exclude/slaves files, do we need to do anything other than configuring the hdfs-site.xml to point to the name node and the mapred-site.xml to point to the job tracker? [rest of the original message quoted above, snipped]
Re: Compressing map only output
Hey Mohit,

Most of what you need to know for jobs is available at
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

A more complete, mostly unseparated list of config params is also available at:
http://hadoop.apache.org/common/docs/current/mapred-default.html
(core-default.html, hdfs-default.html)

On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> Thanks! When I tried to search for these properties I couldn't find them. Is there a page that has a complete list of properties and their usage?
> [earlier messages snipped]

--
Harsh J
Re: Compressing map only output
Thanks a lot for the link!

On Mon, Apr 30, 2012 at 8:22 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Mohit,
> Most of what you need to know for jobs is available at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
> A more complete, mostly unseparated list of config params is also available at: http://hadoop.apache.org/common/docs/current/mapred-default.html (core-default.html, hdfs-default.html)
> [earlier messages snipped]
Re: adding or restarting a data node in a hadoop cluster
@Amith: if the DN is getting its IP from DHCP, then the IP address might change after a reboot. Dynamic IPs in the cluster are not a good choice, IMO.

Best Regards,
Anil

On Apr 30, 2012, at 8:22 PM, Amith D K <amit...@huawei.com> wrote:
> Hi Sumadhur,
> As you mentioned, configuring the NN and JT addresses would be enough. I am not able to understand how, on a DN restart, its IP would get changed?
> [rest of the quoted thread snipped]