Java inputformat for pipes job
Hi, I implemented a custom input format in Java for a Map Reduce job. The mapper and reducer classes are implemented in C++, using the Hadoop Pipes API. The package documentation for org.apache.hadoop.mapred.pipes states that "The job may consist of any combination of Java and C++ RecordReaders, Mappers, Partitioner, Combiner, Reducer, and RecordWriter." I packaged the input format class in a jar file and ran the job invocation command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

It keeps failing with a ClassNotFoundException. Although I've specified the jar file name with the -jar parameter, the input format class still cannot be located. Is there any other means to specify the input format class, or the job jar file, for a Pipes job?

Stack trace:

Exception in thread "main" java.lang.ClassNotFoundException: mytest.PriceInputFormat
  at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:247)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
  at org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:309)
  at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:357)

Thanks,
Rahul Sood
[EMAIL PROTECTED]
Re: Java inputformat for pipes job
You should use the -pipes option in the command. For the input format, you can pack it into the hadoop core class jar file, or put it into the cache file.

2008/4/8, Rahul Sood [EMAIL PROTECTED]:
Hi, I implemented a custom input format in Java for a Map Reduce job. The mapper and reducer classes are implemented in C++, using the Hadoop Pipes API. [...]
Re: Can't get DFS file size
Maybe because you pass Strings to the LongWritables?

micha

11 Nov. wrote:
Hi folks, I'm writing a little test program to check the writing speed of the DFS file system, but can't get the file size using fs.getFileStatus(file).getLen() or fs.getContentLength(file). Here is my code:

package org.apache.hadoop.fs;

import java.io.IOException;
import java.io.OutputStream;
import junit.framework.TestCase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class TestDFSWrite extends TestCase {
  static String ROOT = System.getProperty("test.build.data", "fs_test");
  static Path DATA_DIR = new Path(ROOT, "fs_data");
  static long MEGA = 1024 * 1024;
  static int BUFFER_SIZE = 4096;
  static Configuration conf = new Configuration();
  static FileSystem fs;
  static byte[] buffer = new byte[BUFFER_SIZE];
  static boolean finished;

  public class FileStatusChecker extends Thread {
    Path file;
    Path resultFile;
    int interval;

    public FileStatusChecker(Path file, Path resultFile, int interval) {
      this.file = file;
      this.resultFile = resultFile;
      this.interval = interval;
    }

    public void run() {
      System.out.println("Here is the checker running!!");
      System.out.println(file.toString());
      SequenceFile.Writer writer = null;
      try {
        long lastLen = 1;
        long thisLen = 1;
        writer = SequenceFile.createWriter(fs, conf, resultFile, Text.class, Text.class,
            CompressionType.NONE);
        while (!finished) {
          lastLen = thisLen;
          if (fs.exists(file)) {
            System.out.println("File exists!");
            thisLen = fs.getContentLength(file);
            System.out.println("@@" + thisLen);
          } else {
            sleep(interval * 10);
            continue;
          }
          long length = thisLen - lastLen;
          System.out.println("thisLen is " + thisLen + " lastLen is " + lastLen);
          long cur = System.currentTimeMillis();
          cur = cur - (cur % 10);
          LongWritable current = new LongWritable(cur);
          writer.append(new Text(current.toString()), new Text(new LongWritable(length).toString()));
          sleep(interval * 10);
        }
        return;
      } catch (Exception e) {
        e.printStackTrace();
      } finally {
        try { writer.close(); } catch (Exception e) { e.printStackTrace(); }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    try {
      fs = FileSystem.get(conf);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    String testFunc = "";
    String fileName = "";
    String resultFileName = "";
    int fileSize = 0;
    int interval = 0;
    String usage = "Usage: TestDFSWrite -testfunc [read/write] -file foo -size M -interval MS -result resultFile";
    if (args.length == 0) {
      System.err.println(usage);
      System.exit(-1);
    }
    for (int i = 0; i < args.length; i++) {  // parse command line
      if (args[i].equals("-testfunc")) { testFunc = args[++i]; }
      else if (args[i].equals("-file")) { fileName = args[++i]; }
      else if (args[i].equals("-size")) { fileSize = Integer.parseInt(args[++i]); }
      else if (args[i].equals("-interval")) { interval = Integer.parseInt(args[++i]); }
      else if (args[i].equals("-result")) { resultFileName = args[++i]; }
    }
    long total = fileSize * MEGA;
    OutputStream out;
    Path file, resultFile;
    fs.delete(DATA_DIR);
    if (testFunc.equals("read")) {
      System.out.println("This option is not ready.");
      return;
    } else if (testFunc.equals("write")) {
      file = new Path(DATA_DIR, fileName);
      resultFile = new Path(DATA_DIR, resultFileName);
    } else {
      System.out.println("Invalid command line option.");
      return;
    }
    FileStatusChecker checker = new TestDFSWrite().new FileStatusChecker(file, resultFile, interval);
    System.out.println(file.toString());
    // System.out.println("F L: " + new LongWritable(fs.getContentLength(file)).toString());
    out = fs.create(file);
    checker.start();
    long written = 0;
    try {
      finished = false;
      while (written < total) {
        long remains = total - written;
        int length = (remains <= buffer.length) ? (int) remains : buffer.length;
        out.write(buffer, 0, length);
        out.flush();
        written += length;
        System.out.println("One segment done!");
        System.out.println("F L: " + new
Re: Can't get DFS file size
I tried to play with the little test by attaching eclipse on when it started, what surprised me is that the size could be got in eclipse, and the result file is written as expected. Can anybody explain this?

2008/4/8, 11 Nov. [EMAIL PROTECTED]:
Hi folks, I'm writing a little test program to check the writing speed of the DFS file system, but can't get the file size using fs.getFileStatus(file).getLen() or fs.getContentLength(file). [...]
Sorting the OutputCollector
Hi, I have implemented key and value pairs in the following way: Key (Text class), Value (Custom class):

word1 word2

class Custom {
  int freq;
  TreeMap<String, ArrayList<String>>
}

I construct this type of key/value pair in the OutputCollector of the reduce phase. Now I want to SORT this output in decreasing order of the value of freq in the Custom class. Could someone suggest a possible way to do this?

Thanks, Aayush
Newbie asking: ordinary filesystem above Hadoop
Hi! Yes, I'm aware that it's not a good idea to build an ordinary filesystem on top of Hadoop. Let's say I'm trying to build a system for my users with 500 GB of space for every user. It seems that Hadoop can write/store 500 GB fine, but reading and altering the data later isn't easy (at least not altering). How do the big boys do this? E.g. the Google filesystem, with Gmail above it (and still the latency seems fine for the remote end user)? How about Amazon S3? Do the big players implement some caching layers above the Hadoop-like system? My dream is to have a system where it's easy to add more space when needed, with all those automatic features: balancing, recovery of data (keeping it really there no matter what happens), etc. I guess I'm not alone there. BR,
-- MJo
Re: Newbie asking: ordinary filesystem above Hadoop
HDFS has slightly different design goals. It's not meant as a general-purpose filesystem; it's meant as fast sequential input/output storage for Hadoop's map/reduce.

Andreas

On Tuesday, 08.04.2008 at 16:24 +0300, Mika Joukainen wrote:
Hi! Yes, I'm aware that it's not a good idea to build an ordinary filesystem on top of Hadoop. [...]
DFS behavior when the disk goes bad
Hi, We had a bad disk issue on one of the boxes and I am seeing some strange behaviour. Just wanted to confirm whether this is expected:

* We are running a small cluster with 10 data nodes and a name node
* Each data node has 6 disks
* While a job was running, one of the disks in one data node got corrupted and the node got blacklisted
* The job got killed because there was some space issue in the entire cluster and it couldn't continue
* Now I tried removing some unnecessary data; the disk usage started coming down on all nodes except the node which got blacklisted in the last job (is this expected?)
* I restarted the entire cluster; after some time the disk usage started coming down on that corrupted-disk node and it went very low. Essentially, it has removed everything from that node. (Does the dfs remove the data from all disks on the node if one of the disks was bad? And why didn't it do it before restarting?)

Thanks,
Murali
Re: Java inputformat for pipes job
I'm invoking hadoop with the pipes command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

I tried the -file and -cacheFile options, but when either of these is passed to hadoop pipes the command just exits with a usage message. There must be a way to specify a jar for a job implemented in C++ with the Hadoop Pipes API. The documentation states that record readers and writers for Pipes jobs can be implemented in Java. I looked at the source code of org.apache.hadoop.mapred.pipes.Submitter and it's doing the following:

/**
 * The main entry point and job submitter. It may either be used as
 * a command line-based or API-based method to launch Pipes jobs.
 */
public class Submitter {
  /**
   * Submit a pipes job based on the command line arguments.
   * @param args
   */
  public static void main(String[] args) throws Exception {
    CommandLineParser cli = new CommandLineParser();
    //...
    if (results.hasOption("-inputformat")) {
      setIsJavaRecordReader(conf, true);
      conf.setInputFormat(getClass(results, "-inputformat", conf, InputFormat.class));
    }
  }
}

It is loading the input format class based on the value of the -inputformat cmdline parameter. That means there should be some way to package the input format class along with the program binary and other supporting files.

-Rahul Sood
[EMAIL PROTECTED]

You should use the -pipes option in the command. For the input format, you can pack it into the hadoop core class jar file, or put it into the cache file. [...]
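For illustration only (not from the original thread): one way to sidestep the client-side classloading problem is to submit the Pipes job through the API route that the Submitter javadoc above mentions, from a small driver launched with "bin/hadoop jar mytest.jar ...", so mytest.jar (and with it PriceInputFormat) is on the submitting JVM's classpath. Only setIsJavaRecordReader and conf.setInputFormat appear in the excerpt above; Submitter.setExecutable and Submitter.submitJob are assumed names that should be checked against your Hadoop release.

// Sketch: API-based Pipes submission, run as "bin/hadoop jar mytest.jar SubmitPipesJob".
// setExecutable/submitJob are assumptions to verify against your release.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.pipes.Submitter;

public class SubmitPipesJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitPipesJob.class);   // picks up the enclosing jar as the job jar
    conf.setInputFormat(mytest.PriceInputFormat.class);  // the Java input format from this thread
    Submitter.setIsJavaRecordReader(conf, true);          // Java record reader, C++ mapper/reducer
    Submitter.setExecutable(conf, "mgr/bin/TestMgr");     // assumed setter for the C++ binary (-program path)
    // ... set input/output paths and the settings from conf/mytest.xml as usual ...
    Submitter.submitJob(conf);                            // assumed API entry point
  }
}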
incorrect data check
running a job on my 5 node cluster, i get these intermittent exceptions in my logs:

java.io.IOException: incorrect data check
  at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
  at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
  at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
  at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
  at java.io.InputStream.read(InputStream.java:89)
  at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
  at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)

they occur across all the nodes, but i can't figure out which file is causing the problem. i'm working on the assumption it's a specific file because it's precisely the same error that occurs on each node. i've scoured the logs and can't find any reference to which file caused the hiccup, but this is causing the job to fail. other files are processed without a problem.

the files are 720 .gz files, ~100mb each. other files are processed on each node without a problem. i'm in the middle of testing the .gz files, but i don't think the problem is necessarily in the source data, as much as in when i copied it into hdfs.

so my questions are these:

is this a known issue?
is there some way to determine which file or files are causing these exceptions?
is there a way to run something like gzip -t blah.gz on the file in hdfs? or maybe a checksum?
is there a reason other than a corrupt datafile that would be causing this?
in the original mapreduce paper, they talk about a mechanism to skip records that cause problems. is there a way to have hadoop skip these problematic files and the associated records and continue with the job?

thanks,
colin
Re: secondary namenode web interface
I'd be happy to file a JIRA for the bug, I just want to make sure I understand what the bug is: is it the misleading null pointer message or is it that someone is listening on this port and not doing anything useful? I mean, what is the configuration parameter dfs.secondary.http.address for? Unless there are plans to make this interface work, this config parameter should go away, and so should the listening thread, shouldn't they?

Thanks,
-Yuri

On Friday 04 April 2008 03:30:46 pm dhruba Borthakur wrote:
Your configuration is good. The secondary Namenode does not publish a web interface. The null pointer message in the secondary Namenode log is a harmless bug but should be fixed. It would be nice if you can open a JIRA for it. Thanks, Dhruba

-Original Message-
From: Yuri Pradkin [mailto:[EMAIL PROTECTED]
Sent: Friday, April 04, 2008 2:45 PM
To: core-user@hadoop.apache.org
Subject: Re: secondary namenode web interface

I'm re-posting this in hope that someone would help. Thanks!

On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:
Hi, I'm running Hadoop (latest snapshot) on several machines and in our setup namenode and secondarynamenode are on different systems. I see from the logs that the secondary namenode regularly checkpoints the fs from the primary namenode. But when I go to the secondary namenode HTTP address (dfs.secondary.http.address) in my browser I see something like this:

HTTP ERROR: 500
init
RequestURI=/dfshealth.jsp
Powered by Jetty://

And in the secondary's log I find these lines:

2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp: java.lang.NullPointerException
  at org.apache.hadoop.dfs.dfshealth_jsp.init(dfshealth_jsp.java:21)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
  at java.lang.Class.newInstance0(Class.java:373)
  at java.lang.Class.newInstance(Class.java:326)
  at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
  at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
  at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
  at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
  at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
  at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
  at org.mortbay.http.HttpServer.service(HttpServer.java:954)
  at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
  at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
  at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
  at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
  at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
  at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Is something missing from my configuration? Anybody else seen these?

Thanks,
-Yuri
RE: secondary namenode web interface
The secondary Namenode uses the HTTP interface to pull the fsimage from the primary. Similarly, the primary Namenode uses dfs.secondary.http.address to pull the checkpointed fsimage back from the secondary to the primary. So, the definition of dfs.secondary.http.address is needed. However, the servlet dfshealth.jsp should not be served from the secondary Namenode. This servlet should be set up in such a way that only the primary Namenode invokes it.

Thanks,
dhruba

-Original Message-
From: Yuri Pradkin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 08, 2008 10:11 AM
To: core-user@hadoop.apache.org
Subject: Re: secondary namenode web interface

I'd be happy to file a JIRA for the bug, I just want to make sure I understand what the bug is: is it the misleading null pointer message or is it that someone is listening on this port and not doing anything useful? [...]
Re: Reduce Sort
On 4/8/08 10:43 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote:

I would like to try using Hadoop.

That is good for education, probably bad for run time. It could take SECONDS longer to run (oh my).

Do you mean to write another MapReduce program which takes the output of the first MapReduce (the already existing file of this format)

Yes.

And use count as the key and IP Address as the value.

Yes.

Is it possible to do this in the same program instead of writing another one.

No.

If it is not possible, is it something available in Hadoop once the first program is done, can I call Second program to do the sorting.

Yes. If you are using Java, just create a second configuration and do the same thing as you did the first time to run the program.

If I set the number of reducer to 1, then it will take more time to reduce all the maps and hence affect the performance right?

Not really. Most of the sorting work will be done by the mappers. The reducer will only be merging the data so it will be pretty fast. The largest cost will be startup time, the second largest will be network transfer time.
RE: Reduce Sort
Thanks Ted. I would like to try using Hadoop.

Do you mean to write another MapReduce program which takes the output of the first MapReduce (the already existing file of this format):

IP Add       Count
1.2. 5. 42      27
2.8. 6. 6       24
7.9.24.13        8
7.9. 6. 9      201

And use count as the key and IP Address as the value. Is it possible to do this in the same program instead of writing another one? If it is not possible, is there something available in Hadoop so that once the first program is done, I can call a second program to do the sorting?

If I set the number of reducers to 1, then it will take more time to reduce all the maps and hence affect the performance, right?

Thanks,
Senthil

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 08, 2008 11:53 AM
To: core-user@hadoop.apache.org; '[EMAIL PROTECTED]'
Subject: Re: Reduce Sort

There are two ways to do this. Both of them assume that you have counted the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively small, simply sorting the results using conventional sort is probably as good as it gets. This will take just a few lines of scripting code:

base=http://your-nameserver-here/data
wget $base/your-results-directory/part-0 --output-document=a
wget $base/your-results-directory/part-1 --output-document=b
sort -k1nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve the concatenation of a wild-carded list of files, but the method I show above isn't bad. You are likely to be unhappy at the perceived impurity of this approach, but I would ask you to think about why one might use hadoop at all. The best reason is to get high performance on large problems. The sorting part of this problem is not all that big a deal and using a conventional sort is probably the most effective approach here.

You can also do the sorting using hadoop. Just use a mapper that moves the count to the key and keeps the IP as the value. I think that if you use an IntWritable or LongWritable as the key then the default sorting would give you ascending order. You can also define the sort order so that you get descending order. Make sure you set the number of reducers to 1 so that you only get a single output file. If you have less than 10 million values, the conventional sort is likely to be faster simply because of hadoop's startup time.

On 4/8/08 8:37 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote:
Hi, I am new to MapReduce. After slightly modifying the example wordcount to count the IP Address, I have two files part-0 and part-1 with contents something like:

IP Add       Count
1.2. 5. 42      27
2.8. 6. 6       24
7.9.24.13        8
7.9. 6. 9      201

I want to sort it by IP Address count in descending order, i.e. I would expect to see:

7.9. 6. 9      201
1.2. 5. 42      27
2.8. 6. 6       24
7.9.24.13        8

Could you please suggest how to do this. And to merge both the partitions (part-0 and part-1) into one output file, are there any functions already available in the MapReduce framework, or do we need to use Java IO to do this?

Thanks,
Senthil
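For illustration, a rough sketch of the second job described above (count moved into the key, a descending comparator, and a single reducer). This is not from the original thread: the class names are invented, it assumes the first job wrote plain "ip <tab> count" text lines, and the FileInputFormat/FileOutputFormat helper names vary a little between Hadoop releases, so check them against your version.

// Sketch of the "move the count into the key" pass described above.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SortByCount {

  // Inverts LongWritable's natural order so larger counts come first.
  public static class DescendingLongComparator extends WritableComparator {
    public DescendingLongComparator() { super(LongWritable.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return -super.compare(a, b);
    }
  }

  // (ip, count) -> (count, ip) so the framework's sort runs on the count.
  public static class SwapMapper extends MapReduceBase
      implements Mapper<Text, Text, LongWritable, Text> {
    public void map(Text ip, Text count,
                    OutputCollector<LongWritable, Text> out, Reporter r) throws IOException {
      out.collect(new LongWritable(Long.parseLong(count.toString().trim())), ip);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SortByCount.class);
    conf.setJobName("sort-by-count");
    conf.setInputFormat(KeyValueTextInputFormat.class);  // first job's "ip <tab> count" output
    conf.setMapperClass(SwapMapper.class);
    conf.setReducerClass(IdentityReducer.class);          // just merges the sorted runs
    conf.setNumReduceTasks(1);                            // one globally sorted output file
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputKeyComparatorClass(DescendingLongComparator.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // first job's output dir
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}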
Re: secondary namenode web interface
Yuri, The NullPointerException should be fixed as Dhruba proposed. We do not have any secondary NN web interface as of today. The http server is used for transferring data between the primary and the secondary. I don't see that we can display anything useful on the secondary web UI except for the current status, config values, and the last checkpoint date/time. If you have anything in mind that can be displayed on the UI please let us know. You can also file a jira for the issue; it would be good if this discussion is reflected in it.

Thanks,
--Konstantin

dhruba Borthakur wrote:
The secondary Namenode uses the HTTP interface to pull the fsimage from the primary. [...]
New user, several questions/comments (MaxMapTaskFailuresPercent in particular)
The wiki has been down for more than a day, any ETA? I was going to search the archives for the status, but I'm getting 403's for each of the Archive links on the mailing list page: http://hadoop.apache.org/core/mailing_lists.html

My original question was about specifying MaxMapTaskFailuresPercent as a job conf parameter on the command line for streaming jobs. Is there a conf setting like the following? mapred.taskfailure.percent

My use case is parsing log files in DFS whose replication is necessarily set to 1; I'm seeing block retrieval failures that kill my job (I don't care if ~10% fail).

Thanks,
-- Ian Tegebo
Headers and footers on Hadoop output results
Hello. I'm using Hadoop to process several XML files, each with several XML records, through a group of Linux servers. I am using an XMLInputFormat that I found here on Nabble (http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html), and I'm using the TextOutputFormat with an overridden write function, to output XML. Yet, the XML needs its root tag and the <?xml ... ?> line. Where is the best place to put two writing functions like header() and footer()? I've tried, but all I managed was to write in the local task, not in the synchronized part-file.
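One place that usually works is a custom OutputFormat whose RecordWriter writes the header when the writer is created and the footer in close(), so each part-file comes out as a complete document. A rough sketch against the old mapred API, not from the original thread: the class name and root tag are invented, Text values are assumed to already be serialized XML records, and in very old releases the base-class and helper names differ slightly.

// Sketch: wrap each part-file in the <?xml ...?> line and a root element.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class XmlOutputFormat extends FileOutputFormat<Text, Text> {

  public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    final FSDataOutputStream out = fs.create(file, progress);
    // Header: written once per part-file, before any records.
    out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<records>\n".getBytes("UTF-8"));
    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) throws IOException {
        out.write(value.toString().getBytes("UTF-8"));  // value assumed to be one XML record
        out.write('\n');
      }
      public void close(Reporter reporter) throws IOException {
        out.write("</records>\n".getBytes("UTF-8"));    // footer closes the root element
        out.close();
      }
    };
  }
}

Note that each reduce task produces its own part-file, so with more than one reducer this yields several complete XML documents rather than a single one; use one reducer or concatenate afterwards if a single document is required.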
Re: secondary namenode web interface
On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote: If you have anything in mind that can be displayed on the UI please let us know. You can also file a jira for the issue, it would be good if this discussion is reflected in it.

Well, I guess we could have an interface to browse the checkpointed image (actually this is what I was expecting to see), but it's not that big of a deal. Filed https://issues.apache.org/jira/browse/HADOOP-3212

Thanks,
-Yuri
Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)
Looks like it is up to me.

On 4/8/08 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote:
The wiki has been down for more than a day, any ETA? [...]
Re: DFS behavior when the disk goes bad
The behavior seems correct, assuming "blacklisted" to mean the NameNode marked this node 'dead':

Murali Krishna wrote:
* We are running a small cluster with 10 data nodes and a name node
* Each data node has 6 disks
* While a job was running, one of the disks in one data node got corrupted and the node got blacklisted
* The job got killed because there was some space issue in the entire cluster and it couldn't continue
* Now I tried removing some unnecessary data; the disk usage started coming down on all nodes except the node which got blacklisted in the last job (is this expected?)

A DataNode will be marked dead if it does not heartbeat with the NameNode for 10 min or so. A bad disk could cause that (a thread trying to delete a block file from the bad disk might be hung, for e.g.). Once a datanode is marked dead, the NameNode does not interact with it, so it did not remove any files from that node.

* I restarted the entire cluster; after some time the disk usage started coming down on that corrupted-disk node and it went very low. Essentially, it has removed everything from that node. (Does the dfs remove the data from all disks on the node if one of the disks was bad? And why didn't it do it before restarting?)

By the time this node came back and functioned well, the NameNode had already re-replicated the valid data that was sitting on it, so when the node came up, the NameNode deleted most of the blocks on the datanode, since that data was either already deleted or replicated on other nodes.
Re: incorrect data check
so, in an attempt to track down this problem, i've stripped out most of the files for input, trying to identify which ones are causing the problem. i've narrowed it down, but i can't pinpoint it. i keep getting these incorrect data check errors below, but the .gz files test fine with gzip.

is there some way to run an md5 or something on the files in hdfs and compare it to the checksum of the files on my local machine? i've looked around the lists and through the various options to send to .../bin/hadoop, but nothing is jumping out at me.

this is particularly frustrating because it's causing my jobs to fail, rather than skipping the problematic input files. i've also looked through the conf file and don't see anything similar about skipping bad files without killing the job.

-colin

On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas [EMAIL PROTECTED] wrote:
running a job on my 5 node cluster, i get these intermittent exceptions in my logs: java.io.IOException: incorrect data check [...]
Re: incorrect data check
Colin, how about writing a streaming mapper which simply runs md5sum on each file it gets as input? Run this task along with the identity reducer, and you should be able to identify pretty quickly if there's an HDFS corruption issue.

Norbert

On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas [EMAIL PROTECTED] wrote:
so, in an attempt to track down this problem, i've stripped out most of the files for input, trying to identify which ones are causing the problem. [...]
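If the streaming route is awkward, a similar check can be done from a small client program that reads the file back through the FileSystem API and hashes it, for comparison with md5sum of the local copy. A sketch for illustration, not from the thread; reading through the DFS client should also exercise HDFS's own per-block CRC verification along the way.

// Sketch: compute the MD5 of a file stored in HDFS so it can be compared
// with "md5sum" of the local original.
import java.io.InputStream;
import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMd5 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MessageDigest md5 = MessageDigest.getInstance("MD5");

    InputStream in = fs.open(new Path(args[0]));   // path of the .gz file in HDFS
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) > 0) {
      md5.update(buf, 0, n);
    }
    in.close();

    // Print the digest as hex, in the same style as md5sum output.
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex + "  " + args[0]);
  }
}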
Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)
On Tue, Apr 8, 2008 at 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote: My original question was about specifying MaxMapTaskFailuresPercent as a job conf parameter on the command line for streaming jobs. Is there a conf setting like the following? mapred.taskfailure.percent

The job conf settings to control this are:

mapred.max.map.failures.percent
mapred.max.reduce.failures.percent

Both have a default of 0, meaning any failed task makes for a failed job (according to JobConf.java).

rick
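For a job submitted from Java, the same limits are exposed as JobConf setters (matching the MaxMapTaskFailuresPercent name mentioned above); for streaming, passing the property names above with -jobconf should have the same effect, though that is worth verifying against your streaming release. A minimal sketch:

// Sketch: tolerate up to 10% failed map tasks before the job is declared failed.
import org.apache.hadoop.mapred.JobConf;

public class FailurePercentExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setMaxMapTaskFailuresPercent(10);     // allow 10% of map tasks to fail
    conf.setMaxReduceTaskFailuresPercent(0);   // reducers must all succeed
    // equivalent to setting the property directly:
    // conf.setInt("mapred.max.map.failures.percent", 10);
  }
}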
Re: secondary namenode web interface
Unfortunately we do not have an api for the secondary nn that would allow browsing the checkpoint. I agree it would be nice to have one. Thanks for filing the issue. --Konstantin Yuri Pradkin wrote: On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote: If you have anything in mind that can be displayed on the UI please let us know. You can also find a jira for the issue, it would be good if this discussion is reflected in it. Well, I guess we could have interface to browse the checkpointed image (actually this is what I was expecting to see), but it's not that big of a deal. Filed https://issues.apache.org/jira/browse/HADOOP-3212 Thanks, -Yuri
Fuse-j-hadoopfs
Hi everybody, I have a question about fuse-j-hadoopfs. Does it handle the Hadoop permissions? I'm using hadoop 0.16.3. Thanks, X