Java inputformat for pipes job

2008-04-08 Thread Rahul Sood
Hi,

I implemented a customized input format in Java for a Map Reduce job.
The mapper and reducer classes are implemented in C++, using the Hadoop
Pipes API. 

The package documentation for org.apache.hadoop.mapred.pipes states that
"The job may consist of any combination of Java and C++ RecordReaders,
Mappers, Partitioner, Combiner, Reducer, and RecordWriter."

I packaged the input format class in a jar file and ran the job
invocation command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf
conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

It keeps failing with a ClassNotFoundException error.
Although I've specified the jar file name with the -jar parameter, the
input format class still cannot be located. Is there any other means to
specify the input format class, or the job jar file, for a Pipes job?

Stack trace:

Exception in thread "main" java.lang.ClassNotFoundException:
mytest.PriceInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
at
org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:309)
at
org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:357)

Thanks,

Rahul Sood
[EMAIL PROTECTED]




Re: Java inputformat for pipes job

2008-04-08 Thread 11 Nov.
You should use the -pipes option in the command.
For the input format, you can pack it into the hadoop core class jar file,
or put it into the cache file.
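
A minimal sketch of the classpath workaround, assuming your bin/hadoop script
honours the HADOOP_CLASSPATH environment variable (the jar path below is
illustrative); the -jar option is still what ships the jar with the job, the
environment variable only makes the class visible to the client-side Submitter:

    export HADOOP_CLASSPATH=/path/to/mytest.jar
    hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat \
        -conf conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr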

2008/4/8, Rahul Sood [EMAIL PROTECTED]:

 Hi,

 I implemented a customized input format in Java for a Map Reduce job.
 The mapper and reducer classes are implemented in C++, using the Hadoop
 Pipes API.

 The package documentation for org.apache.hadoop.mapred.pipes states that
 The job may consist of any combination of Java and C++ RecordReaders,
 Mappers, Paritioner, Combiner, Reducer, and RecordWriter

 I packaged the input format class in a jar file and ran the job
 invocation command:

 hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf
 conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

 It keeps failing with error ClassNotFoundException
 Although I've specified the jar file name with the -jar parameter, the
 input format class still cannot be located. Is there any other means to
 specify the input format class, or the job jar file, for a Pipes job ?

 Stack trace:

 Exception in thread main java.lang.ClassNotFoundException:
 mytest.PriceInputFormat
 at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at

 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
 at
 org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:309)
 at
 org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:357)

 Thanks,


 Rahul Sood
 [EMAIL PROTECTED]





Re: Can't get DFS file size

2008-04-08 Thread Michaela Buergle
Maybe because you pass Strings to the LongWritables?

micha

11 Nov. wrote:
 Hi folks,
 I'm writing a little test programm to check the writing speed of DFS
 file system, but can't get the file size using
 fs.getFileStatus(file).getLen() or fs.getContentLength(file). Here is my
 code:
 
 package org.apache.hadoop.fs;
 import java.io.IOException;
 import java.io.OutputStream;
 
 import junit.framework.TestCase;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.io.Text;
 
 public class TestDFSWrite extends TestCase {
   static String ROOT = System.getProperty("test.build.data", "fs_test");
   static Path DATA_DIR = new Path(ROOT, "fs_data");
   static long MEGA = 1024 * 1024;
   static int BUFFER_SIZE = 4096;
   static Configuration conf = new Configuration();
   static FileSystem fs;

   static byte[] buffer = new byte[BUFFER_SIZE];
   static boolean finished;

   public class FileStatusChecker extends Thread {
     Path file;
     Path resultFile;
     int interval;

     public FileStatusChecker(Path file, Path resultFile, int interval) {
       this.file = file;
       this.resultFile = resultFile;
       this.interval = interval;
     }

     public void run() {
       System.out.println("Here is the checker running!!");
       System.out.println(file.toString());
       SequenceFile.Writer writer = null;
       try {
         long lastLen = 1;
         long thisLen = 1;
         writer = SequenceFile.createWriter(fs, conf, resultFile,
             Text.class, Text.class, CompressionType.NONE);

         while (!finished) {
           lastLen = thisLen;
           if (fs.exists(file)) {
             System.out.println("File exists!");
             thisLen = fs.getContentLength(file);
             System.out.println("@@" + thisLen);
           } else {
             sleep(interval * 10);
             continue;
           }
           long length = thisLen - lastLen;
           System.out.println("thisLen is " + thisLen + " lastLen is " + lastLen);
           long cur = System.currentTimeMillis();
           cur = cur - (cur % 10);
           LongWritable current = new LongWritable(cur);
           writer.append(new Text(current.toString()),
               new Text(new LongWritable(length).toString()));
           sleep(interval * 10);
         }
         return;
       } catch (Exception e) {
         e.printStackTrace();
       } finally {
         try {
           writer.close();
         } catch (Exception e) {
           e.printStackTrace();
         }
       }
     }
   }

   public static void main(String[] args) throws Exception {
     try {
       fs = FileSystem.get(conf);
     } catch (IOException e) {
       throw new RuntimeException(e);
     }
     String testFunc = "";
     String fileName = "";
     String resultFileName = "";
     int fileSize = 0;
     int interval = 0;

     String usage = "Usage: TestDFSWrite -testfunc [read/write] -file foo -size M -interval MS -result resultFile";
     if (args.length == 0) {
       System.err.println(usage);
       System.exit(-1);
     }

     for (int i = 0; i < args.length; i++) {   // parse command line
       if (args[i].equals("-testfunc")) {
         testFunc = args[++i];
       } else if (args[i].equals("-file")) {
         fileName = args[++i];
       } else if (args[i].equals("-size")) {
         fileSize = Integer.parseInt(args[++i]);
       } else if (args[i].equals("-interval")) {
         interval = Integer.parseInt(args[++i]);
       } else if (args[i].equals("-result")) {
         resultFileName = args[++i];
       }
     }

     long total = fileSize * MEGA;
     OutputStream out;
     Path file, resultFile;
     fs.delete(DATA_DIR);

     if (testFunc.equals("read")) {
       System.out.println("This option is not ready.");
       return;
     } else if (testFunc.equals("write")) {
       file = new Path(DATA_DIR, fileName);
       resultFile = new Path(DATA_DIR, resultFileName);
     } else {
       System.out.println("Invalid command line option.");
       return;
     }

     FileStatusChecker checker = new TestDFSWrite().new FileStatusChecker(file, resultFile, interval);
     System.out.println(file.toString());
     // System.out.println("F L: " + new LongWritable(fs.getContentLength(file)).toString());
     out = fs.create(file);
     checker.start();
     long written = 0;
     try {
       finished = false;
       while (written < total) {
         long remains = total - written;
         int length = (remains <= buffer.length) ? (int) remains : buffer.length;
         out.write(buffer, 0, length);
         out.flush();
         written += length;
         System.out.println("One segment done!");
         System.out.println("F L: " + new
 

Re: Can't get DFS file size

2008-04-08 Thread 11 Nov.
I tried to play with the little test by attaching Eclipse to it when it
started; what surprised me is that the size could be read in Eclipse, and the
result file is written as expected. Can anybody explain this?

2008/4/8, 11 Nov. [EMAIL PROTECTED]:

 Hi folks,
 I'm writing a little test programm to check the writing speed of DFS
 file system, but can't get the file size using
 fs.getFileStatus(file).getLen() or fs.getContentLength(file). Here is my
 code:

 package org.apache.hadoop.fs;
 import java.io.IOException;
 import java.io.OutputStream;

 import junit.framework.TestCase;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.io.Text;

 public class TestDFSWrite extends TestCase {
   static String ROOT = System.getProperty("test.build.data", "fs_test");
   static Path DATA_DIR = new Path(ROOT, "fs_data");
   static long MEGA = 1024 * 1024;
   static int BUFFER_SIZE = 4096;
   static Configuration conf = new Configuration();
   static FileSystem fs;

   static byte[] buffer = new byte[BUFFER_SIZE];
   static boolean finished;

   public class FileStatusChecker extends Thread {
     Path file;
     Path resultFile;
     int interval;

     public FileStatusChecker(Path file, Path resultFile, int interval) {
       this.file = file;
       this.resultFile = resultFile;
       this.interval = interval;
     }

     public void run() {
       System.out.println("Here is the checker running!!");
       System.out.println(file.toString());
       SequenceFile.Writer writer = null;
       try {
         long lastLen = 1;
         long thisLen = 1;
         writer = SequenceFile.createWriter(fs, conf, resultFile,
             Text.class, Text.class, CompressionType.NONE);

         while (!finished) {
           lastLen = thisLen;
           if (fs.exists(file)) {
             System.out.println("File exists!");
             thisLen = fs.getContentLength(file);
             System.out.println("@@" + thisLen);
           } else {
             sleep(interval * 10);
             continue;
           }
           long length = thisLen - lastLen;
           System.out.println("thisLen is " + thisLen + " lastLen is " + lastLen);
           long cur = System.currentTimeMillis();
           cur = cur - (cur % 10);
           LongWritable current = new LongWritable(cur);
           writer.append(new Text(current.toString()),
               new Text(new LongWritable(length).toString()));
           sleep(interval * 10);
         }
         return;
       } catch (Exception e) {
         e.printStackTrace();
       } finally {
         try {
           writer.close();
         } catch (Exception e) {
           e.printStackTrace();
         }
       }
     }
   }

   public static void main(String[] args) throws Exception {
     try {
       fs = FileSystem.get(conf);
     } catch (IOException e) {
       throw new RuntimeException(e);
     }
     String testFunc = "";
     String fileName = "";
     String resultFileName = "";
     int fileSize = 0;
     int interval = 0;

     String usage = "Usage: TestDFSWrite -testfunc [read/write] -file foo -size M -interval MS -result resultFile";
     if (args.length == 0) {
       System.err.println(usage);
       System.exit(-1);
     }

     for (int i = 0; i < args.length; i++) {   // parse command line
       if (args[i].equals("-testfunc")) {
         testFunc = args[++i];
       } else if (args[i].equals("-file")) {
         fileName = args[++i];
       } else if (args[i].equals("-size")) {
         fileSize = Integer.parseInt(args[++i]);
       } else if (args[i].equals("-interval")) {
         interval = Integer.parseInt(args[++i]);
       } else if (args[i].equals("-result")) {
         resultFileName = args[++i];
       }
     }

     long total = fileSize * MEGA;
     OutputStream out;
     Path file, resultFile;
     fs.delete(DATA_DIR);

     if (testFunc.equals("read")) {
       System.out.println("This option is not ready.");
       return;
     } else if (testFunc.equals("write")) {
       file = new Path(DATA_DIR, fileName);
       resultFile = new Path(DATA_DIR, resultFileName);
     } else {
       System.out.println("Invalid command line option.");
       return;
     }

     FileStatusChecker checker = new TestDFSWrite().new FileStatusChecker(file, resultFile, interval);
     System.out.println(file.toString());
     // System.out.println("F L: " + new LongWritable(fs.getContentLength(file)).toString());
     out = fs.create(file);
     checker.start();
     long written = 0;
     try {
       finished = false;
       while (written < total) {
         long remains = total - written;
         int length = (remains <= buffer.length) ? (int) remains : buffer.length;
         out.write(buffer, 0, length);
 

Sorting the OutputCollector

2008-04-08 Thread Aayush Garg
Hi,

I have implemented Key and value pairs in the following way:

Key (Text class) Value(Custom class)
word1
word2


class Custom {
  int freq;
  TreeMap<String, ArrayList<String>>
}

I construct this type of key, value pair in the OutputCollector of the reduce
phase. Now I want to SORT this OutputCollector in decreasing order of
the value of frequency in the Custom class.
Could someone suggest a possible way to do this?

Thanks,
Aayush
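
One common approach is a second pass whose map output key is the frequency
(e.g. an IntWritable) and whose value is the word, sorted with a reversed
comparator so the largest counts come first. A minimal sketch of such a
comparator, against the old mapred API; the class name and wiring are
illustrative, not from this thread:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;

    // Reverses IntWritable's own comparator so keys sort in descending order.
    public class DescendingIntComparator extends IntWritable.Comparator {
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);  // flip the raw (byte-level) comparison
      }
      public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);                    // flip the object-level comparison too
      }
    }

The sorting job would then register it with something like
conf.setOutputKeyComparatorClass(DescendingIntComparator.class).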


Newbie asking: ordinary filesystem above Hadoop

2008-04-08 Thread Mika Joukainen
Hi!

Yes, I'm aware that it's not a good idea to build an ordinary filesystem on top
of Hadoop. Let's say that I try to build a system for my users with 500 GB of
space for every user. It seems that Hadoop can write/store 500 GB fine, but
reading and altering data later isn't easy (at least not altering).

How do the big boys do this? E.g. the Google filesystem, with Gmail on top of it
(and latency still seems fine for the remote end user)? How about Amazon S3?
Do the big players implement some caching layer above a Hadoop-like system?

My dream is to have a system where it is easy to add more space when needed, with
all those automatic features: balancing, recovery of data (keeping it really
there no matter what happens), etc. I guess I'm not alone there.

BR,
-- 
-- MJo


Re: Newbie asking: ordinary filesystem above Hadoop

2008-04-08 Thread Andreas Kostyrka
HDFS has slightly different design goals. It's not meant as a general-purpose
filesystem; it's meant as fast sequential input/output storage for Hadoop's
map/reduce.

Andreas

Am Dienstag, den 08.04.2008, 16:24 +0300 schrieb Mika Joukainen:
 Hi!
 
 Yes, I'm aware that it's not good idea build ordinary filesystem above
 Hadoop. Let's say that I try to build system for my users where is 500 GB
 space for every user. It seems that Hadoop can write/store 500 GB fine, but
 reading and altering data later isn't easy (at least not altering).
 
 How the big boys do this? E.g. Google filesystem, Gmail is above that (and
 still latency time seems fine for the remote enduser)? How about Amazon S3?
 Do the big players implement some caching layers above Hadoop like system?
 
 My dream is to have system with easy to add more space when needed, with all
 those automatic features: balancing, recovery of data (keeping it really
 there no matter what happens) etc. I guess I'm not alone there.
 
 BR,


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


DFS behavior when the disk goes bad

2008-04-08 Thread Murali Krishna
Hi,

We had a bad disk issue on one of the boxes and I am seeing
some strange behaviour. Just wanted to confirm whether this is
expected..

*   We are running a small cluster with 10 data nodes and a name node
*   Each data node has 6 disks
*   While a job was running, one of the disks in one data node got
corrupted and the node got blacklisted
*   The job got killed because there was a space issue in the
entire cluster and it couldn't continue
*   Now I tried removing some unnecessary data; the disk usage
started coming down on all nodes except the node which got blacklisted
in the last job (is this expected?)
*   I restarted the entire cluster; after some time the disk usage
started coming down on that corrupted-disk node and it went very low.
Essentially, it has removed everything from that node. (Does the DFS
remove the data from all disks on the node if one of the disks was
bad? And why didn't it do so before restarting?)

 

Thanks,

Murali



Re: Java inputformat for pipes job

2008-04-08 Thread Rahul Sood
I'm invoking hadoop with the pipes command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf
conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

I tried the -file and -cacheFile options but when either of these is
passed to hadoop pipes, the command just exits with a usage message.

There must be a way to specify a jar for a job implemented in C++ with
the hadoop Pipes API. The documentation states that record readers and
writers for Pipes jobs can be implemented in Java. I looked at the
source code of org.apache.hadoop.mapred.pipes.Submitter and it's doing
the following:

/**
 * The main entry point and job submitter. It may either be used as
 * a command line-based or API-based method to launch Pipes jobs.
 */
public class Submitter {

  /**
   * Submit a pipes job based on the command line arguments.
   * @param args
   */
  public static void main(String[] args) throws Exception {
    CommandLineParser cli = new CommandLineParser();
    //...
    if (results.hasOption("-inputformat")) {
      setIsJavaRecordReader(conf, true);
      conf.setInputFormat(getClass(results, "-inputformat", conf,
                                   InputFormat.class));
    }
  }
}

 It is loading the input format class based on the value of the
-inputformat cmdline parameter. That means there should be some way to
package the input format class along with the program binary and other
supporting files.
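
For reference, a sketch of driving the same job through the Submitter API
instead of the command line, so the jar carrying the input format is set
explicitly on the JobConf. The class name RunTestMgr and the exact 0.16-era
setter names are assumptions, not a tested recipe:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.pipes.Submitter;

    public class RunTestMgr {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RunTestMgr.class);
        conf.setJar("mytest.jar");                          // jar containing mytest.PriceInputFormat
        conf.setInputFormat(mytest.PriceInputFormat.class); // Java record reader side
        conf.setInputPath(new Path("mgr/in"));
        conf.setOutputPath(new Path("mgr/out"));

        Submitter.setIsJavaRecordReader(conf, true);        // records come from the Java input format
        Submitter.setExecutable(conf, "mgr/bin/TestMgr");   // the C++ mapper/reducer binary
        Submitter.runJob(conf);
      }
    }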

-Rahul Sood
[EMAIL PROTECTED]

 You should use the -pipes option in the command.
 For the input format, you can pack it into the hadoop core class jar file,
 or put it into the cache file.
 
 2008/4/8, Rahul Sood [EMAIL PROTECTED]:
 
  Hi,
 
  I implemented a customized input format in Java for a Map Reduce job.
  The mapper and reducer classes are implemented in C++, using the Hadoop
  Pipes API.
 
  The package documentation for org.apache.hadoop.mapred.pipes states that
  The job may consist of any combination of Java and C++ RecordReaders,
  Mappers, Paritioner, Combiner, Reducer, and RecordWriter
 
  I packaged the input format class in a jar file and ran the job
  invocation command:
 
  hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf
  conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr
 
  It keeps failing with error ClassNotFoundException
  Although I've specified the jar file name with the -jar parameter, the
  input format class still cannot be located. Is there any other means to
  specify the input format class, or the job jar file, for a Pipes job ?
 
  Stack trace:
 
  Exception in thread main java.lang.ClassNotFoundException:
  mytest.PriceInputFormat
  at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:247)
  at
 
  org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
  at
  org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:309)
  at
  org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:357)
 
  Thanks,
 
 
  Rahul Sood
  [EMAIL PROTECTED]
 
 
 



incorrect data check

2008-04-08 Thread Colin Freas
running a job on my 5 node cluster, i get these intermittent exceptions in
my logs:

java.io.IOException: incorrect data check
at 
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
Method)
at 
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
at java.io.InputStream.read(InputStream.java:89)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084


they occur across all the nodes, but i can't figure out which file is
causing the problem.  i'm working on the assumption it's a specific file
because it's precisely the same error that occurs on each node.  i've
scoured the logs and can't find any reference to which file caused the
hiccup.  but this is causing the job to fail.  other files are processed
without a problem.  the files are 720 .gz files, ~100mb each.  other files
are processed on each node without a problem.  i'm in the middle of testing
the .gz files, but i don't think the problem is necessarily in the source data,
as much as in when i copied it into hdfs.

so my questions are these:
is this a known issue?
is there some way to determine which file or files are causing these
exceptions?
is there a way to run something like gzip -t blah.gz on the file in hdfs?
or maybe a checksum?
is there a reason other than a corrupt datafile that would be causing this?
in the original mapreduce paper, they talk about a mechanism to skip records
that cause problems.  is there a way to have hadoop skip these problematic
files and the associated records and continue with the job?


thanks,
colin


Re: secondary namenode web interface

2008-04-08 Thread Yuri Pradkin
I'd be happy to file a JIRA for the bug, I just want to make sure I understand 
what the bug is: is it the misleading null pointer message or is it that 
someone is listening on this port and not doing anything useful?  I mean,
what is the configuration parameter dfs.secondary.http.address for?  Unless 
there are plans to make this interface work, this config parameter should go 
away, and so should the listening thread, shouldn't they?

Thanks,
  -Yuri

On Friday 04 April 2008 03:30:46 pm dhruba Borthakur wrote:
 Your configuration is good. The secondary Namenode does not publish a
 web interface. The null pointer message in the secondary Namenode log
 is a harmless bug but should be fixed. It would be nice if you can open
 a JIRA for it.

 Thanks,
 Dhruba


 -Original Message-
 From: Yuri Pradkin [mailto:[EMAIL PROTECTED]
 Sent: Friday, April 04, 2008 2:45 PM
 To: core-user@hadoop.apache.org
 Subject: Re: secondary namenode web interface

 I'm re-posting this in hope that someone would help.  Thanks!

 On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:
  Hi,
 
  I'm running Hadoop (latest snapshot) on several machines and in our

 setup

  namenode and secondarynamenode are on different systems.  I see from

 the

  logs than secondary namenode regularly checkpoints fs from primary
  namenode.
 
  But when I go to the secondary namenode HTTP

 (dfs.secondary.http.address)

  in my browser I see something like this:
 
  HTTP ERROR: 500
  init
  RequestURI=/dfshealth.jsp
  Powered by Jetty://
 
  And in secondary's log I find these lines:
 
  2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
  java.lang.NullPointerException
    at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
    at java.lang.Class.newInstance0(Class.java:373)
    at java.lang.Class.newInstance(Class.java:326)
    at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
    at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
 
  Is something missing from my configuration?  Anybody else seen these?
 
  Thanks,
 
-Yuri




RE: secondary namenode web interface

2008-04-08 Thread dhruba Borthakur
The secondary Namenode uses the HTTP interface to pull the fsimage from
the primary. Similarly, the primary Namenode uses the
dfs.secondary.http.address to pull the checkpointed-fsimage back from
the secondary to the primary. So, the definition of
dfs.secondary.http.address is needed.
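
For reference, the address is just an ordinary hadoop-site.xml property; a
sketch with the commonly used default binding (adjust host and port for your
cluster):

  <property>
    <name>dfs.secondary.http.address</name>
    <value>0.0.0.0:50090</value>
  </property>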

However, the servlet dfshealth.jsp should not be served from the
secondary Namenode. This servlet should be set up in such a way that only
the primary Namenode invokes it.

Thanks,
dhruba

-Original Message-
From: Yuri Pradkin [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 08, 2008 10:11 AM
To: core-user@hadoop.apache.org
Subject: Re: secondary namenode web interface

I'd be happy to file a JIRA for the bug, I just want to make sure I
understand 
what the bug is: is it the misleading null pointer message or is it
that 
someone is listening on this port and not doing anything useful?  I
mean,
what is the configuration parameter dfs.secondary.http.address for?
Unless 
there are plans to make this interface work, this config parameter
should go 
away, and so should the listening thread, shouldn't they?

Thanks,
  -Yuri

On Friday 04 April 2008 03:30:46 pm dhruba Borthakur wrote:
 Your configuration is good. The secondary Namenode does not publish a
 web interface. The null pointer message in the secondary Namenode
log
 is a harmless bug but should be fixed. It would be nice if you can
open
 a JIRA for it.

 Thanks,
 Dhruba


 -Original Message-
 From: Yuri Pradkin [mailto:[EMAIL PROTECTED]
 Sent: Friday, April 04, 2008 2:45 PM
 To: core-user@hadoop.apache.org
 Subject: Re: secondary namenode web interface

 I'm re-posting this in hope that someone would help.  Thanks!

 On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:
  Hi,
 
  I'm running Hadoop (latest snapshot) on several machines and in our

 setup

  namenode and secondarynamenode are on different systems.  I see from

 the

  logs than secondary namenode regularly checkpoints fs from primary
  namenode.
 
  But when I go to the secondary namenode HTTP

 (dfs.secondary.http.address)

  in my browser I see something like this:
 
  HTTP ERROR: 500
  init
  RequestURI=/dfshealth.jsp
  Powered by Jetty://
 
  And in secondary's log I find these lines:
 
  2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
  java.lang.NullPointerException
    at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
    at java.lang.Class.newInstance0(Class.java:373)
    at java.lang.Class.newInstance(Class.java:326)
    at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
    at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
 
  Is something missing from my configuration?  Anybody else seen
these?
 
  Thanks,
 
-Yuri




Re: Reduce Sort

2008-04-08 Thread Ted Dunning



On 4/8/08 10:43 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote:

 I would like to try using Hadoop.

That is good for education, probably bad for run time.  It could take
SECONDS longer to run (oh my).

 Do you mean to write another MapReduce program which takes the output of the
 first MapReduce (the already existing file of this format)

Yes.

 And use count as the key and IP Address as the value.

Yes.

 Is it possible to do this in the same program instead of writing another one.

No.

 If it is not possible, is it something available in Hadoop once the first
 program is done, can I call Second program to do the sorting.

Yes.  If you are using Java, just create a second configuration and do the
same thing as you did the first time to run the program.
 
 If I set the number of reducer to 1, then it will take more time to reduce all
 the maps and hence affect the performance right?

Not really.  Most of the sorting work will be done by the mappers.  The
reducer will only be merging the data so it will be pretty fast.  The
largest cost will be startup time, the second largest will be network
transfer time.




RE: Reduce Sort

2008-04-08 Thread Natarajan, Senthil
Thanks Ted.

I would like to try using Hadoop.
Do you mean to write another MapReduce program which takes the output of the 
first MapReduce (the already existing file of this format)

IP Add   Count
1.2. 5. 42   27
2.8. 6. 6   24
7.9.24.13   8
7.9. 6. 9201

And use count as the key and IP Address as the value.
Is it possible to do this in the same program instead of writing another one.
If it is not possible, is it something available in Hadoop once the first 
program is done, can I call
Second program to do the sorting.

If I set the number of reducer to 1, then it will take more time to reduce all 
the maps and hence affect the performance right?

Thanks,
Senthil




-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 08, 2008 11:53 AM
To: core-user@hadoop.apache.org; '[EMAIL PROTECTED]'
Subject: Re: Reduce Sort


There are two ways to do this.  Both of them assume that you have counted
the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP address is likely to be relatively
small, simply sorting the results using conventional sort is probably as
good as it gets.  This will take just a few lines of scripting code:

   base=http://your-nameserver-here/data
   wget $base/your-results-directory/part-0 --output-document=a
   wget $base/your-results-directory/part-1 --output-document=b
   sort -k1nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve
the concatenation of a wild-carded list of files, but the method I show
above isn't bad.

You are likely to be unhappy at the perceived impurity of this approach, but
I would ask to think about why one might use hadoop at all.  The best reason
is to get high performance on large problems.  The sorting part of this
problem is not all that big a deal and using a conventional sort is probably
the most effective approach here.

You can also do the sorting using hadoop.  Just use a mapper that moves the
count to the key and keeps the IP as the value.  I think that if you use an
IntWritable or LongWritable as the key then the default sorting would give
you ascending order.  You can also define the sort order so that you get
descending order.  Make sure you set the number of reducers to 1 so that you
only get a single output file.

If you have less than 10 million values, the conventional sort is likely to
be faster simply because of hadoop's startup time.
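
A minimal sketch of such a mapper against the old mapred API (exact signatures
vary slightly by version), assuming the first job's output lines look like
"<ip><tab><count>"; the class name and split logic are illustrative. Pair it
with the IdentityReducer, one reduce task, and a descending comparator if you
want the largest counts first:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class SwapMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<LongWritable, Text> out, Reporter reporter)
          throws IOException {
        String[] parts = line.toString().split("\t");
        if (parts.length == 2) {
          // the count becomes the key (so the framework sorts on it); the IP stays the value
          out.collect(new LongWritable(Long.parseLong(parts[1].trim())), new Text(parts[0]));
        }
      }
    }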


On 4/8/08 8:37 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote:

 Hi,
 I am new to MapReduce.

 After slightly modifying the example wordcount, to count the IP Address.
 I have two files part-0 and part-1 with the contents something like.

 IP Add   Count
 1.2. 5. 42   27
 2.8. 6. 6   24
 7.9.24.13   8
 7.9. 6. 9201

 I want to sort it by IP Address count in descending order(i.e.) I would expect
 to see

 7.9. 6. 9201
 1.2. 5. 42   27
 2.8. 6. 6   24
 7.9.24.13   8

 Could you please suggest how to do this.
 And to merge both the partitions (part-0 and part-1) in to one output
 file, is there any functions already available in MapReduce Framework.
 Or we need to use Java IO to do this.
 Thanks,
 Senthil



Re: secondary namenode web interface

2008-04-08 Thread Konstantin Shvachko

Yuri,

The NullPointerException should be fixed as Dhruba proposed.
We do not have any secondary nn web interface as of today.
The http server is used for transferring data between the primary and the 
secondary.
I don't see we can display anything useful on the secondary web UI except for 
the
current status, config values, and the last checkpoint date/time.
If you have anything in mind that can be displayed on the UI please let us know.
You can also find a jira for the issue, it would be good if this discussion
is reflected in it.

Thanks,
--Konstantin

dhruba Borthakur wrote:

The secondary Namenode uses the HTTP interface to pull the fsimage from
the primary. Similarly, the primary Namenode uses the
dfs.secondary.http.address to pull the checkpointed-fsimage back from
the secondary to the primary. So, the definition of
dfs.secondary.http.address is needed.

However, the servlet dfshealth.jsp should not be served from the
secondary Namenode. This servet should be setup in such a way that only
the primary Namenode invokes this servlet.

Thanks,
dhruba

-Original Message-
From: Yuri Pradkin [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 08, 2008 10:11 AM

To: core-user@hadoop.apache.org
Subject: Re: secondary namenode web interface

I'd be happy to file a JIRA for the bug, I just want to make sure I understand
what the bug is: is it the misleading null pointer message or is it that
someone is listening on this port and not doing anything useful?  I mean,
what is the configuration parameter dfs.secondary.http.address for?  Unless
there are plans to make this interface work, this config parameter should go
away, and so should the listening thread, shouldn't they?

Thanks,
  -Yuri

On Friday 04 April 2008 03:30:46 pm dhruba Borthakur wrote:

 Your configuration is good. The secondary Namenode does not publish a
 web interface. The null pointer message in the secondary Namenode log
 is a harmless bug but should be fixed. It would be nice if you can open
 a JIRA for it.

 Thanks,
 Dhruba


 -Original Message-
 From: Yuri Pradkin [mailto:[EMAIL PROTECTED]
 Sent: Friday, April 04, 2008 2:45 PM
 To: core-user@hadoop.apache.org
 Subject: Re: secondary namenode web interface

 I'm re-posting this in hope that someone would help.  Thanks!

 On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:

  Hi,

  I'm running Hadoop (latest snapshot) on several machines and in our setup
  namenode and secondarynamenode are on different systems.  I see from the
  logs than secondary namenode regularly checkpoints fs from primary
  namenode.

  But when I go to the secondary namenode HTTP (dfs.secondary.http.address)
  in my browser I see something like this:

  HTTP ERROR: 500
  init
  RequestURI=/dfshealth.jsp
  Powered by Jetty://

  And in secondary's log I find these lines:

  2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
  java.lang.NullPointerException
    at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
    at java.lang.Class.newInstance0(Class.java:373)
    at java.lang.Class.newInstance(Class.java:326)
    at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
    at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

  Is something missing from my configuration?  Anybody else seen these?

  Thanks,
    -Yuri







New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Ian Tegebo
The wiki has been down for more than a day, any ETA?  I was going to search the
archives for the status, but I'm getting 403's for each of the Archive links on
the mailing list page:

http://hadoop.apache.org/core/mailing_lists.html

My original question was about specifying MaxMapTaskFailuresPercent as a job
conf parameter on the command line for streaming jobs.  Is there a conf setting
like the following?

mapred.taskfailure.percent

My use case is log parsing files in DFS whose replication is
necessarily set to 1; I'm
seeing block retrieval failures that kill my job (I don't care if ~10% fail).

Thanks,

-- 
Ian Tegebo


Headers and footers on Hadoop output results

2008-04-08 Thread ncardoso

Hello.

I'm using Hadoop to process several XML files, each with several XML
records, through a group of Linux servers. I am using an XMLInputFormat that
I found here in Nabble
(http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html),
and I'm using the TextOutputFormat with an overridden write
function, to output XML.

Yet, the XML needs its root tag and the <?xml ?> declaration. Where is the best
place to put two writing functions like header() and footer()? I've tried, but
all I managed was to write in the local task, not in the synchronized
part- file.
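
One common pattern is to put the header and footer in the RecordWriter itself,
so each part file is a complete document. A minimal sketch against the old
mapred API (signatures may differ slightly by version), assuming Text values
that already hold serialized XML records and a hypothetical <records> root
element; it would be returned from getRecordWriter() in your output format:

    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;

    public class XmlRecordWriter implements RecordWriter<Text, Text> {
      private final DataOutputStream out;

      public XmlRecordWriter(DataOutputStream out) throws IOException {
        this.out = out;
        // header: written once, when the part file is opened
        out.writeBytes("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<records>\n");
      }

      public void write(Text key, Text value) throws IOException {
        out.writeBytes(value.toString());   // value is assumed to be one serialized record
        out.writeBytes("\n");
      }

      public void close(Reporter reporter) throws IOException {
        out.writeBytes("</records>\n");     // footer: written once, when the task closes the file
        out.close();
      }
    }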
-- 
View this message in context: 
http://www.nabble.com/Headers-and-footers-on-Hadoop-output-results-tp16571385p16571385.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: secondary namenode web interface

2008-04-08 Thread Yuri Pradkin
On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote:
 If you have anything in mind that can be displayed on the UI please let us
 know. You can also find a jira for the issue, it would be good if this
 discussion is reflected in it.

Well, I guess we could have interface to browse the checkpointed image 
(actually this is what I was expecting to see), but it's not that big of a 
deal.

Filed 
https://issues.apache.org/jira/browse/HADOOP-3212

Thanks,

  -Yuri



Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Ted Dunning

Looks like it is up to me.


On 4/8/08 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote:

 The wiki has been down for more than a day, any ETA?  I was going to search
 the
 archives for the status, but I'm getting 403's for each of the Archive links
 on
 the mailing list page:
 
 http://hadoop.apache.org/core/mailing_lists.html
 
 My original question was about specifying MaxMapTaskFailuresPercent as a job
 conf parameter on the command line for streaming jobs.  Is there a conf
 setting
 like the following?
 
 mapred.taskfailure.percent
 
 My use case is log parsing files in DFS whose replication is
 necessarily set to 1; I'm
 seeing block retrieval failures that kill my job (I don't care if ~10% fail).
 
 Thanks,



Re: DFS behavior when the disk goes bad

2008-04-08 Thread Raghu Angadi


The behavior seems correct.
Assuming blacklisted to mean NameNode marked this node 'dead' :

Murali Krishna wrote:

*   We are running a small cluster with 10 data nodes and a name
node
*	Each data node has 6 disks 
*	While a job was running, one of the disk in one data node got
corrupted and the node got blacklisted 
*	The job got killed because there was some space issue in the

entire cluster and it couldn't continue
*   Now I tried removing some unnecessary data, the disk usages
started coming down in all nodes except the node which got blacklisted
in the last job (is this expected?)


DataNode will be marked dead if it does not heartbeat with NameNode for 
10min or so. A bad disk could cause that (a thread trying delete block 
file from the bad disk might be hung, for e.g.)


Once a datanode is marked dead, NameNode does not interact with it, so
it did not remove any files from that node.



*   I restarted the entire cluster, after some time the disk usage
started coming down on that corrupted disk node and it went very low.
Essentially, it has removed everything from that node. (Does the dfs
remove the data from the all disks from the node if one of the disk was
bad? And why it didn't do before restarting?)


By the time this node came back and functioned well, NameNode had already
re-replicated the valid data that was sitting on this node.. so when the
node came up, NameNode deleted most of the files on the datanode since that
data is either already deleted or replicated on other nodes.
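
One way to see which datanodes the NameNode currently counts as live or dead,
and how much space each reports, is the dfsadmin report (output format varies
by version):

   bin/hadoop dfsadmin -report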




Re: incorrect data check

2008-04-08 Thread Colin Freas
so, in an attempt to track down this problem, i've stripped out most of the
files for input, trying to identify which ones are causing the problem.

i've narrowed it down, but i can't pinpoint it.  i keep getting these
incorrect data check errors below, but the .gz files test fine with gzip.

is there some way to run an md5 or something on the files in hdfs and
compare it to the checksum of the files on my local machine?

i've looked around the lists and through the various options to send to
.../bin/hadoop, but nothing is jumping out at me.

this is particularly frustrating because it's causing my jobs to fail,
rather than skipping the problematic input files.  i've also looked through
the conf file and don't see anything similar about skipping bad files
without killing the job.

-colin


On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas [EMAIL PROTECTED] wrote:

 running a job on my 5 node cluster, i get these intermittent exceptions in
 my logs:

 java.io.IOException: incorrect data check
   at 
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native 
 Method)

   at 
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
   at 
 org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
   at 
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)

   at java.io.InputStream.read(InputStream.java:89)
   at 
 org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
   at 
 org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)

   at 
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
   at 
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
   at 
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)

   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084


 they occur accross all the nodes, but i can't figure out which file is
 causing the problem.  i'm working on the assumption it's a specific file
 because it's precisely the same error that occurs on each node.  i've
 scoured the logs and can't find any reference to which file caused the
 hiccup.  but this is causing the job to fail.  other files are processed
 without a problem.  the files are 720 .gz files, ~100mb each.  other files
 are processed on each node without a problem.  i'm in the middle testing the
 .gz files, but i don't think the problem is necessarily in the source data,
 as much as in when i copied it into hdfs.

 so my questions are these:
 is this a known issue?
 is there some way to determine which file or files are causing these
 exceptions?
 is there a way to run something like gzip -t blah.gz on the file in
 hdfs?  or maybe a checksum?
 is there a reason other than a corrupt datafile that would be causing
 this?
 in the original mapreduce paper, they talk about a mechanism to skip
 records that cause problems.  is there a way to have hadoop skip these
 problematic files and the associated records and continue with the job?


 thanks,
 colin



Re: incorrect data check

2008-04-08 Thread Norbert Burger
Colin, how about writing a streaming mapper which simply runs md5sum on each
file it gets as input?  Run this task along with the identity reducer, and
you should be able to identify pretty quickly if there's an HDFS corruption
issue.

Norbert
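
One way to do the comparison, sketched with hypothetical paths: copy the file
back out of HDFS, then checksum both copies locally, and/or test the gzip
stream of the HDFS copy directly:

   bin/hadoop dfs -get /user/colin/logs/file-0042.gz /tmp/file-0042.gz
   md5sum /tmp/file-0042.gz /local/source/file-0042.gz
   gzip -t /tmp/file-0042.gz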

On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas [EMAIL PROTECTED] wrote:

 so, in an attempt to track down this problem, i've stripped out most of
 the
 files for input, trying to identify which ones are causing the problem.

 i've narrowed it down, but i can't pinpoint it.  i keep getting these
 incorrect data check errors below, but the .gz files test fine with gzip.

 is there some way to run an md5 or something on the files in hdfs and
 compare it to the checksum of the files on my local machine?

 i've looked around the lists and through the various options to send to
 .../bin/hadoop, but nothing is jumping out at me.

 this is particularly frustrating because it's causing my jobs to fail,
 rather than skipping the problematic input files.  i've also looked
 through
 the conf file and don't see anything similar about skipping bad files
 without killing the job.

 -colin


 On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas [EMAIL PROTECTED] wrote:

  running a job on my 5 node cluster, i get these intermittent exceptions
 in
  my logs:
 
  java.io.IOException: incorrect data check
at
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
 Method)
 
at
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
at
 org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
at
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
 
at java.io.InputStream.read(InputStream.java:89)
at
 org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
at
 org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
 
at
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
at
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
at
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084
 
 
  they occur accross all the nodes, but i can't figure out which file is
  causing the problem.  i'm working on the assumption it's a specific file
  because it's precisely the same error that occurs on each node.  i've
  scoured the logs and can't find any reference to which file caused the
  hiccup.  but this is causing the job to fail.  other files are processed
  without a problem.  the files are 720 .gz files, ~100mb each.  other
 files
  are processed on each node without a problem.  i'm in the middle testing
 the
  .gz files, but i don't think the problem is necessarily in the source
 data,
  as much as in when i copied it into hdfs.
 
  so my questions are these:
  is this a known issue?
  is there some way to determine which file or files are causing these
  exceptions?
  is there a way to run something like gzip -t blah.gz on the file in
  hdfs?  or maybe a checksum?
  is there a reason other than a corrupt datafile that would be causing
  this?
  in the original mapreduce paper, they talk about a mechanism to skip
  records that cause problems.  is there a way to have hadoop skip these
  problematic files and the associated records and continue with the job?
 
 
  thanks,
  colin
 



Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Rick Cox
On Tue, Apr 8, 2008 at 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote:


  My original question was about specifying MaxMapTaskFailuresPercent as a job
  conf parameter on the command line for streaming jobs.  Is there a conf 
 setting
  like the following?

  mapred.taskfailure.percent

The job conf settings to control this are:

mapred.max.map.failures.percent
mapred.max.reduce.failures.percent

Both have a default of 0, meaning any failed task makes for a failed
job (according to JobConf.java).

rick
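
A sketch of passing this to a streaming job; the streaming jar path and the
-jobconf option are as used in the 0.16-era streaming contrib (newer releases
take -D instead), and the mapper/reducer here are placeholders:

   hadoop jar contrib/streaming/hadoop-*-streaming.jar \
       -jobconf mapred.max.map.failures.percent=10 \
       -input logs/ -output out/ \
       -mapper /bin/cat -reducer /bin/wc

From Java, the corresponding JobConf setter is setMaxMapTaskFailuresPercent(int),
if your version has it.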


Re: secondary namenode web interface

2008-04-08 Thread Konstantin Shvachko

Unfortunately we do not have an api for the secondary nn that would allow 
browsing the checkpoint.
I agree it would be nice to have one.
Thanks for filing the issue.
--Konstantin

Yuri Pradkin wrote:

On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote:


If you have anything in mind that can be displayed on the UI please let us
know. You can also find a jira for the issue, it would be good if this
discussion is reflected in it.



Well, I guess we could have interface to browse the checkpointed image 
(actually this is what I was expecting to see), but it's not that big of a 
deal.


Filed 
https://issues.apache.org/jira/browse/HADOOP-3212


Thanks,

  -Yuri



Fuse-j-hadoopfs

2008-04-08 Thread xavier.quintuna
Hi everybody,

I have a question about fuse-j-hadoopfs. Does it handle the Hadoop
permissions?

I'm using Hadoop 0.16.3.

Thanks

X