Pangool: easier Hadoop, same performance

2012-03-06 Thread Pere Ferrera
Hi,
I'd like to introduce you to Pangool (http://pangool.net/), an easier
low-level MapReduce API for Hadoop. I'm one of the developers. We just
open-sourced it yesterday.

Pangool is a low-level Java MapReduce API with the same flexibility and
performance as the plain Java Hadoop MapReduce API. The difference is
that it makes a lot of things easier to code and understand.

A few of Pangool's features:
- Tuple-based intermediate serialization (allowing easier development).
- Built-in, easy-to-use group by and sort by (removing boilerplate code for
things like secondary sort).
- Built-in, easy-to-use reduce-side joins (which are quite hard to
implement in Hadoop).
- Augmented Hadoop API: Built-in multiple inputs / outputs, configuration
via object instance.

Pangool aims to make Hadoop's steep learning curve a lot smoother while
retaining all of its features, power and flexibility. It differs from
high-level tools like Pig or Hive in that it can be used as a replacement
for the low-level API. There is no performance / flexibility penalty for
using Pangool.

We ran an initial benchmark (http://pangool.net/benchmark.html) to
demonstrate this.

I'd be very interested in hearing your feedback, opinions and questions on
it.

Cheers,

Pere.


Re: why does my mapper class read my input file twice?

2012-03-06 Thread Jane Wayne
Harsh,

Thanks. I went into the code on FileInputFormat.addInputPath(Job, Path) and
it is as you stated. That makes sense now. I simply commented out
FileInputFormat.addInputPath(job, input) and
FileOutputFormat.setOutputPath(job, output) and everything
automagically works now.

Thanks a bunch!

On Tue, Mar 6, 2012 at 2:06 AM, Harsh J ha...@cloudera.com wrote:

 It's your use of the mapred.input.dir property, which is a reserved
 name in the framework (it's what FileInputFormat uses).

 You have a config you extract the path from:
 Path input = new Path(conf.get("mapred.input.dir"));

 Then you do:
 FileInputFormat.addInputPath(job, input);

 which internally simply appends a path to a config prop called
 mapred.input.dir. Hence your job gets launched with two input files
 (the very same one) - one added by the default Tool-provided configuration
 (because of your -Dmapred.input.dir) and the other added by you.

 Fix the input path line to use a different config:
 Path input = new Path(conf.get("input.path"));

 And run job as:
 hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt
 -Dmapred.output.dir=result
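
 Putting it together, a rough sketch of the corrected run() method (the
 "input.path" property name is just the example used above; the output
 path handling stays as it was):

 @Override
 public int run(String[] args) throws Exception {
   Configuration conf = getConf();
   // Custom property, so only FileInputFormat.addInputPath() below
   // ever touches the reserved mapred.input.dir.
   Path input = new Path(conf.get("input.path"));
   Path output = new Path(conf.get("mapred.output.dir"));

   Job job = new Job(conf, "dummy job");
   job.setJarByClass(MyJob.class);
   job.setMapperClass(MyMapper.class);
   job.setReducerClass(MyReducer.class);
   job.setOutputKeyClass(LongWritable.class);
   job.setOutputValueClass(Text.class);

   FileInputFormat.addInputPath(job, input);    // adds the single input exactly once
   FileOutputFormat.setOutputPath(job, output);

   return job.waitForCompletion(true) ? 0 : 1;
 }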

 On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne jane.wayne2...@gmail.com
 wrote:
  i have code that reads in a text file. i notice that each line in the
 text
  file is somehow being read twice. why is this happening?
 
  my mapper class looks like the following:
 
  public class MyMapper extends Mapper<LongWritable, Text, LongWritable,
  Text> {
 
  private static final Log _log = LogFactory.getLog(MyMapper.class);
 
  @Override
  public void map(LongWritable key, Text value, Context context) throws
  IOException, InterruptedException {
  String s = (new
  StringBuilder()).append(value.toString()).append("m").toString();
  context.write(key, new Text(s));
  _log.debug(key.toString() + " = " + s);
  }
  }
 
  my reducer class looks like the following:
 
  public class MyReducer extends Reducer<LongWritable, Text, LongWritable,
  Text> {
 
  private static final Log _log = LogFactory.getLog(MyReducer.class);
 
  @Override
  public void reduce(LongWritable key, Iterable<Text> values, Context
  context) throws IOException, InterruptedException {
  for (Iterator<Text> it = values.iterator(); it.hasNext();) {
  Text txt = it.next();
  String s = (new
  StringBuilder()).append(txt.toString()).append("r").toString();
  context.write(key, new Text(s));
  _log.debug(key.toString() + " = " + s);
  }
  }
  }
 
  my job class looks like the following:
 
  public class MyJob extends Configured implements Tool {
 
  public static void main(String[] args) throws Exception {
  ToolRunner.run(new Configuration(), new MyJob(), args);
  }
 
  @Override
  public int run(String[] args) throws Exception {
  Configuration conf = getConf();
  Path input = new Path(conf.get("mapred.input.dir"));
  Path output = new Path(conf.get("mapred.output.dir"));
 
  Job job = new Job(conf, "dummy job");
  job.setMapOutputKeyClass(LongWritable.class);
  job.setMapOutputValueClass(Text.class);
  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(Text.class);
 
  job.setMapperClass(MyMapper.class);
  job.setReducerClass(MyReducer.class);
 
  FileInputFormat.addInputPath(job, input);
  FileOutputFormat.setOutputPath(job, output);
 
  job.setJarByClass(MyJob.class);
 
  return job.waitForCompletion(true) ? 0 : 1;
  }
  }
 
  the text file that i am trying to read in looks like the following. as
 you
  can see, there are 9 lines.
 
  T, T
  T, T
  T, T
  F, F
  F, F
  F, F
  F, F
  T, F
  F, T
 
  the output file that i get after my Job runs looks like the following. as
  you can see, there are 18 lines. each key is emitted twice from the
 mapper
  to the reducer.
 
  0   T, Tmr
  0   T, Tmr
  6   T, Tmr
  6   T, Tmr
  12  T, Tmr
  12  T, Tmr
  18  F, Fmr
  18  F, Fmr
  24  F, Fmr
  24  F, Fmr
  30  F, Fmr
  30  F, Fmr
  36  F, Fmr
  36  F, Fmr
  42  T, Fmr
  42  T, Fmr
  48  F, Tmr
  48  F, Tmr
 
  the way i execute my Job is as follows (cygwin + hadoop 0.20.2).
 
  hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt
  -Dmapred.output.dir=result
 
  originally, this happened when i read in a sequence file, but even for a
  text file, this problem is still happening. is it the way i have setup my
  Job?



 --
 Harsh J



Re: is there any way to detect the file size as i am writing a sequence file?

2012-03-06 Thread Joey Echeverria
I think you mean Writer.getLength(). It returns the current position
in the output stream in bytes (more or less the current size of the
file).
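
A rough sketch of rolling to a new file once the current one passes 64 MB
(using the old 0.20-style API; the input/output paths and the surrounding
loop are made up for illustration):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
long maxBytes = 64L * 1024 * 1024;
int part = 0;
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("seq/part-" + part), Text.class, Text.class);
for (FileStatus stat : fs.listStatus(new Path("txt-input"))) {
  byte[] bytes = new byte[(int) stat.getLen()];
  FSDataInputStream in = fs.open(stat.getPath());
  in.readFully(bytes);
  in.close();
  // key = file name, value = file contents
  writer.append(new Text(stat.getPath().getName()), new Text(bytes));
  if (writer.getLength() >= maxBytes) {   // getLength() = current position in bytes
    writer.close();
    writer = SequenceFile.createWriter(
        fs, conf, new Path("seq/part-" + (++part)), Text.class, Text.class);
  }
}
writer.close();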

-Joey

On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne jane.wayne2...@gmail.com wrote:
 hi,

 i am writing a little util class to recurse into a directory and add all
 *.txt files into a sequence file (key is the file name, value is the
 content of the corresponding text file). as i am writing (i.e.
 SequenceFile.Writer.append(key, value)), is there any way to detect how
 large the sequence file is?

 for example, i want to create a new sequence file as soon as the current
 one exceeds 64 MB.

 i notice there is a SequenceFile.Writer.getLong() which the javadocs says
 returns the current length of the output file, but that is vague. what is
 this Writer.getLong() method? is it the number of bytes, kilobytes,
 megabytes, or something else?

 thanks,



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: is there any way to detect the file size as i am writing a sequence file?

2012-03-06 Thread Jane Wayne
Thanks Joey. That's what I meant (I've been staring at the screen too
long). :)

On Tue, Mar 6, 2012 at 10:00 AM, Joey Echeverria j...@cloudera.com wrote:

 I think you mean Writer.getLength(). It returns the current position
 in the output stream in bytes (more or less the current size of the
 file).

 -Joey

 On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne jane.wayne2...@gmail.com
 wrote:
  hi,
 
  i am writing a little util class to recurse into a directory and add all
  *.txt files into a sequence file (key is the file name, value is the
  content of the corresponding text file). as i am writing (i.e.
  SequenceFile.Writer.append(key, value)), is there any way to detect how
  large the sequence file is?
 
  for example, i want to create a new sequence file as soon as the current
  one exceeds 64 MB.
 
  i notice there is a SequenceFile.Writer.getLong() which the javadocs says
  returns the current length of the output file, but that is vague. what
 is
  this Writer.getLong() method? is it the number of bytes, kilobytes,
  megabytes, or something else?
 
  thanks,



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434



Re: how to get rid of -libjars ?

2012-03-06 Thread Ioan Eugen Stan

On 06.03.2012 17:37, Jane Wayne wrote:

currently, i have my main jar and then 2 dependent jars. what i do is
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

then, when i need to run my job, MyJob inside main.jar, i do the following.

hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

what i want to do is NOT copy the dependent jars to $HADOOP/lib and always
specify -libjars. is there any way around this multi-step procedure? i
really do not want to clutter $HADOOP/lib or specify a comma-delimited list
of jars for -libjars.

any help is appreciated.



Hello,

Specify the full path to the jar on -libjars? My experience with 
-libjars is that it didn't work as advertised.

Search for an older post on the list about this issue (-libjars not 
working). I tried adding a lot of jars and some got on the job classpath 
(2 of them), some didn't (most of them).

I got over this by including all the jars in a lib directory inside the 
main jar.


Cheers,
--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: how to get rid of -libjars ?

2012-03-06 Thread Bejoy Ks
Hi Jane

+ Adding on to Joey's comments

If you want to eliminate the process of distributing the dependent
jars every time, then you need to manually pre-distribute these jars across
the nodes and add them to the classpath of all nodes. This approach may
be chosen if you periodically run some job at a high frequency
on your cluster that needs external jars.
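
For example (a sketch only; the jar locations below are made up, and note
that HADOOP_CLASSPATH is colon-separated like a normal Java classpath),
on every node you could add to conf/hadoop-env.sh:

export HADOOP_CLASSPATH=/opt/myjars/dependent-1.jar:/opt/myjars/dependent-2.jar:$HADOOP_CLASSPATH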

Regards
Bejoy.K.S

On Tue, Mar 6, 2012 at 9:23 PM, Joey Echeverria j...@cloudera.com wrote:

 If you're using -libjars, there's no reason to copy the jars into
 $HADOOP lib. You may have to add the jars to the HADOOP_CLASSPATH if
 you use them from your main() method:

 export HADOOP_CLASSPATH=dependent-1.jar,dependent-2.jar
 hadoop jar main.jar demo.MyJob -libjars
 dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path
 -Dmapred.output.dir=/output/path

 -Joey

 On Tue, Mar 6, 2012 at 10:37 AM, Jane Wayne jane.wayne2...@gmail.com
 wrote:
  currently, i have my main jar and then 2 dependent jars. what i do is
  1. copy dependent-1.jar to $HADOOP/lib
  2. copy dependent-2.jar to $HADOOP/lib
 
  then, when i need to run my job, MyJob inside main.jar, i do the
 following.
 
  hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
  -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path
 
  what i want to do is NOT copy the dependent jars to $HADOOP/lib and
 always
  specify -libjars. is there any way around this multi-step procedure? i
  really do not want to clutter $HADOOP/lib or specify a comma-delimited
 list
  of jars for -libjars.
 
  any help is appreciated.



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434



Re: how to get rid of -libjars ?

2012-03-06 Thread Shi Yu
1.  Wrap all your dependent jar files inside your artifact; they should be 
under a lib folder. Sometimes this can make your jar file quite big; if you 
want to save the time of uploading big jar files remotely, see 2.
2.  Using -libjars with a full path or a relative path (w.r.t. your jar 
package) should work.
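
As a sketch of option 1, the job jar would then look something like this
(class and jar names taken from the example in this thread); Hadoop adds
any jars under lib/ inside the job jar to the task classpath:

$ jar tf main.jar
demo/MyJob.class
lib/dependent-1.jar
lib/dependent-2.jar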


On 3/6/2012 9:55 AM, Ioan Eugen Stan wrote:

On 06.03.2012 17:37, Jane Wayne wrote:

currently, i have my main jar and then 2 dependent jars. what i do is
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

then, when i need to run my job, MyJob inside main.jar, i do the 
following.


hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

what i want to do is NOT copy the dependent jars to $HADOOP/lib and 
always

specify -libjars. is there any way around this multi-step procedure? i
really do not want to clutter $HADOOP/lib or specify a 
comma-delimited list

of jars for -libjars.

any help is appreciated.



Hello,

Specify the full path to the jar on the -libjars? My experience with 
-libjars is that it didn't work as advertised.


Search for an older post on the list about this issue ( -libjars not 
working). I tried adding a lot of jars and some got on the job 
classpath (2), some didn't (most of them).


I got over this by including all the jars in a lib directory inside 
the main jar.


Cheers,




HDFS Reporting Tools

2012-03-06 Thread Oren Livne

Dear All,

We are maintaining a 60-node hadoop cluster for external users, and 
would like to be automatically notified via email when an HDFS crash or 
some other infrastructure failure occurs that is not due to a user 
programming error. We've been encountering such soft errors, where 
hadoop does not crash, but becomes very slow and jobs hang for a long 
time and then fail.


Are there existing tools that provide this capability? Or do we have to 
manually monitor the web interfaces at http://namenode and 
http://namenode:50030?


Thank you so much,
Oren

--
We plan ahead, which means we don't do anything right now.
  -- Valentine (Tremors)




Hadoop EC2 user-data script

2012-03-06 Thread Sagar Nikam
Hi,

I am new to Hadoop. I want to try a Hadoop installation using OpenStack.
The OpenStack API for launching an instance (VM) has a parameter for passing
user-data. Here we can pass scripts which will be executed on first
boot.

This is similar to EC2 user-data.

I would like to know about the hadoop user-data script. Any help on this is
appreciated.

Thanks in advance.

Regards,
Sagar


Re: Java Heap space error

2012-03-06 Thread Mohit Anchlia
I am still trying to see how to narrow this down. Is it possible to set
heapdumponoutofmemoryerror option on these individual tasks?

On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Sorry for multiple emails. I did find:


 2012-03-05 17:26:35,636 INFO
 org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call-
 Usage threshold init = 715849728(699072K) used = 575921696(562423K)
 committed = 715849728(699072K) max = 715849728(699072K)

 2012-03-05 17:26:35,719 INFO
 org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
 7816154 bytes from 1 objects. init = 715849728(699072K) used =
 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)

 2012-03-05 17:26:36,881 INFO
 org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
 - Collection threshold init = 715849728(699072K) used = 358720384(350312K)
 committed = 715849728(699072K) max = 715849728(699072K)

 2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
 Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

 2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error
 running child : java.lang.OutOfMemoryError: Java heap space

 at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:39)

 at java.nio.CharBuffer.allocate(CharBuffer.java:312)

 at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)

 at org.apache.hadoop.io.Text.decode(Text.java:350)

 at org.apache.hadoop.io.Text.decode(Text.java:327)

 at org.apache.hadoop.io.Text.toString(Text.java:254)

 at
 org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)

 at
 org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)

 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)

 at
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)

 at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)

 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)

 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:396)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

 at org.apache.hadoop.mapred.Child.main(Child.java:264)


   On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 All I see in the logs is:


 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
 attempt_201203051722_0001_m_30_1 - Killed : Java heap space

 Looks like task tracker is killing the tasks. Not sure why. I increased
 heap from 512 to 1G and still it fails.


 On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I currently have java.opts.mapred set to 512MB and I am getting heap
 space errors. How should I go about debugging heap space issues?






hadoop cluster ssh username

2012-03-06 Thread Pat Ferrel
I have a small cluster of servers that runs hadoop. I have a laptop that 
I'd like to use with that cluster when it is available. I set up hadoop on 
the laptop so I can switch from running locally to running on the cluster. 
Local works. I have set up passwordless ssh between all machines to work 
with 'hadoop-user', which is the linux username on the cluster machines, 
so I can ssh from the laptop to the servers without a password thusly:


ssh hadoop-user@master

But my username on the laptop is pferrel, not hadoop-user, so when 
running 'start-all.sh' it tries


ssh pferrel@master

How do I tell it to use the linux user 'hadoop-user'? I assume there is 
something in the config directory xml files that will do this?


Hadoop error=12 Cannot allocate memory

2012-03-06 Thread Rohini U
Hi,

I have a hadoop cluster of size 5 and data of size 1 GB. I am running
a simple map reduce program which reads text data and outputs
sequence files.
I found some solutions to this problem suggesting setting overcommit
to 0 and increasing the ulimit.
I have memory overcommit set to 0 and ulimit set to unlimited. Even
with this, I keep getting the following error. Is anyone aware
of any workarounds for this?
java.io.IOException: Cannot run program "bash": java.io.IOException:
error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)



Thanks
-Rohini


Re: hadoop cluster ssh username

2012-03-06 Thread Harsh J
Pat,

Your question seems to be in multiple parts, if I'm reading it right:

1. How do you manage configuration so that switching between local and
wider-cluster mode both work?

My suggestion would be to create two git branches in your conf
directory and switch them as you need, with simple git checkouts.

2. How do you get the start/stop scripts to ssh as hadoop-user instead
of using your user name?

In your masters and slaves file, instead of placing just a list of
hostnames, place hadoop-user@hostnames. That should do the trick.

If you want your SSH itself to use a different username when being
asked to connect to a hostname, follow the per-host configuration to
specify a username to automatically pick when provided a hostname:
http://technosophos.com/content/ssh-host-configuration
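
For instance, a per-host entry in ~/.ssh/config on the laptop would look
roughly like this (the hostnames are just examples):

Host master slave1 slave2
    User hadoop-user

Or, for the masters/slaves approach, each line of conf/masters and
conf/slaves would read like:

hadoop-user@master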

On Tue, Mar 6, 2012 at 11:42 PM, Pat Ferrel p...@occamsmachete.com wrote:
 I have a small cluster of servers that runs hadoop. I have a laptop that I'd
 like to use that cluster when it is available. I setup hadoop on the laptop
 so I can switch from running local to running on the cluster. Local works. I
 have setup passwordless ssh between all machines to work with 'hadoop-user'
 which is the linux username on the cluster machines so I can ssh from the
 laptop to the servers without a password thusly:

 ssh hadoop-user@master

 But my username on the laptop is pferrel, not hadoop-user so when running
 'start-all.sh' it tries

 ssh pferrel@master

 How do I tell it to use the linux user 'hadoop-user'. I assume there is
 something in the config directory xml files that will do this?



-- 
Harsh J


Re: getting NullPointerException while running Word count example

2012-03-06 Thread Harsh J
Hi Sujit,

Please also tell us which version/distribution of Hadoop is this?

On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote:
 Hi,

 I am new to Hadoop. i installed Hadoop as per
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/


 while running the Word count example i am getting NullPointerException.
 can some one please look in to this issue?

 Thanks in advance !!!


 duser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
 Found 3 items
 -rw-r--r--   1 hduser supergroup     674566 2012-03-06 23:04
 /user/hduser/data/pg20417.txt
 -rw-r--r--   1 hduser supergroup    1573150 2012-03-06 23:04
 /user/hduser/data/pg4300.txt
 -rw-r--r--   1 hduser supergroup    1423801 2012-03-06 23:04
 /user/hduser/data/pg5000.txt

 hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
 wordcount /user/hduser/data /user/hduser/gutenberg-outputd

 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process
 : 3
 12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002
 12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
 12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
 12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
 12/03/06 23:14:58 INFO mapred.JobClient: Task Id :
 attempt_201203062221_0002_r_00_0, Status : FAILED
 Error: java.lang.NullPointerException
    at
 java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
    at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)

 12/03/06 23:15:07 INFO mapred.JobClient: Task Id :
 attempt_201203062221_0002_r_00_1, Status : FAILED
 Error: java.lang.NullPointerException
    at
 java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
    at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)

 12/03/06 23:15:16 INFO mapred.JobClient: Task Id :
 attempt_201203062221_0002_r_00_2, Status : FAILED
 Error: java.lang.NullPointerException
    at
 java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
    at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)

 12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002
 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
 12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
 12/03/06 23:15:31 INFO mapred.JobClient:     Launched reduce tasks=4
 12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22084
 12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all
 reduces waiting after reserving slots (ms)=0
 12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all maps
 waiting after reserving slots (ms)=0
 12/03/06 23:15:31 INFO mapred.JobClient:     Launched map tasks=3
 12/03/06 23:15:31 INFO mapred.JobClient:     Data-local map tasks=3
 12/03/06 23:15:31 INFO mapred.JobClient:     Failed reduce tasks=1
 12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16799
 12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
 12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_READ=740520
 12/03/06 23:15:31 INFO mapred.JobClient:     HDFS_BYTES_READ=3671863
 12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2278287
 12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
 12/03/06 23:15:31 INFO mapred.JobClient:     Bytes Read=3671517
 12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
 12/03/06 23:15:31 INFO mapred.JobClient:     Map output materialized
 bytes=1474341
 12/03/06 23:15:31 INFO mapred.JobClient:     Combine output records=102322
 12/03/06 23:15:31 INFO mapred.JobClient:     Map input records=77932
 12/03/06 23:15:31 INFO mapred.JobClient:     Spilled Records=153640
 12/03/06 23:15:31 INFO mapred.JobClient:     Map output bytes=6076095
 12/03/06 23:15:31 INFO mapred.JobClient:     Combine input records=629172
 12/03/06 23:15:31 INFO mapred.JobClient:     Map output records=629172
 12/03/06 23:15:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=346
 hduser@sujit:~/Desktop/hadoop$



-- 
Harsh J


Hadoop runtime metrics

2012-03-06 Thread en-hui chang
Hi All,

We have a medium cluster of 70 nodes, running 0.20.2-cdh3u1. We collect 
run-time metrics through Ganglia. We found that certain metrics like 
waiting_reduces and tasks_failed_timeout are high, and it looks like the 
values are cumulative. Any thoughts on this will be helpful.

Thanks


Re: how is userlogs supposed to be cleaned up?

2012-03-06 Thread Arun C Murthy

On Mar 6, 2012, at 10:22 AM, Chris Curtin wrote:

 Hi,
 
 We had a fun morning trying to figure out why our cluster was failing jobs,
 removing nodes from the cluster etc. The majority of the errors were
 something like:
 
[snip]

 We are running CDH3u3.

You'll need to check with CDH lists. 

However, hadoop-1.0 (and prior releases, starting with hadoop-0.20.203) has 
mechanisms to clean up userlogs automatically; else, as you've found out, 
operating large clusters (4k nodes) with millions of jobs per month is too painful.
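
For reference, in Apache Hadoop 1.0 / 0.20.20x the retention period is, as
far as I recall, controlled by mapred.userlog.retain.hours in mapred-site.xml
(please verify the exact name against your distribution's defaults), e.g.:

<property>
  <name>mapred.userlog.retain.hours</name>
  <value>24</value>
</property>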

Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Hadoop runtime metrics

2012-03-06 Thread Arun C Murthy

On Mar 6, 2012, at 10:54 AM, en-hui chang wrote:

 Hi All,
 
 We have a medium cluster running 70 nodes, using  0.20.2-cdh3u1. We collect 
 run-time metrics thru Ganglia. We found that the certain metrics like  
 waiting_reduces , tasks_failed_timeout is high and looks the values are 
 getting cumulative. Any thoughts on this will be helpful.


You'll need to check with CDH lists.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: HDFS Reporting Tools

2012-03-06 Thread Jamack, Peter
You could set up things like Ganglia or Nagios to monitor and send off events and
issues.
Within the Hadoop ecosystem, there are things like Vaidya, maybe
Ambari (not sure, as I've not used it), and Splunk even has a new beta
Shep/Splunk Hadoop monitoring app.

Peter Jamack

On 3/6/12 8:35 AM, Oren Livne li...@uchicago.edu wrote:

Dear All,

We are maintaining a 60-node hadoop cluster for external users, and
would like to be automatically notified via email when an HDFS crash or
some other infrastructure failure occurs that is not due to a user
programming error. We've been encountering such soft errors, where
hadoop does not crash, but becomes very slow and jobs hang for a long
time and fail.

Are there existing tools that provide this capability? Or do we have to
manually monitor the web services at on http://namenode and
http://namenode:50030?

Thank you so much,
Oren

-- 
We plan ahead, which means we don't do anything right now.
   -- Valentine (Tremors)





Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-06 Thread Russell Jurney
Rules of thumb IMO:

You should be using Pig in place of MR jobs at all times when performance
isn't absolutely crucial.  Writing unnecessary MR is needless technical
debt that you will regret as people are replaced and your organization
scales.  Pig gets it done in much less time.  If you need faster jobs, then
optimize your Pig, and if that doesn't work, put a single MAPREDUCE
(http://pig.apache.org/docs/r0.9.2/basic.html#mapreduce) job
at the bottleneck.  Also, realize that it can be hard to actually beat
Pig's performance without experience.  Check that your MR job is actually
faster than Pig at the same load before assuming you can do better than Pig.

Streaming is good if your data doesn't easily map to tuples, you really
like the abstractions of your favorite language's MR library, or you
are doing something weird like simulations/pure batch jobs (no MR).

If you're doing a lot of joins and performance is a problem - consider
doing fewer joins.  I would strongly suggest that you prioritize
de-normalizing and duplicating data over switching to raw MR jobs because
HIVE joins are slow.  MapReduce is slow at joins.  Programmer time is more
valuable than machine time.  If you're having to write tons of raw MR, then
get more machines.

On Fri, Mar 2, 2012 at 6:21 AM, Subir S subir.sasiku...@gmail.com wrote:

 On Fri, Mar 2, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote:

  On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com
  wrote:
   Hello Folks,
  
   Are there any pointers to such comparisons between Apache Pig and
 Hadoop
   Streaming Map Reduce jobs?
 
  I do not see why you seek to compare these two. Pig offers a language
  that lets you write data-flow operations and runs these statements as
  a series of MR jobs for you automatically (Making it a great tool to
  use to get data processing done really quick, without bothering with
  code), while streaming is something you use to write non-Java, simple
  MR jobs. Both have their own purposes.
 

 Basically we are comparing these two to see the benefits and how much they
 help in improving the productive coding time, without jeopardizing the
 performance of MR jobs.


   Also there was a claim in our company that Pig performs better than Map
   Reduce jobs? Is this true? Are there any such benchmarks available
 
  Pig _runs_ MR jobs. It does do job design (and some data)
  optimizations based on your queries, which is what may give it an edge
  over designing elaborate flows of plain MR jobs with tools like
  Oozie/JobControl (Which takes more time to do). But regardless, Pig
  only makes it easy doing the same thing with Pig Latin statements for
  you.
 

 I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become
 pretty slow with lot of joins, which we can achieve faster with writing raw
 MR jobs. So with that context was trying to see how Pig runs MR jobs. Like
 for example what kind of projects should consider Pig. Say when we have a
 lot of Joins, which writing with plain MR jobs takes time. Thoughts?

 Thank you Harsh for your comments. They are helpful!


 
  --
  Harsh J
 




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: hadoop cluster ssh username

2012-03-06 Thread Harsh J
Pat,

On Wed, Mar 7, 2012 at 4:10 AM, Pat Ferrel p...@farfetchers.com wrote:
 Thanks, #2 below gets me partway.

 I can start-all.sh and stop-all.sh from the laptop and can fs -ls but
 copying gives me:

 Maclaurin:mahout-distribution-0.6 pferrel$ fs -copyFromLocal
 wikipedia-seqfiles/ wikipedia-seqfiles/
 2012-03-06 13:45:04.225 java[7468:1903] Unable to load realm info from
 SCDynamicStore
 copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission
 denied: user=pferrel, access=WRITE, inode=user:pat:supergroup:rwxr-xr-x

This seems like a totally different issue now, and deals with HDFS
permissions not cluster start/stop.

Yes, you have some files created (or some daemons running) as
username pat, while you now try to access them as pferrel (your local
user). This you can't work around or evade; you will need to
fix it via hadoop fs -chmod/-chown and such. You can disable
permissions if you do not need them, though: simply set dfs.permissions
to false in the NameNode's hdfs-site.xml and restart the NN.
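
A sketch of both options (the HDFS path below is purely illustrative):

# fix ownership of the existing files (run as the HDFS superuser):
hadoop fs -chown -R pferrel /user/pferrel

# or disable permission checking in the NameNode's hdfs-site.xml and restart the NN:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>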

-- 
Harsh J


Fair Scheduler Problem

2012-03-06 Thread hao.wang
Hi, All,
I encountered a problem using Cloudera Hadoop 0.20.2-cdh3u1. When I use 
the Fair Scheduler, I find that the scheduler seems not to support preemption.
Can anybody tell me whether preemption is supported in this version?
This is my configuration:

mapred-site.xml:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/lib/hadoop-0.20/conf/fair-scheduler.xml</value>
</property>
<property>
  <name>mapred.fairscheduler.preemption</name>
  <value>true</value>
</property>
<property>
  <name>mapred.fairscheduler.preemption.only.log</name>
  <value>true</value>
</property>
<property>
  <name>mapred.fairscheduler.preemption.interval</name>
  <value>15000</value>
</property>
<property>
  <name>mapred.fairscheduler.weightadjuster</name>
  <value>org.apache.hadoop.mapred.NewJobWeightBooster</value>
</property>
<property>
  <name>mapred.fairscheduler.sizebasedweight</name>
  <value>true</value>
</property>
fair-scheduler.xml:
<allocations>
  <pool name="root">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxMaps>200</maxMaps>
    <maxReduces>80</maxReduces>
    <maxRunningJobs>100</maxRunningJobs>
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>
  <pool name="hadoop">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxMaps>80</maxMaps>
    <maxReduces>80</maxReduces>
    <maxRunningJobs>5</maxRunningJobs>
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>
  <user name="user1">
    <maxRunningJobs>10</maxRunningJobs>
  </user>
  <poolMaxJobsDefault>20</poolMaxJobsDefault>
  <userMaxJobsDefault>10</userMaxJobsDefault>
  <defaultMinSharePreemptionTimeout>30</defaultMinSharePreemptionTimeout>
  <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
</allocations>

regards,

2012-03-07 



hao.wang 


Re: getting NullPointerException while running Word count example

2012-03-06 Thread Sujit Dhamale
Hadoop version : hadoop-0.20.203.0rc1.tar
Operating System : Ubuntu 11.10


On Wed, Mar 7, 2012 at 12:19 AM, Harsh J ha...@cloudera.com wrote:

 Hi Sujit,

 Please also tell us which version/distribution of Hadoop is this?

 On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com
 wrote:
  Hi,
 
  I am new to Hadoop., i install Hadoop as per
 
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
 
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluste
 
 
 
  while running Word cont example i am getting *NullPointerException
 
  *can some one please look in to this issue ?*
 
  *Thanks in Advance*  !!!
 
  *
 
 
  duser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
  Found 3 items
  -rw-r--r--   1 hduser supergroup 674566 2012-03-06 23:04
  /user/hduser/data/pg20417.txt
  -rw-r--r--   1 hduser supergroup1573150 2012-03-06 23:04
  /user/hduser/data/pg4300.txt
  -rw-r--r--   1 hduser supergroup1423801 2012-03-06 23:04
  /user/hduser/data/pg5000.txt
 
  hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
  wordcount /user/hduser/data /user/hduser/gutenberg-outputd
 
  12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to
 process
  : 3
  12/03/06 23:14:33 INFO mapred.JobClient: Running job:
 job_201203062221_0002
  12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
  12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
  12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
  12/03/06 23:14:58 INFO mapred.JobClient: Task Id :
  attempt_201203062221_0002_r_00_0, Status : FAILED
  Error: java.lang.NullPointerException
 at
  java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
 
  12/03/06 23:15:07 INFO mapred.JobClient: Task Id :
  attempt_201203062221_0002_r_00_1, Status : FAILED
  Error: java.lang.NullPointerException
 at
  java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
 
  12/03/06 23:15:16 INFO mapred.JobClient: Task Id :
  attempt_201203062221_0002_r_00_2, Status : FAILED
  Error: java.lang.NullPointerException
 at
  java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
 
  12/03/06 23:15:31 INFO mapred.JobClient: Job complete:
 job_201203062221_0002
  12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
  12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
  12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4
  12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084
  12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all
  reduces waiting after reserving slots (ms)=0
  12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all maps
  waiting after reserving slots (ms)=0
  12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3
  12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3
  12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1
  12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799
  12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
  12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520
  12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863
  12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287
  12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
  12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517
  12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
  12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized
  bytes=1474341
  12/03/06 23:15:31 INFO mapred.JobClient: Combine output
 records=102322
  12/03/06 23:15:31 INFO mapred.JobClient: Map input records=77932
  12/03/06 23:15:31 INFO mapred.JobClient: Spilled Records=153640
  12/03/06 23:15:31 INFO mapred.JobClient: Map output bytes=6076095
  12/03/06 23:15:31 INFO mapred.JobClient: Combine input records=629172
  12/03/06 23:15:31 INFO mapred.JobClient: Map output records=629172
  12/03/06 23:15:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=346
  hduser@sujit:~/Desktop/hadoop$



 --
 Harsh J



Re: Fair Scheduler Problem

2012-03-06 Thread Harsh J
Hello Hao,

It's best to submit CDH user queries to
https://groups.google.com/a/cloudera.org/group/cdh-user/topics
(cdh-u...@cloudera.org), where the majority of the CDH user community
resides.

How do you determine that preemption did not/does not work? Preemption
between pools occurs if a pool's minShare isn't satisfied within
preemption-timeout seconds. In this case, it will preempt tasks from
other pools.

Your settings look alright on a high level. Does your log not carry
any preemption logs? What was your pool's share scenario when you
tried to observe if it works or not?

On Wed, Mar 7, 2012 at 8:35 AM, hao.wang hao.w...@ipinyou.com wrote:
 Hi ,All,
    I encountered a problem in using Cloudera Hadoop 0.20.2-cdh3u1. When I use 
 the fair Scheduler I find the scheduler seems  not support preemption.
    Can anybody tell me whether preemption is supported in this version?
    This is my configration:
  mapred-site.xml
 property
  namemapred.jobtracker.taskScheduler/name
  valueorg.apache.hadoop.mapred.FairScheduler/value
 /property
 property
      namemapred.fairscheduler.allocation.file/name
      value/usr/lib/hadoop-0.20/conf/fair-scheduler.xml/value
 /property
 property
 namemapred.fairscheduler.preemption/name
 valuetrue/value
 /property
 property
 namemapred.fairscheduler.preemption.only.log/name
 valuetrue/value
 /property
 property
 namemapred.fairscheduler.preemption.interval/name
 value15000/value
 /property
 property
  namemapred.fairscheduler.weightadjuster/name
  valueorg.apache.hadoop.mapred.NewJobWeightBooster/value
 /property
 property
  namemapred.fairscheduler.sizebasedweight/name
  valuetrue/value
 /property
 fair-scheduler.xml
 allocations
   pool name=root
      minMaps10/minMaps
    minReduces5/minReduces
    maxMaps200/maxMaps
   maxReduces80/maxReduces
       maxRunningJobs100/maxRunningJobs
      minSharePreemptionTimeout30/minSharePreemptionTimeout
        weight1.0/weight
  /pool
  pool name=hadoop
       minMaps10/minMaps
    minReduces5/minReduces
   maxMaps80/maxMaps
   maxReduces80/maxReduces
        maxRunningJobs5/maxRunningJobs
       minSharePreemptionTimeout30/minSharePreemptionTimeout
       weight1.0/weight
  /pool
  user name=user1
       maxRunningJobs10/maxRunningJobs
  /user
    poolMaxJobsDefault20/poolMaxJobsDefault
   userMaxJobsDefault10/userMaxJobsDefault
   defaultMinSharePreemptionTimeout30/defaultMinSharePreemptionTimeout
   fairSharePreemptionTimeout30/fairSharePreemptionTimeout
 /allocations

 regards,

 2012-03-07



 hao.wang



-- 
Harsh J


Re: Re: Fair Scheduler Problem

2012-03-06 Thread hao.wang
Hi, thanks for your reply!
I have solved this problem by setting mapred.fairscheduler.preemption.only.log 
to false. The preemption works!
But I don't know why I cannot set mapred.fairscheduler.preemption.only.log to 
true. Is it a bug?

regards,

2012-03-07 



hao.wang 



From: Harsh J 
Sent: 2012-03-07 14:14:05 
To: common-user 
Cc: 
Subject: Re: Fair Scheduler Problem 
 
Hello Hao,
Its best to submit CDH user queries to
https://groups.google.com/a/cloudera.org/group/cdh-user/topics
(cdh-u...@cloudera.org) where the majority of CDH users community
resides.
How do you determine that preemption did not/does not work? Preemption
between pools occurs if a pool's minShare isn't satisfied within
preemption-timeout seconds. In this case, it will preempt tasks from
other pools.
Your settings look alright on a high level. Does your log not carry
any preemption logs? What was your pool's share scenario when you
tried to observe if it works or not?
On Wed, Mar 7, 2012 at 8:35 AM, hao.wang hao.w...@ipinyou.com wrote:
 Hi ,All,
I encountered a problem in using Cloudera Hadoop 0.20.2-cdh3u1. When I use 
 the fair Scheduler I find the scheduler seems  not support preemption.
Can anybody tell me whether preemption is supported in this version?
This is my configration:
  mapred-site.xml
 property
  namemapred.jobtracker.taskScheduler/name
  valueorg.apache.hadoop.mapred.FairScheduler/value
 /property
 property
  namemapred.fairscheduler.allocation.file/name
  value/usr/lib/hadoop-0.20/conf/fair-scheduler.xml/value
 /property
 property
 namemapred.fairscheduler.preemption/name
 valuetrue/value
 /property
 property
 namemapred.fairscheduler.preemption.only.log/name
 valuetrue/value
 /property
 property
 namemapred.fairscheduler.preemption.interval/name
 value15000/value
 /property
 property
  namemapred.fairscheduler.weightadjuster/name
  valueorg.apache.hadoop.mapred.NewJobWeightBooster/value
 /property
 property
  namemapred.fairscheduler.sizebasedweight/name
  valuetrue/value
 /property
 fair-scheduler.xml
 allocations
   pool name=root
  minMaps10/minMaps
minReduces5/minReduces
maxMaps200/maxMaps
   maxReduces80/maxReduces
   maxRunningJobs100/maxRunningJobs
  minSharePreemptionTimeout30/minSharePreemptionTimeout
weight1.0/weight
  /pool
  pool name=hadoop
   minMaps10/minMaps
minReduces5/minReduces
   maxMaps80/maxMaps
   maxReduces80/maxReduces
maxRunningJobs5/maxRunningJobs
   minSharePreemptionTimeout30/minSharePreemptionTimeout
   weight1.0/weight
  /pool
  user name=user1
   maxRunningJobs10/maxRunningJobs
  /user
poolMaxJobsDefault20/poolMaxJobsDefault
   userMaxJobsDefault10/userMaxJobsDefault
   defaultMinSharePreemptionTimeout30/defaultMinSharePreemptionTimeout
   fairSharePreemptionTimeout30/fairSharePreemptionTimeout
 /allocations

 regards,

 2012-03-07



 hao.wang
-- 
Harsh J


Re: Re: Fair Scheduler Problem

2012-03-06 Thread Harsh J
Ah my bad that I missed it when reading your doc.

Yes, with that property set to true the scheduler will only LOG about
preemption scenarios, not actually preempt.
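
In other words, preemption stays enabled and the log-only flag gets turned
off, i.e. in mapred-site.xml:

<property>
  <name>mapred.fairscheduler.preemption.only.log</name>
  <value>false</value>
</property>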

On Wed, Mar 7, 2012 at 12:05 PM, hao.wang hao.w...@ipinyou.com wrote:
 Hi, Thanks for your reply!
 I have solved this problem by setting
 mapred.fairscheduler.preemption.only.log  to false. The preemption
 works!
 But I don't know why can not set mapred.fairscheduler.preemption.only.log 
 to true. Is it a bug?

 regards,



 2012-03-07
 
 hao.wang
 
 From: Harsh J
 Sent: 2012-03-07  14:14:05
 To: common-user
 Cc:
 Subject: Re: Fair Scheduler Problem
 Hello Hao,
 Its best to submit CDH user queries to
 https://groups.google.com/a/cloudera.org/group/cdh-user/topics
 (cdh-u...@cloudera.org) where the majority of CDH users community
 resides.
 How do you determine that preemption did not/does not work? Preemption
 between pools occurs if a pool's minShare isn't satisfied within
 preemption-timeout seconds. In this case, it will preempt tasks from
 other pools.
 Your settings look alright on a high level. Does your log not carry
 any preemption logs? What was your pool's share scenario when you
 tried to observe if it works or not?
 On Wed, Mar 7, 2012 at 8:35 AM, hao.wang hao.w...@ipinyou.com wrote:
 Hi ,All,
I encountered a problem in using Cloudera Hadoop 0.20.2-cdh3u1. When I 
 use the fair Scheduler I find the scheduler seems  not support preemption.
Can anybody tell me whether preemption is supported in this version?
This is my configration:
  mapred-site.xml
 property
  namemapred.jobtracker.taskScheduler/name
  valueorg.apache.hadoop.mapred.FairScheduler/value
 /property
 property
  namemapred.fairscheduler.allocation.file/name
  value/usr/lib/hadoop-0.20/conf/fair-scheduler.xml/value
 /property
 property
 namemapred.fairscheduler.preemption/name
 valuetrue/value
 /property
 property
 namemapred.fairscheduler.preemption.only.log/name
 valuetrue/value
 /property
 property
 namemapred.fairscheduler.preemption.interval/name
 value15000/value
 /property
 property
  namemapred.fairscheduler.weightadjuster/name
  valueorg.apache.hadoop.mapred.NewJobWeightBooster/value
 /property
 property
  namemapred.fairscheduler.sizebasedweight/name
  valuetrue/value
 /property
 fair-scheduler.xml
 allocations
   pool name=root
  minMaps10/minMaps
minReduces5/minReduces
maxMaps200/maxMaps
   maxReduces80/maxReduces
   maxRunningJobs100/maxRunningJobs
  minSharePreemptionTimeout30/minSharePreemptionTimeout
weight1.0/weight
  /pool
  pool name=hadoop
   minMaps10/minMaps
minReduces5/minReduces
   maxMaps80/maxMaps
   maxReduces80/maxReduces
maxRunningJobs5/maxRunningJobs
   minSharePreemptionTimeout30/minSharePreemptionTimeout
   weight1.0/weight
  /pool
  user name=user1
   maxRunningJobs10/maxRunningJobs
  /user
poolMaxJobsDefault20/poolMaxJobsDefault
   userMaxJobsDefault10/userMaxJobsDefault
   defaultMinSharePreemptionTimeout30/defaultMinSharePreemptionTimeout
   fairSharePreemptionTimeout30/fairSharePreemptionTimeout
 /allocations

 regards,

 2012-03-07



 hao.wang
 --
 Harsh J



-- 
Harsh J


Re: how is userlogs supposed to be cleaned up?

2012-03-06 Thread Joep Rottinghuis
Aside from cleanup, it seems like you are running into the maximum number of 
subdirectories per directory on ext3.

Joep

Sent from my iPhone

On Mar 6, 2012, at 10:22 AM, Chris Curtin curtin.ch...@gmail.com wrote:

 Hi,
 
 We had a fun morning trying to figure out why our cluster was failing jobs,
 removing nodes from the cluster etc. The majority of the errors were
 something like:
 
 
 Error initializing attempt_201203061035_0047_m_02_0:
 
 org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access
 `/disk1/userlogs/job_201203061035_0047': No such file or directory
 
 
 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
 
at org.apache.hadoop.util.Shell.run(Shell.java:182)
 
at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
 
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
 
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
 
at
 org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533)
 
at
 org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524)
 
at
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
 
at
 org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
 
at
 org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216)
 
at
 org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352)
 
 
 
 Finally we shutdown the entire cluster and found that the 'userlogs'
 directory on the failed nodes had 30,000+ directories and the 'live' nodes
  25,000+. Looking at creation timestamps, it looks like the node falls over
  around the time the 30,000th directory is added.
 
 
 
  Many of the directories are weeks old and a few were months old.
 
 
 
 Deleting ALL the directories on all the nodes allowed us to bring the
 cluster up and things to run again. (Some users are claiming it is running
 faster now?)
 
 
 
 Our question: what is supposed to be cleaning up these directories? How
 often is that process or step taken?
 
 
 
 We are running CDH3u3.
 
 
 
 Thanks,
 
 
 
 Chris