Pangool: easier Hadoop, same performance
Hi, I'd like to introduce you to Pangool http://pangool.net/, an easier low-level MapReduce API for Hadoop. I'm one of the developers. We just open-sourced it yesterday. Pangool is a Java, low-level MapReduce API with the same flexibility and performance as the plain Java Hadoop MapReduce API. The difference is that it makes a lot of things easier to code and understand. A few of Pangool's features:
- Tuple-based intermediate serialization (allowing easier development).
- Built-in, easy-to-use group by and sort by (removing boilerplate code for things like secondary sort).
- Built-in, easy-to-use reduce-side joins (which are quite hard to implement in Hadoop).
- Augmented Hadoop API: built-in multiple inputs / outputs, configuration via object instance.
Pangool aims to make Hadoop's steep learning curve a lot smoother while retaining all of its features, power and flexibility. It differs from high-level tools like Pig or Hive in that it can be used as a replacement for the low-level API. There is no performance or flexibility penalty for using Pangool. We did an initial benchmark http://pangool.net/benchmark.html to demonstrate this. I'd be very interested in hearing your feedback, opinions and questions. Cheers, Pere.
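To make the secondary-sort point concrete: with the plain Hadoop API, grouping by one field while sorting by another typically means hand-writing a composite key plus a custom partitioner and grouping comparator. The sketch below shows that boilerplate using the plain MapReduce API, not Pangool's; the class and field names are illustrative only.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Composite key: the natural (grouping) key plus a secondary field to sort on.
    class CompositeKey implements WritableComparable<CompositeKey> {
        private Text group = new Text();
        private LongWritable order = new LongWritable();

        public void set(String g, long o) { group.set(g); order.set(o); }
        public Text getGroup() { return group; }

        public void write(DataOutput out) throws IOException { group.write(out); order.write(out); }
        public void readFields(DataInput in) throws IOException { group.readFields(in); order.readFields(in); }

        // Full ordering: by group first, then by the secondary field.
        public int compareTo(CompositeKey o) {
            int cmp = group.compareTo(o.group);
            return cmp != 0 ? cmp : order.compareTo(o.order);
        }
    }

    // Partition on the natural key only, so every record of a group reaches the same reducer.
    class GroupPartitioner extends Partitioner<CompositeKey, Text> {
        public int getPartition(CompositeKey key, Text value, int numPartitions) {
            return (key.getGroup().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group reducer input on the natural key only, ignoring the secondary field.
    class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(CompositeKey.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            return ((CompositeKey) a).getGroup().compareTo(((CompositeKey) b).getGroup());
        }
    }

    // Wiring in the driver:
    //   job.setPartitionerClass(GroupPartitioner.class);
    //   job.setGroupingComparatorClass(GroupComparator.class);

Pangool's claim is that its group-by / sort-by settings replace all of the above with a few builder calls; see the project site for the actual API.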
Re: why does my mapper class read my input file twice?
Harsh, Thanks. I went into the code on FileInputFormat.addInputPath(Job, Path) and it is as you stated. That makes sense now. I simply commented out FileInputFormat.addInputPath(job, input) and FileOutputFormat.setOutputPath(job, output) and everything automagically works now. Thanks a bunch!

On Tue, Mar 6, 2012 at 2:06 AM, Harsh J ha...@cloudera.com wrote: It's your use of the mapred.input.dir property, which is a reserved name in the framework (it's what FileInputFormat uses). You have a config you extract the path from: Path input = new Path(conf.get("mapred.input.dir")); Then you do: FileInputFormat.addInputPath(job, input); which, internally, simply appends a path to a config prop called mapred.input.dir. Hence your job gets launched with two input files (the very same) - one added by the default Tool-provided configuration (cause of your -Dmapred.input.dir) and the other added by you. Fix the input path line to use a different config: Path input = new Path(conf.get("input.path")); and run the job as: hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result

On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne jane.wayne2...@gmail.com wrote: i have code that reads in a text file. i notice that each line in the text file is somehow being read twice. why is this happening? my mapper class looks like the following:

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final Log _log = LogFactory.getLog(MyMapper.class);
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String s = (new StringBuilder()).append(value.toString()).append("m").toString();
        context.write(key, new Text(s));
        _log.debug(key.toString() + "=" + s);
    }
}

my reducer class looks like the following:

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    private static final Log _log = LogFactory.getLog(MyReducer.class);
    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Iterator<Text> it = values.iterator(); it.hasNext();) {
            Text txt = it.next();
            String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
            context.write(key, new Text(s));
            _log.debug(key.toString() + "=" + s);
        }
    }
}

my job class looks like the following:

public class MyJob extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new MyJob(), args);
    }
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path input = new Path(conf.get("mapred.input.dir"));
        Path output = new Path(conf.get("mapred.output.dir"));
        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.setJarByClass(MyJob.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

the text file that i am trying to read in looks like the following. as you can see, there are 9 lines.

T, T
T, T
T, T
F, F
F, F
F, F
F, F
T, F
F, T

the output file that i get after my Job runs looks like the following. as you can see, there are 18 lines. each key is emitted twice from the mapper to the reducer.
0 T, Tmr
0 T, Tmr
6 T, Tmr
6 T, Tmr
12 T, Tmr
12 T, Tmr
18 F, Fmr
18 F, Fmr
24 F, Fmr
24 F, Fmr
30 F, Fmr
30 F, Fmr
36 F, Fmr
36 F, Fmr
42 T, Fmr
42 T, Fmr
48 F, Tmr
48 F, Tmr

the way i execute my Job is as follows (cygwin + hadoop 0.20.2): hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt -Dmapred.output.dir=result originally, this happened when i read in a sequence file, but even for a text file, this problem is still happening. is it the way i have set up my Job? -- Harsh J
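Following Harsh's explanation above, a minimal sketch of the corrected run() method for the MyJob class quoted in this thread. The only change is reading the input from a non-reserved property (input.path, as in his example) so that FileInputFormat.addInputPath() no longer appends the same path the Tool-provided -Dmapred.input.dir already set:

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // "input.path" is not a reserved framework name, so reading it here and
        // passing it to addInputPath() adds the input exactly once.
        Path input = new Path(conf.get("input.path"));
        // mapred.output.dir is only set (not appended) by setOutputPath(), so it is safe to reuse.
        Path output = new Path(conf.get("mapred.output.dir"));

        Job job = new Job(conf, "dummy job");
        job.setJarByClass(MyJob.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        return job.waitForCompletion(true) ? 0 : 1;
    }

Launched as in Harsh's reply: hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result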
Re: is there any way to detect the file size as i am writing a sequence file?
I think you mean Writer.getLength(). It returns the current position in the output stream in bytes (more or less the current size of the file). -Joey On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne jane.wayne2...@gmail.com wrote: hi, i am writing a little util class to recurse into a directory and add all *.txt files into a sequence file (key is the file name, value is the content of the corresponding text file). as i am writing (i.e. SequenceFile.Writer.append(key, value)), is there any way to detect how large the sequence file is? for example, i want to create a new sequence file as soon as the current one exceeds 64 MB. i notice there is a SequenceFile.Writer.getLong() which the javadocs says returns the current length of the output file, but that is vague. what is this Writer.getLong() method? is it the number of bytes, kilobytes, megabytes, or something else? thanks, -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Re: is there any way to detect the file size as i am writing a sequence file?
Thanks Joey. That's what I meant (I've been staring at the screen too long). :) On Tue, Mar 6, 2012 at 10:00 AM, Joey Echeverria j...@cloudera.com wrote: I think you mean Writer.getLength(). It returns the current position in the output stream in bytes (more or less the current size of the file). -Joey On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne jane.wayne2...@gmail.com wrote: hi, i am writing a little util class to recurse into a directory and add all *.txt files into a sequence file (key is the file name, value is the content of the corresponding text file). as i am writing (i.e. SequenceFile.Writer.append(key, value)), is there any way to detect how large the sequence file is? for example, i want to create a new sequence file as soon as the current one exceeds 64 MB. i notice there is a SequenceFile.Writer.getLong() which the javadocs says returns the current length of the output file, but that is vague. what is this Writer.getLong() method? is it the number of bytes, kilobytes, megabytes, or something else? thanks, -- Joseph Echeverria Cloudera, Inc. 443.305.9434
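A minimal sketch of the rolling-file idea from this thread, using SequenceFile.Writer.getLength() (the byte position Joey describes) to start a new file once roughly 64 MB has been written. The directory layout, part-file naming and Text key/value types are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RollingSequenceFileWriter {
        private static final long MAX_BYTES = 64L * 1024 * 1024; // roll at ~64 MB

        private final FileSystem fs;
        private final Configuration conf;
        private final Path dir;
        private int part = 0;
        private SequenceFile.Writer writer;

        public RollingSequenceFileWriter(FileSystem fs, Configuration conf, Path dir) throws IOException {
            this.fs = fs;
            this.conf = conf;
            this.dir = dir;
            this.writer = newWriter();
        }

        private SequenceFile.Writer newWriter() throws IOException {
            Path p = new Path(dir, String.format("part-%05d.seq", part++));
            return SequenceFile.createWriter(fs, conf, p, Text.class, Text.class);
        }

        public void append(Text key, Text value) throws IOException {
            // getLength() reports the current position in the output stream in bytes.
            if (writer.getLength() >= MAX_BYTES) {
                writer.close();
                writer = newWriter();
            }
            writer.append(key, value);
        }

        public void close() throws IOException {
            writer.close();
        }
    }

Since the size check happens before each append, a file can end up slightly over the threshold by one record's worth of data, which is usually fine for this kind of rough 64 MB split.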
Re: how to get rid of -libjars ?
On 06.03.2012 17:37, Jane Wayne wrote: currently, i have my main jar and then 2 dependent jars. what i do is 1. copy dependent-1.jar to $HADOOP/lib 2. copy dependent-2.jar to $HADOOP/lib then, when i need to run my job, MyJob inside main.jar, i do the following. hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path what i want to do is NOT copy the dependent jars to $HADOOP/lib and not have to always specify -libjars. is there any way around this multi-step procedure? i really do not want to clutter $HADOOP/lib or specify a comma-delimited list of jars for -libjars. any help is appreciated. Hello, Specify the full path to the jar on the -libjars? My experience with -libjars is that it didn't work as advertised. Search for an older post on the list about this issue (-libjars not working). I tried adding a lot of jars and some got on the job classpath (2), some didn't (most of them). I got over this by including all the jars in a lib directory inside the main jar. Cheers, -- Ioan Eugen Stan http://ieugen.blogspot.com
Re: how to get rid of -libjars ?
Hi Jane, Adding on to Joey's comments: if you want to eliminate the process of distributing the dependent jars every time, then you need to manually pre-distribute these jars across the nodes and add them to the classpath of all nodes. This approach may be chosen if you are periodically running some job at a greater frequency on your cluster that needs external jars. Regards Bejoy.K.S On Tue, Mar 6, 2012 at 9:23 PM, Joey Echeverria j...@cloudera.com wrote: If you're using -libjars, there's no reason to copy the jars into $HADOOP/lib. You may have to add the jars to the HADOOP_CLASSPATH if you use them from your main() method: export HADOOP_CLASSPATH=dependent-1.jar:dependent-2.jar hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path -Joey On Tue, Mar 6, 2012 at 10:37 AM, Jane Wayne jane.wayne2...@gmail.com wrote: currently, i have my main jar and then 2 dependent jars. what i do is 1. copy dependent-1.jar to $HADOOP/lib 2. copy dependent-2.jar to $HADOOP/lib then, when i need to run my job, MyJob inside main.jar, i do the following. hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path what i want to do is NOT copy the dependent jars to $HADOOP/lib and not have to always specify -libjars. is there any way around this multi-step procedure? i really do not want to clutter $HADOOP/lib or specify a comma-delimited list of jars for -libjars. any help is appreciated. -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Re: how to get rid of -libjars ?
1. Wrap all your jar files inside your artifact; they should be under a lib folder. Sometimes this can make your jar file quite big; if you want to save time uploading big jar files remotely, see 2. 2. Using -libjars with a full path or a relative path (w.r.t. your jar package) should work. On 3/6/2012 9:55 AM, Ioan Eugen Stan wrote: On 06.03.2012 17:37, Jane Wayne wrote: currently, i have my main jar and then 2 dependent jars. what i do is 1. copy dependent-1.jar to $HADOOP/lib 2. copy dependent-2.jar to $HADOOP/lib then, when i need to run my job, MyJob inside main.jar, i do the following. hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path what i want to do is NOT copy the dependent jars to $HADOOP/lib and not have to always specify -libjars. is there any way around this multi-step procedure? i really do not want to clutter $HADOOP/lib or specify a comma-delimited list of jars for -libjars. any help is appreciated. Hello, Specify the full path to the jar on the -libjars? My experience with -libjars is that it didn't work as advertised. Search for an older post on the list about this issue (-libjars not working). I tried adding a lot of jars and some got on the job classpath (2), some didn't (most of them). I got over this by including all the jars in a lib directory inside the main jar. Cheers,
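A general note on -libjars appearing not to work: the flag is handled by GenericOptionsParser, so it only takes effect when the job's main class runs through ToolRunner (or parses the generic options itself). Whether that was the cause in the case above is not certain, but a minimal Tool-based driver looks like this (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already reflects -D, -libjars, -files, etc., because
            // ToolRunner ran GenericOptionsParser before calling run().
            Configuration conf = getConf();
            // ... build and submit the Job using conf ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
        }
    }

Jars passed via -libjars are shipped to the tasks through the distributed cache; they are not automatically on the client-side classpath, which is why Joey's HADOOP_CLASSPATH export is still needed when the driver's own main() uses classes from those jars.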
HDFS Reporting Tools
Dear All, We are maintaining a 60-node hadoop cluster for external users, and would like to be automatically notified via email when an HDFS crash or some other infrastructure failure occurs that is not due to a user programming error. We've been encountering soft errors, where hadoop does not crash but becomes very slow, and jobs hang for a long time and then fail. Are there existing tools that provide this capability? Or do we have to manually monitor the web interfaces at http://namenode and http://namenode:50030? Thank you so much, Oren -- We plan ahead, which means we don't do anything right now. -- Valentine (Tremors)
Hadoop EC2 user-data script
Hi, I am new to Hadoop. I want to try a Hadoop installation using OpenStack. The OpenStack API for launching an instance (VM) has a parameter for passing user-data. Here we can pass scripts which will be executed on first boot. This is similar to EC2 user-data. I would like to know about the hadoop user-data script. Any help on this is appreciated. Thanks in advance. Regards, Sagar
Re: Java Heap space error
I am still trying to see how to narrow this down. Is it possible to set the -XX:+HeapDumpOnOutOfMemoryError option on these individual tasks?

On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Sorry for multiple emails. I did find: 2012-03-05 17:26:35,636 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Usage threshold init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:35,719 INFO org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 7816154 bytes from 1 objects. init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:36,881 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 715849728(699072K) used = 358720384(350312K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:39) at java.nio.CharBuffer.allocate(CharBuffer.java:312) at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760) at org.apache.hadoop.io.Text.decode(Text.java:350) at org.apache.hadoop.io.Text.decode(Text.java:327) at org.apache.hadoop.io.Text.toString(Text.java:254) at org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105) at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264)

On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.com wrote: All I see in the logs is: 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space Looks like the task tracker is killing the tasks. Not sure why. I increased heap from 512 to 1G and still it fails.

On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I currently have java.opts.mapred set to 512MB and I am getting heap space errors. How should I go about debugging heap space issues?
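On the heap-dump question above: per-task JVM flags go into mapred.child.java.opts, so a dump-on-OOM flag can be added there, either per job or in mapred-site.xml. A sketch of setting it in the job driver; the heap size and dump path are illustrative:

    Configuration conf = new Configuration();
    // Options appended to each child task JVM. -XX:+HeapDumpOnOutOfMemoryError makes
    // the JVM write an .hprof file when a task dies with an OutOfMemoryError.
    conf.set("mapred.child.java.opts",
             "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/task-dumps");

The same value can be passed on the command line as -Dmapred.child.java.opts="..." for a single run; the dump then has to be retrieved from the node that ran the failing attempt.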
hadoop cluster ssh username
I have a small cluster of servers that runs hadoop. I have a laptop from which I'd like to use that cluster when it is available. I set up hadoop on the laptop so I can switch from running locally to running on the cluster. Local works. I have set up passwordless ssh between all machines to work with 'hadoop-user', which is the linux username on the cluster machines, so I can ssh from the laptop to the servers without a password thusly: ssh hadoop-user@master But my username on the laptop is pferrel, not hadoop-user, so when running 'start-all.sh' it tries ssh pferrel@master How do I tell it to use the linux user 'hadoop-user'? I assume there is something in the config directory xml files that will do this?
Hadoop error=12 Cannot allocate memory
Hi, I have a hadoop cluster of size 5 and data of size 1GB. I am running a simple map reduce program which reads text data and outputs sequence files. I found some solutions to this problem suggesting to set overcommit to 0 and to increase the ulimit. I have memory overcommit set to 0 and have ulimit unlimited. Even with this, I keep getting the following error. Is anyone aware of any workarounds for this? java.io.IOException: Cannot run program bash: java.io.IOException: Error: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.util.Shell.runCommand(Shell.java:149) at org.apache.hadoop.util.Shell.run(Shell.java:134) at org.apache.hadoop.fs.DF.getAvailable(DF.java:73) at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124) at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) Thanks -Rohini
Re: hadoop cluster ssh username
Pat, Your question seems to be in multiple parts, if I am right? 1. How do you manage configuration so that switching between local and wider-cluster mode both work? My suggestion would be to create two git branches in your conf directory and switch them as you need, with simple git checkouts. 2. How do you get the start/stop scripts to ssh as hadoop-user instead of using your user name? In your masters and slaves files, instead of placing just a list of hostnames, place hadoop-user@hostnames. That should do the trick. If you want your SSH itself to use a different username when being asked to connect to a hostname, follow the per-host configuration to specify a username to automatically pick when provided a hostname: http://technosophos.com/content/ssh-host-configuration On Tue, Mar 6, 2012 at 11:42 PM, Pat Ferrel p...@occamsmachete.com wrote: I have a small cluster of servers that runs hadoop. I have a laptop from which I'd like to use that cluster when it is available. I set up hadoop on the laptop so I can switch from running locally to running on the cluster. Local works. I have set up passwordless ssh between all machines to work with 'hadoop-user', which is the linux username on the cluster machines, so I can ssh from the laptop to the servers without a password thusly: ssh hadoop-user@master But my username on the laptop is pferrel, not hadoop-user, so when running 'start-all.sh' it tries ssh pferrel@master How do I tell it to use the linux user 'hadoop-user'? I assume there is something in the config directory xml files that will do this? -- Harsh J
Re: getting NullPointerException while running word count example
Hi Sujit, Please also tell us which version/distribution of Hadoop this is? On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, I am new to Hadoop. i installed Hadoop as per http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ while running the word count example i am getting NullPointerException. can some one please look into this issue? Thanks in advance !!! hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data Found 3 items -rw-r--r-- 1 hduser supergroup 674566 2012-03-06 23:04 /user/hduser/data/pg20417.txt -rw-r--r-- 1 hduser supergroup 1573150 2012-03-06 23:04 /user/hduser/data/pg4300.txt -rw-r--r-- 1 hduser supergroup 1423801 2012-03-06 23:04 /user/hduser/data/pg5000.txt hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/data /user/hduser/gutenberg-outputd 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process : 3 12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002 12/03/06 23:14:34 INFO mapred.JobClient: map 0% reduce 0% 12/03/06 23:14:49 INFO mapred.JobClient: map 66% reduce 0% 12/03/06 23:14:55 INFO mapred.JobClient: map 100% reduce 0% 12/03/06 23:14:58 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_0, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:07 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_1, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:16 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_2, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20 12/03/06 23:15:31 INFO mapred.JobClient: Job Counters 12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3 12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3 12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799 12/03/06 23:15:31 INFO mapred.JobClient: FileSystemCounters 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520 12/03/06 23:15:31 INFO mapred.JobClient: 
HDFS_BYTES_READ=3671863 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287 12/03/06 23:15:31 INFO mapred.JobClient: File Input Format Counters 12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517 12/03/06 23:15:31 INFO mapred.JobClient: Map-Reduce Framework 12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized bytes=1474341 12/03/06 23:15:31 INFO mapred.JobClient: Combine output records=102322 12/03/06 23:15:31 INFO mapred.JobClient: Map input records=77932 12/03/06 23:15:31 INFO mapred.JobClient: Spilled Records=153640 12/03/06 23:15:31 INFO mapred.JobClient: Map output bytes=6076095 12/03/06 23:15:31 INFO mapred.JobClient: Combine input records=629172 12/03/06 23:15:31 INFO mapred.JobClient: Map output records=629172 12/03/06 23:15:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=346 hduser@sujit:~/Desktop/hadoop$ -- Harsh J
Hadoop runtime metrics
Hi All, We have a medium cluster running 70 nodes, using 0.20.2-cdh3u1. We collect run-time metrics through Ganglia. We found that certain metrics like waiting_reduces and tasks_failed_timeout are high, and it looks like the values are cumulative. Any thoughts on this will be helpful. Thanks
Re: how is userlogs supposed to be cleaned up?
On Mar 6, 2012, at 10:22 AM, Chris Curtin wrote: Hi, We had a fun morning trying to figure out why our cluster was failing jobs, removing nodes from the cluster, etc. The majority of the errors were something like: [snip] We are running CDH3u3. You'll need to check with the CDH lists. However, hadoop-1.0 (and prior releases, starting with hadoop-0.20.203) has mechanisms to clean up userlogs automatically; else, as you've found out, operating large clusters (4k nodes) with millions of jobs per month is too painful. Arun -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Hadoop runtime metrics
On Mar 6, 2012, at 10:54 AM, en-hui chang wrote: Hi All, We have a medium cluster running 70 nodes, using 0.20.2-cdh3u1. We collect run-time metrics through Ganglia. We found that certain metrics like waiting_reduces and tasks_failed_timeout are high, and it looks like the values are cumulative. Any thoughts on this will be helpful. You'll need to check with the CDH lists. -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: HDFS Reporting Tools
You could set up tools like Ganglia or Nagios to monitor the cluster and send off alerts on events and issues. Within the Hadoop ecosystem, there are things like Vaidya and maybe Ambari (not sure, as I've not used it), and Splunk even has a new beta Shep/Splunk Hadoop monitoring app. Peter Jamack On 3/6/12 8:35 AM, Oren Livne li...@uchicago.edu wrote: Dear All, We are maintaining a 60-node hadoop cluster for external users, and would like to be automatically notified via email when an HDFS crash or some other infrastructure failure occurs that is not due to a user programming error. We've been encountering soft errors, where hadoop does not crash but becomes very slow, and jobs hang for a long time and then fail. Are there existing tools that provide this capability? Or do we have to manually monitor the web interfaces at http://namenode and http://namenode:50030? Thank you so much, Oren -- We plan ahead, which means we don't do anything right now. -- Valentine (Tremors)
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Rules of thumb IMO: You should be using Pig in place of MR jobs at all times when performance isn't absolutely crucial. Writing unnecessary MR is needless technical debt that you will regret as people are replaced and your organization scales. Pig gets it done in much less time. If you need faster jobs, then optimize your Pig, and if that doesn't work, put a single MAPREDUCE (http://pig.apache.org/docs/r0.9.2/basic.html#mapreduce) job at the bottleneck. Also, realize that it can be hard to actually beat Pig's performance without experience. Check that your MR job is actually faster than Pig at the same load before assuming you can do better than Pig. Streaming is good if your data doesn't easily map to tuples, you really like using the abstractions of your favorite language's MR library, or you are doing something weird like simulations/pure batch jobs (no MR). If you're doing a lot of joins and performance is a problem - consider doing fewer joins. I would strongly suggest that you prioritize de-normalizing and duplicating data over switching to raw MR jobs because Hive joins are slow. MapReduce is slow at joins. Programmer time is more valuable than machine time. If you're having to write tons of raw MR, then get more machines. On Fri, Mar 2, 2012 at 6:21 AM, Subir S subir.sasiku...@gmail.com wrote: On Fri, Mar 2, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote: On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote: Hello Folks, Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs? I do not see why you seek to compare these two. Pig offers a language that lets you write data-flow operations and runs these statements as a series of MR jobs for you automatically (making it a great tool to use to get data processing done really quickly, without bothering with code), while streaming is something you use to write non-Java, simple MR jobs. Both have their own purposes. Basically we are comparing these two to see the benefits and how much they help in improving the productive coding time, without jeopardizing the performance of MR jobs. Also there was a claim in our company that Pig performs better than Map Reduce jobs? Is this true? Are there any such benchmarks available? Pig _runs_ MR jobs. It does do job design (and some data) optimizations based on your queries, which is what may give it an edge over designing elaborate flows of plain MR jobs with tools like Oozie/JobControl (which takes more time to do). But regardless, Pig only makes it easy by doing the same thing with Pig Latin statements for you. I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become pretty slow with a lot of joins, which we can achieve faster by writing raw MR jobs. So with that context I was trying to see how Pig runs MR jobs. Like, for example, what kind of projects should consider Pig - say when we have a lot of joins, which take time to write with plain MR jobs. Thoughts? Thank you Harsh for your comments. They are helpful! -- Harsh J -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: hadoop cluster ssh username
Pat, On Wed, Mar 7, 2012 at 4:10 AM, Pat Ferrel p...@farfetchers.com wrote: Thanks, #2 below gets me partway. I can start-all.sh and stop-all.sh from the laptop and can fs -ls, but copying gives me: Maclaurin:mahout-distribution-0.6 pferrel$ fs -copyFromLocal wikipedia-seqfiles/ wikipedia-seqfiles/ 2012-03-06 13:45:04.225 java[7468:1903] Unable to load realm info from SCDynamicStore copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission denied: user=pferrel, access=WRITE, inode=user:pat:supergroup:rwxr-xr-x This seems like a totally different issue now, and deals with HDFS permissions, not cluster start/stop. Yes, you have some files created (or some daemons running) with username pat, while you now try to access as pferrel (your local user). You can't work around or evade this; you will need to fix it via hadoop fs -chmod/-chown and such. You can disable permissions if you do not need them, though: simply set dfs.permissions to false in the NameNode's hdfs-site.xml and restart the NN. -- Harsh J
Fair Scheduler Problem
Hi, All, I encountered a problem using Cloudera Hadoop 0.20.2-cdh3u1. When I use the Fair Scheduler, the scheduler does not seem to support preemption. Can anybody tell me whether preemption is supported in this version? This is my configuration:

mapred-site.xml:
<property><name>mapred.jobtracker.taskScheduler</name><value>org.apache.hadoop.mapred.FairScheduler</value></property>
<property><name>mapred.fairscheduler.allocation.file</name><value>/usr/lib/hadoop-0.20/conf/fair-scheduler.xml</value></property>
<property><name>mapred.fairscheduler.preemption</name><value>true</value></property>
<property><name>mapred.fairscheduler.preemption.only.log</name><value>true</value></property>
<property><name>mapred.fairscheduler.preemption.interval</name><value>15000</value></property>
<property><name>mapred.fairscheduler.weightadjuster</name><value>org.apache.hadoop.mapred.NewJobWeightBooster</value></property>
<property><name>mapred.fairscheduler.sizebasedweight</name><value>true</value></property>

fair-scheduler.xml:
<allocations>
  <pool name="root">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxMaps>200</maxMaps>
    <maxReduces>80</maxReduces>
    <maxRunningJobs>100</maxRunningJobs>
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>
  <pool name="hadoop">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxMaps>80</maxMaps>
    <maxReduces>80</maxReduces>
    <maxRunningJobs>5</maxRunningJobs>
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>
  <user name="user1">
    <maxRunningJobs>10</maxRunningJobs>
  </user>
  <poolMaxJobsDefault>20</poolMaxJobsDefault>
  <userMaxJobsDefault>10</userMaxJobsDefault>
  <defaultMinSharePreemptionTimeout>30</defaultMinSharePreemptionTimeout>
  <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
</allocations>

regards, 2012-03-07 hao.wang
Re: getting NullPointerException while running word count example
Hadoop version: hadoop-0.20.203.0rc1.tar Operating System: Ubuntu 11.10 On Wed, Mar 7, 2012 at 12:19 AM, Harsh J ha...@cloudera.com wrote: Hi Sujit, Please also tell us which version/distribution of Hadoop this is? On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, I am new to Hadoop. i installed Hadoop as per http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ while running the word count example i am getting NullPointerException. can some one please look into this issue? Thanks in advance !!! hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data Found 3 items -rw-r--r-- 1 hduser supergroup 674566 2012-03-06 23:04 /user/hduser/data/pg20417.txt -rw-r--r-- 1 hduser supergroup 1573150 2012-03-06 23:04 /user/hduser/data/pg4300.txt -rw-r--r-- 1 hduser supergroup 1423801 2012-03-06 23:04 /user/hduser/data/pg5000.txt hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/data /user/hduser/gutenberg-outputd 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process : 3 12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002 12/03/06 23:14:34 INFO mapred.JobClient: map 0% reduce 0% 12/03/06 23:14:49 INFO mapred.JobClient: map 66% reduce 0% 12/03/06 23:14:55 INFO mapred.JobClient: map 100% reduce 0% 12/03/06 23:14:58 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_0, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:07 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_1, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:16 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_2, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20 12/03/06 23:15:31 INFO mapred.JobClient: Job Counters 12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3 12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3 12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799 12/03/06 23:15:31 INFO 
mapred.JobClient: FileSystemCounters 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520 12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287 12/03/06 23:15:31 INFO mapred.JobClient: File Input Format Counters 12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517 12/03/06 23:15:31 INFO mapred.JobClient: Map-Reduce Framework 12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized bytes=1474341 12/03/06 23:15:31 INFO mapred.JobClient: Combine output records=102322 12/03/06 23:15:31 INFO mapred.JobClient: Map input records=77932 12/03/06 23:15:31 INFO mapred.JobClient: Spilled Records=153640 12/03/06 23:15:31 INFO mapred.JobClient: Map output bytes=6076095 12/03/06 23:15:31 INFO mapred.JobClient: Combine input records=629172 12/03/06 23:15:31 INFO mapred.JobClient: Map output records=629172 12/03/06 23:15:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=346 hduser@sujit:~/Desktop/hadoop$ -- Harsh J
Re: Fair Scheduler Problem
Hello Hao, It's best to submit CDH user queries to https://groups.google.com/a/cloudera.org/group/cdh-user/topics (cdh-u...@cloudera.org), where the majority of the CDH user community resides. How do you determine that preemption did not/does not work? Preemption between pools occurs if a pool's minShare isn't satisfied within preemption-timeout seconds. In this case, it will preempt tasks from other pools. Your settings look alright on a high level. Does your log not carry any preemption logs? What was your pool's share scenario when you tried to observe if it works or not? On Wed, Mar 7, 2012 at 8:35 AM, hao.wang hao.w...@ipinyou.com wrote: Hi, All, I encountered a problem using Cloudera Hadoop 0.20.2-cdh3u1. When I use the Fair Scheduler, the scheduler does not seem to support preemption. Can anybody tell me whether preemption is supported in this version? This is my configuration:

mapred-site.xml:
<property><name>mapred.jobtracker.taskScheduler</name><value>org.apache.hadoop.mapred.FairScheduler</value></property>
<property><name>mapred.fairscheduler.allocation.file</name><value>/usr/lib/hadoop-0.20/conf/fair-scheduler.xml</value></property>
<property><name>mapred.fairscheduler.preemption</name><value>true</value></property>
<property><name>mapred.fairscheduler.preemption.only.log</name><value>true</value></property>
<property><name>mapred.fairscheduler.preemption.interval</name><value>15000</value></property>
<property><name>mapred.fairscheduler.weightadjuster</name><value>org.apache.hadoop.mapred.NewJobWeightBooster</value></property>
<property><name>mapred.fairscheduler.sizebasedweight</name><value>true</value></property>

fair-scheduler.xml:
<allocations>
  <pool name="root">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxMaps>200</maxMaps>
    <maxReduces>80</maxReduces>
    <maxRunningJobs>100</maxRunningJobs>
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>
  <pool name="hadoop">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxMaps>80</maxMaps>
    <maxReduces>80</maxReduces>
    <maxRunningJobs>5</maxRunningJobs>
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>
  <user name="user1">
    <maxRunningJobs>10</maxRunningJobs>
  </user>
  <poolMaxJobsDefault>20</poolMaxJobsDefault>
  <userMaxJobsDefault>10</userMaxJobsDefault>
  <defaultMinSharePreemptionTimeout>30</defaultMinSharePreemptionTimeout>
  <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
</allocations>

regards, 2012-03-07 hao.wang -- Harsh J
Re: Re: Fair Scheduler Problem
Hi, Thanks for your reply! I have solved this problem by setting mapred.fairscheduler.preemption.only.log to false. The preemption works! But I don't know why I cannot set mapred.fairscheduler.preemption.only.log to true. Is it a bug? regards, 2012-03-07 hao.wang From: Harsh J Sent: 2012-03-07 14:14:05 To: common-user Cc: Subject: Re: Fair Scheduler Problem Hello Hao, It's best to submit CDH user queries to https://groups.google.com/a/cloudera.org/group/cdh-user/topics (cdh-u...@cloudera.org), where the majority of the CDH user community resides. How do you determine that preemption did not/does not work? Preemption between pools occurs if a pool's minShare isn't satisfied within preemption-timeout seconds. In this case, it will preempt tasks from other pools. Your settings look alright on a high level. Does your log not carry any preemption logs? What was your pool's share scenario when you tried to observe if it works or not? -- Harsh J
Re: Re: Fair Scheduler Problem
Ah, my bad that I missed it when reading your doc. Yes, that property being true would make it only LOG about preemption scenarios, not do preemption. On Wed, Mar 7, 2012 at 12:05 PM, hao.wang hao.w...@ipinyou.com wrote: Hi, Thanks for your reply! I have solved this problem by setting mapred.fairscheduler.preemption.only.log to false. The preemption works! But I don't know why I cannot set mapred.fairscheduler.preemption.only.log to true. Is it a bug? regards, 2012-03-07 hao.wang -- Harsh J
Re: how is userlogs supposed to be cleaned up?
Aside from cleanup, it seems like you are running into the maximum number of subdirectories per directory on ext3. Joep Sent from my iPhone On Mar 6, 2012, at 10:22 AM, Chris Curtin curtin.ch...@gmail.com wrote: Hi, We had a fun morning trying to figure out why our cluster was failing jobs, removing nodes from the cluster, etc. The majority of the errors were something like: Error initializing attempt_201203061035_0047_m_02_0: org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access `/disk1/userlogs/job_201203061035_0047': No such file or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:255) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240) at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216) at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352) Finally we shut down the entire cluster and found that the 'userlogs' directory on the failed nodes had 30,000+ directories and the 'live' nodes 25,000+. Looking at creation timestamps, it looks like around the 30,000th directory the node falls over. Many of the directories are weeks old and a few were months old. Deleting ALL the directories on all the nodes allowed us to bring the cluster up and things to run again. (Some users are claiming it is running faster now?) Our question: what is supposed to be cleaning up these directories? How often is that process or step run? We are running CDH3u3. Thanks, Chris