Re: map task execution time
Hi, On 05.04.2012 at 00:20, bikash sharma wrote: Is it possible to get the execution time of the constituent map/reduce tasks of a MapReduce job (say sort) at the end of a job run? Preferably, can we obtain this programmatically? You can access the JobTracker's web UI and see the start and stop timestamps for every individual task. Since the JobTracker Java API is exposed, you can write your own application to fetch that data through your own code. Also, the hadoop job command-line tool can be used to read job statistics. Kai -- Kai Voigt k...@123.org
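A minimal sketch of the programmatic route Kai describes, using the old org.apache.hadoop.mapred JobClient API from the 0.20/1.0 line; the TaskTimes class name and its main() scaffold are my own illustration, not anything from the thread:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.TaskReport;

    public class TaskTimes {
      public static void main(String[] args) throws Exception {
        // Connect to the JobTracker named in the local Hadoop configuration.
        JobClient client = new JobClient(new JobConf());
        JobID id = JobID.forName(args[0]); // e.g. "job_201203062221_0002"
        // Each TaskReport carries the start/finish timestamps shown in the web UI.
        for (TaskReport report : client.getMapTaskReports(id)) {
          System.out.println(report.getTaskID() + " ran for "
              + (report.getFinishTime() - report.getStartTime()) + " ms");
        }
        // client.getReduceTaskReports(id) works the same way for reduce tasks.
      }
    }

On the command line, hadoop job -history <job-output-dir> prints similar per-task timings once a job has finished.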
Re: cloudera vs apache hadoop migration stories ?
Hi Jay, It probably makes sense to move this to the cdh-user list if you have any Cloudera-specific questions. But I just wanted to clarify: CDH doesn't make any API changes that aren't already upstream. So, in some places, CDH may be ahead of whatever Apache release you are comparing against, but it is always made up of patches from the Apache trunk. In the specific case of MultipleInputs, we did backport the new API implementation from Apache Hadoop 0.21+. If you find something in CDH that you would like backported to upstream Apache Hadoop 1.0.x, please feel free to file a JIRA and assign it to me - I'm happy to look into it for you. Thanks, Todd On Wed, Apr 4, 2012 at 10:15 AM, Jay Vyas jayunit...@gmail.com wrote: It seems like Cloudera and standard Apache Hadoop are really not cross-compatible. Things like MultipleInputs, we are finding, don't work the same. Any good (recent) war stories on the migration between the two? It's interesting to me that Cloudera and Amazon are that difficult to swap in/out in the cloud. -- Todd Lipcon Software Engineer, Cloudera
Re: Jobtracker history logs missing
Hi Prashant, The userlogs for a job are deleted after the time specified by the *mapred.userlog.retain.hours* property defined in mapred-site.xml (the default is 24 hours). Thanks, Nitin On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote: I am noticing something strange with the JobTracker history logs on my cluster. I see configuration files (*_conf.xml) under /logs/history/ but none of the actual job logs. Does anyone have ideas on what might be happening? Thanks, -- Nitin Khandelwal
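If you want to double-check what a given cluster actually resolves this property to, here is a small sketch (mine, not from the thread) that reads it with the 24-hour default as the fallback:

    import org.apache.hadoop.conf.Configuration;

    public class RetainHours {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("mapred-site.xml"); // picked up from the classpath if present
        // Falls back to the framework default of 24 hours when the key is unset.
        int hours = conf.getInt("mapred.userlog.retain.hours", 24);
        System.out.println("userlogs retained for " + hours + " hours");
      }
    }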
Re: map task execution time
Thanks Kai, I will try those. On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote: Hi, On 05.04.2012 at 00:20, bikash sharma wrote: Is it possible to get the execution time of the constituent map/reduce tasks of a MapReduce job (say sort) at the end of a job run? Preferably, can we obtain this programmatically? You can access the JobTracker's web UI and see the start and stop timestamps for every individual task. Since the JobTracker Java API is exposed, you can write your own application to fetch that data through your own code. Also, the hadoop job command-line tool can be used to read job statistics. Kai -- Kai Voigt k...@123.org
Re: map task execution time
How can hadoop job be used to read m/r statistics? On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma sharmabiks...@gmail.com wrote: Thanks Kai, I will try those. On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote: Hi, On 05.04.2012 at 00:20, bikash sharma wrote: Is it possible to get the execution time of the constituent map/reduce tasks of a MapReduce job (say sort) at the end of a job run? Preferably, can we obtain this programmatically? You can access the JobTracker's web UI and see the start and stop timestamps for every individual task. Since the JobTracker Java API is exposed, you can write your own application to fetch that data through your own code. Also, the hadoop job command-line tool can be used to read job statistics. Kai -- Kai Voigt k...@123.org -- Jay Vyas MMSB/UCHC
Re: map task execution time
(Excuse the typo in the last email: I meant I've been playing with Cinch, not I've been with Cinch.) On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas jayunit...@gmail.com wrote: How can hadoop job be used to read m/r statistics? On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma sharmabiks...@gmail.com wrote: Thanks Kai, I will try those. On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote: Hi, On 05.04.2012 at 00:20, bikash sharma wrote: Is it possible to get the execution time of the constituent map/reduce tasks of a MapReduce job (say sort) at the end of a job run? Preferably, can we obtain this programmatically? You can access the JobTracker's web UI and see the start and stop timestamps for every individual task. Since the JobTracker Java API is exposed, you can write your own application to fetch that data through your own code. Also, the hadoop job command-line tool can be used to read job statistics. Kai -- Kai Voigt k...@123.org -- Jay Vyas MMSB/UCHC -- Jay Vyas MMSB/UCHC
Re: map task execution time
Yes, how can we use hadoop job to get MR job stats, especially constituent task finish times? On Thu, Apr 5, 2012 at 9:02 AM, Jay Vyas jayunit...@gmail.com wrote: (Excuse the typo in the last email: I meant I've been playing with Cinch, not I've been with Cinch.) On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas jayunit...@gmail.com wrote: How can hadoop job be used to read m/r statistics? On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma sharmabiks...@gmail.com wrote: Thanks Kai, I will try those. On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote: Hi, On 05.04.2012 at 00:20, bikash sharma wrote: Is it possible to get the execution time of the constituent map/reduce tasks of a MapReduce job (say sort) at the end of a job run? Preferably, can we obtain this programmatically? You can access the JobTracker's web UI and see the start and stop timestamps for every individual task. Since the JobTracker Java API is exposed, you can write your own application to fetch that data through your own code. Also, the hadoop job command-line tool can be used to read job statistics. Kai -- Kai Voigt k...@123.org -- Jay Vyas MMSB/UCHC -- Jay Vyas MMSB/UCHC
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote: Hi Mohit, What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still need to read data from other datanodes across the cluster. The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go on HDFS instead of local disk. I am new to Hadoop, so I am just asking questions to understand the rationale behind using local disk for the final output. Prashant On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit, On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going through the chapter How MapReduce Works and have some confusion: 1) The description of the mapper below says that reducers get the output file using an HTTP call. But the description under The Reduce Side doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second: is the call http:// or hdfs://? The flow is as simple as this: 1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories). 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide them the specific partition they need, which is done over the TaskTracker's HTTP service (port 50060). So, to clear things up: the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler will be over Netty connections. This should be much faster and more reliable. 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktrackers' locations? A map-only job usually writes out to HDFS directly (no sorting is done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to the reducer from the mapper goes to the local FS so multiple actions can be performed on it; other data may go directly to HDFS. Reducers currently are scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown. Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) is stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it, they can continue to work locally. I hope this helps clear some things up for you. -- Harsh J
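As a small aside, here is a runnable sketch (mine, not Harsh's) of the map-only case he mentions: with zero reduce tasks the local-disk sort/shuffle path is skipped and each mapper writes a part-m-* file straight to HDFS. The class name and argument handling are placeholders for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyExample {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyExample.class);
        job.setMapperClass(Mapper.class); // the identity mapper
        job.setNumReduceTasks(0);         // no reducers: map output goes straight to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }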
Re: getting NullPointerException while running Word cont example
Hi Sujit, I think it is a problem with the host name configuration. Could you please check whether you have added the host names of the master and the slaves to the /etc/hosts file on all of the nodes? On Mon, Apr 2, 2012 at 8:00 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Can someone please look into the below issue? Thanks in advance. On Wed, Mar 7, 2012 at 9:09 AM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hadoop version: hadoop-0.20.203.0rc1.tar Operating System: Ubuntu 11.10 On Wed, Mar 7, 2012 at 12:19 AM, Harsh J ha...@cloudera.com wrote: Hi Sujit, Please also tell us which version/distribution of Hadoop this is? On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, I am new to Hadoop. I installed Hadoop as per http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ While running the word count example I am getting a NullPointerException. Can someone please look into this issue? Thanks in advance!

hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
Found 3 items
-rw-r--r-- 1 hduser supergroup  674566 2012-03-06 23:04 /user/hduser/data/pg20417.txt
-rw-r--r-- 1 hduser supergroup 1573150 2012-03-06 23:04 /user/hduser/data/pg4300.txt
-rw-r--r-- 1 hduser supergroup 1423801 2012-03-06 23:04 /user/hduser/data/pg5000.txt
hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/data /user/hduser/gutenberg-outputd
12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process : 3
12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002
12/03/06 23:14:34 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 23:14:49 INFO mapred.JobClient: map 66% reduce 0%
12/03/06 23:14:55 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 23:14:58 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_0, Status : FAILED
Error: java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
12/03/06 23:15:07 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_1, Status : FAILED
Error: java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
12/03/06 23:15:16 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_2, Status : FAILED
Error: java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002
12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
12/03/06 23:15:31 INFO mapred.JobClient:     Launched reduce tasks=4
12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22084
12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 23:15:31 INFO mapred.JobClient:     Launched map tasks=3
12/03/06 23:15:31 INFO mapred.JobClient:     Data-local map tasks=3
12/03/06 23:15:31 INFO mapred.JobClient:     Failed reduce tasks=1
12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16799
12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_READ=740520
12/03/06 23:15:31 INFO mapred.JobClient:     HDFS_BYTES_READ=3671863
12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2278287
12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
12/03/06 23:15:31 INFO mapred.JobClient:     Bytes Read=3671517
12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
12/03/06 23:15:31 INFO mapred.JobClient:     Map output materialized bytes=1474341
12/03/06 23:15:31 INFO mapred.JobClient:     Combine output records=102322
Re: Jobtracker history logs missing
Thanks Nitin. I believe the config key you mentioned controls the task attempt logs that go under ${hadoop.log.dir}/userlogs. The ones that I mentioned are the job history logs that go under ${hadoop.log.dir}/history and are specified by the key hadoop.job.history.location. Are these cleaned up based on mapred.userlog.retain.hours too? Also, this is what I am seeing in the history dir:
Available conf files: Mar 3rd - April 5th
Available job files: Mar 3rd - April 3rd
There is no job file present after the 3rd of April, but conf files continue to be written. Thanks, Prashant On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal nitin.khandel...@germinait.com wrote: Hi Prashant, The userlogs for a job are deleted after the time specified by the *mapred.userlog.retain.hours* property defined in mapred-site.xml (the default is 24 hours). Thanks, Nitin On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote: I am noticing something strange with the JobTracker history logs on my cluster. I see configuration files (*_conf.xml) under /logs/history/ but none of the actual job logs. Does anyone have ideas on what might be happening? Thanks, -- Nitin Khandelwal
Re: Doubt from the book Definitive Guide
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia mohitanch...@gmail.com wrote: The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go on HDFS instead of local disk. I am new to Hadoop, so I am just asking questions to understand the rationale behind using local disk for the final output. So basically it's a tradeoff here: you get more replicas to copy from, but you have two more copies to write. Considering that the data is very short-lived and that it doesn't need to be replicated (since if the machine fails the maps are replayed anyway), it seems that writing two replicas that are potentially unused would be hurtful. Regarding locality, it might make sense on a small cluster, but the more nodes you add, the smaller the chance of having local replicas for each block of data you're looking for. J-D
Hadoop pipes and streaming ..
Hi guys, two quick questions: 1. Are there any performance gains from Hadoop streaming or pipes? As far as I have read, they exist to ease testing using your favorite language, which I think implies that everything is eventually translated to bytecode and executed.
Hadoop streaming or pipes ..
Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Re: Hadoop streaming or pipes ..
Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
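To make the fork/exec point concrete, here is a self-contained Java sketch (not Hadoop code, just my own illustration of the mechanism) that spawns a child process and streams tab-separated records over its stdin/stdout, which is essentially how streaming talks to each task; /bin/cat stands in for the external mapper:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    public class PipeDemo {
      public static void main(String[] args) throws Exception {
        // Fork/exec the child process, as the streaming framework does per task.
        Process child = new ProcessBuilder("/bin/cat").start();
        // Write records to the child's stdin, tab-separated as streaming does.
        PrintWriter toChild = new PrintWriter(new OutputStreamWriter(child.getOutputStream()));
        toChild.println("key1\tvalue1");
        toChild.println("key2\tvalue2");
        toChild.close(); // closing stdin signals end of input
        // Read the child's stdout back, the way the framework collects mapper output.
        BufferedReader fromChild = new BufferedReader(new InputStreamReader(child.getInputStream()));
        String line;
        while ((line = fromChild.readLine()) != null) {
          System.out.println("got back: " + line);
        }
        child.waitFor();
      }
    }

Every record makes that round trip through the pipe, which is where the overhead Bobby describes comes from.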
Re: Hadoop streaming or pipes ..
Thanks for the response Robert .. so the overhead will be in read/write and communication. But is the new process spawned a JVM or a regular process? Thanks, Mark On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
how do i view the local file system output of a mapper on cygwin + windows?
I am currently testing my map reduce job on Windows + Cygwin + Hadoop v0.20.205. For some strange reason, the list of values (i.e. Iterable<T> values) going into the reducer looks all wrong. I have tracked the map reduce process with logging statements (i.e. logged the input to the map, logged the output from the map, logged the partitioner, logged the input to the reducer). At all stages, everything looks correct except at the reducer. Is there any way (using Windows + Cygwin) to view the local map outputs before they are shuffled/sorted to the reducer? I need to know why the values are incorrect.
Re: Hadoop streaming or pipes ..
Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send key/value data back to the Java process and then on to the reduce (more or less). I think that the Hadoop C++ Extension (HCE, there is a patch) is supposed to be faster. I would be interested to know if the community has any experience with HCE performance. C On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Re: how do i view the local file system output of a mapper on cygwin + windows?
I found out what my problem was. Apparently, when you iterate over Iterable<Type> values, that instance of Type is being reused over and over. For example, in my reducer, public void reduce(Key key, Iterable<Value> values, Context context) throws IOException, InterruptedException { Iterator<Value> it = values.iterator(); Value a = it.next(); Value b = it.next(); } the variables a and b, both of type Value, will be the same object instance! I suppose this behavior of the iterator is an optimization of iteration to avoid the new operator. On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne jane.wayne2...@gmail.com wrote: I am currently testing my map reduce job on Windows + Cygwin + Hadoop v0.20.205. For some strange reason, the list of values (i.e. Iterable<T> values) going into the reducer looks all wrong. I have tracked the map reduce process with logging statements (i.e. logged the input to the map, logged the output from the map, logged the partitioner, logged the input to the reducer). At all stages, everything looks correct except at the reducer. Is there any way (using Windows + Cygwin) to view the local map outputs before they are shuffled/sorted to the reducer? I need to know why the values are incorrect.
Re: how do i view the local file system output of a mapper on cygwin + windows?
Jane, Yes, and that's documented: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter) The framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of. On Fri, Apr 6, 2012 at 6:26 AM, Jane Wayne jane.wayne2...@gmail.com wrote: I found out what my problem was. Apparently, when you iterate over Iterable<Type> values, that instance of Type is being reused over and over. For example, in my reducer, public void reduce(Key key, Iterable<Value> values, Context context) throws IOException, InterruptedException { Iterator<Value> it = values.iterator(); Value a = it.next(); Value b = it.next(); } the variables a and b, both of type Value, will be the same object instance! I suppose this behavior of the iterator is an optimization of iteration to avoid the new operator. On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne jane.wayne2...@gmail.com wrote: I am currently testing my map reduce job on Windows + Cygwin + Hadoop v0.20.205. For some strange reason, the list of values (i.e. Iterable<T> values) going into the reducer looks all wrong. I have tracked the map reduce process with logging statements (i.e. logged the input to the map, logged the output from the map, logged the partitioner, logged the input to the reducer). At all stages, everything looks correct except at the reducer. Is there any way (using Windows + Cygwin) to view the local map outputs before they are shuffled/sorted to the reducer? I need to know why the values are incorrect. -- Harsh J
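For anyone who lands on this thread later, a minimal sketch of the cloning the docs call for, written against the new (org.apache.hadoop.mapreduce) API; Text keys and values are an assumption for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CloningReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<Text> kept = new ArrayList<Text>();
        for (Text value : values) {
          kept.add(new Text(value)); // copy: the framework reuses the 'value' object
        }
        for (Text v : kept) {
          context.write(key, v);
        }
      }
    }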