Re: how do i view the local file system output of a mapper on cygwin + windows?
Jane,

Yes, and that's documented: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)

"The framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of."

On Fri, Apr 6, 2012 at 6:26 AM, Jane Wayne wrote:
> i found out what my problem was. apparently, when you iterate over
> Iterable values, that instance of Type is being used over and over.
> for example, in my reducer,
>
>     public void reduce(Key key, Iterable<Value> values, Context context)
>             throws IOException, InterruptedException {
>         Iterator<Value> it = values.iterator();
>         Value a = it.next();
>         Value b = it.next();
>     }
>
> the variables, a and b of type Value, will be the same object
> instance! i suppose this behavior of the iterator is to optimize
> iterating so as to avoid the new operator.
>
> On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne wrote:
>> i am currently testing my map reduce job on Windows + Cygwin + Hadoop
>> v0.20.205. for some strange reason, the list of values (i.e.
>> Iterable values) going into the reducer looks all wrong. i have
>> tracked the map reduce process with logging statements (i.e. logged
>> the input to the map, logged the output from the map, logged the
>> partitioner, logged the input to the reducer). at all stages,
>> everything looks correct except at the reducer.
>>
>> is there any way (using Windows + Cygwin) to view the local map
>> outputs before they are shuffled/sorted to the reducer? i need to know
>> why the values are incorrect.

-- Harsh J
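Harsh's point is easy to demonstrate without Hadoop at all. The sketch below (plain Java, no Hadoop dependency; `ReuseDemo`, `ReusingIterator`, `Value`, and `copy()` are made-up names for illustration) mimics a framework iterator that refills a single object on every `next()`, the way MapReduce recycles its Writable instances, and shows why you must clone anything you want to keep:

```java
import java.util.Iterator;

public class ReuseDemo {
    // A tiny mutable value, standing in for a Hadoop Writable.
    static class Value {
        int n;
        Value copy() { Value v = new Value(); v.n = n; return v; }
    }

    // Mimics the framework: ONE Value instance, refilled on every next().
    static class ReusingIterator implements Iterator<Value> {
        private final int[] data;
        private int pos = 0;
        private final Value shared = new Value();
        ReusingIterator(int[] data) { this.data = data; }
        public boolean hasNext() { return pos < data.length; }
        public Value next() { shared.n = data[pos++]; return shared; }
    }

    public static void main(String[] args) {
        Iterator<Value> it = new ReusingIterator(new int[] {1, 2});
        Value a = it.next();
        Value b = it.next();
        System.out.println(a == b);            // true: same object instance
        System.out.println(a.n + " " + b.n);   // "2 2": the first value was overwritten

        // Cloning what you need to keep avoids the surprise.
        it = new ReusingIterator(new int[] {1, 2});
        Value c = it.next().copy();
        Value d = it.next().copy();
        System.out.println(c.n + " " + d.n);   // "1 2"
    }
}
```

With real Writables the same idea applies: copy with something like `new Text(value)` or `WritableUtils.clone(value, conf)` before stashing a value in a collection.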
Re: how do i view the local file system output of a mapper on cygwin + windows?
i found out what my problem was. apparently, when you iterate over Iterable values, that instance of Type is being used over and over. for example, in my reducer,

    public void reduce(Key key, Iterable<Value> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Value> it = values.iterator();
        Value a = it.next();
        Value b = it.next();
    }

the variables, a and b of type Value, will be the same object instance! i suppose this behavior of the iterator is to optimize iterating so as to avoid the new operator.

On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne wrote:
> i am currently testing my map reduce job on Windows + Cygwin + Hadoop
> v0.20.205. for some strange reason, the list of values (i.e.
> Iterable values) going into the reducer looks all wrong. i have
> tracked the map reduce process with logging statements (i.e. logged
> the input to the map, logged the output from the map, logged the
> partitioner, logged the input to the reducer). at all stages,
> everything looks correct except at the reducer.
>
> is there any way (using Windows + Cygwin) to view the local map
> outputs before they are shuffled/sorted to the reducer? i need to know
> why the values are incorrect.
Re: Hadoop streaming or pipes ..
Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send key,value data back to the Java process and then to reduce (more or less). I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be faster. Would be interested to know if the community has any experience with HCE performance. C On Apr 5, 2012, at 3:49 PM, Robert Evans wrote: > Both streaming and pipes do very similar things. They will fork/exec a > separate process that is running whatever you want it to run. The JVM that > is running hadoop then communicates with this process to send the data over > and get the processing results back. The difference between streaming and > pipes is that streaming uses stdin/stdout for this communication so > preexisting processing like grep, sed and awk can be used here. Pipes uses a > custom protocol with a C++ library to communicate. The C++ library is tagged > with SWIG compatible data so that it can be wrapped to have APIs in other > languages like python or perl. > > I am not sure what the performance difference is between the two, but in my > own work I have seen a significant performance penalty from using either of > them, because there is a somewhat large overhead of sending all of the data > out to a separate process just to read it back in again. > > --Bobby Evans > > > On 4/5/12 1:54 PM, "Mark question" wrote: > > Hi guys, > quick question: > Are there any performance gains from hadoop streaming or pipes over > Java? From what I've read, it's only to ease testing by using your favorite > language. So I guess it is eventually translated to bytecode then executed. > Is that true? > > Thank you, > Mark >
Re: Hadoop streaming or pipes ..
It is a regular process, unless you explicitly say you want it to be java, which would be a bit odd to do, but possible. --Bobby On 4/5/12 3:14 PM, "Mark question" wrote: Thanks for the response Robert .. so the overhead will be in read/write and communication. But is the new process spawned a JVM or a regular process? Thanks, Mark On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans wrote: > Both streaming and pipes do very similar things. They will fork/exec a > separate process that is running whatever you want it to run. The JVM that > is running hadoop then communicates with this process to send the data over > and get the processing results back. The difference between streaming and > pipes is that streaming uses stdin/stdout for this communication so > preexisting processing like grep, sed and awk can be used here. Pipes uses > a custom protocol with a C++ library to communicate. The C++ library is > tagged with SWIG compatible data so that it can be wrapped to have APIs in > other languages like python or perl. > > I am not sure what the performance difference is between the two, but in > my own work I have seen a significant performance penalty from using either > of them, because there is a somewhat large overhead of sending all of the > data out to a separate process just to read it back in again. > > --Bobby Evans > > > On 4/5/12 1:54 PM, "Mark question" wrote: > > Hi guys, > quick question: > Are there any performance gains from hadoop streaming or pipes over > Java? From what I've read, it's only to ease testing by using your favorite > language. So I guess it is eventually translated to bytecode then executed. > Is that true? > > Thank you, > Mark > >
how do i view the local file system output of a mapper on cygwin + windows?
i am currently testing my map reduce job on Windows + Cygwin + Hadoop v0.20.205. for some strange reason, the list of values (i.e. Iterable values) going into the reducer looks all wrong. i have tracked the map reduce process with logging statements (i.e. logged the input to the map, logged the output from the map, logged the partitioner, logged the input to the reducer). at all stages, everything looks correct except at the reducer.

is there any way (using Windows + Cygwin) to view the local map outputs before they are shuffled/sorted to the reducer? i need to know why the values are incorrect.
Re: Hadoop streaming or pipes ..
Thanks for the response Robert .. so the overhead will be in read/write and communication. But is the new process spawned a JVM or a regular process? Thanks, Mark On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans wrote: > Both streaming and pipes do very similar things. They will fork/exec a > separate process that is running whatever you want it to run. The JVM that > is running hadoop then communicates with this process to send the data over > and get the processing results back. The difference between streaming and > pipes is that streaming uses stdin/stdout for this communication so > preexisting processing like grep, sed and awk can be used here. Pipes uses > a custom protocol with a C++ library to communicate. The C++ library is > tagged with SWIG compatible data so that it can be wrapped to have APIs in > other languages like python or perl. > > I am not sure what the performance difference is between the two, but in > my own work I have seen a significant performance penalty from using either > of them, because there is a somewhat large overhead of sending all of the > data out to a separate process just to read it back in again. > > --Bobby Evans > > > On 4/5/12 1:54 PM, "Mark question" wrote: > > Hi guys, > quick question: > Are there any performance gains from hadoop streaming or pipes over > Java? From what I've read, it's only to ease testing by using your favorite > language. So I guess it is eventually translated to bytecode then executed. > Is that true? > > Thank you, > Mark > >
Re: Hadoop streaming or pipes ..
Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, "Mark question" wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
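Bobby's fork/exec-plus-stdin/stdout description can be sketched in a few lines of plain Java. This is a toy, not Hadoop code (it assumes a Unix `cat` binary on the PATH, playing the role of an identity streaming mapper; `StreamingSketch` and `pipeThrough` are invented names): the parent launches an external process, pushes "records" to its stdin, and reads the "output records" back from its stdout — the same round trip that accounts for the overhead he mentions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class StreamingSketch {
    // Launch an external "mapper", feed it input lines, collect its output lines.
    // Note: input is fully written before output is read, which is fine for
    // small inputs but would deadlock on data larger than the pipe buffer;
    // real frameworks pump the two streams concurrently.
    static List<String> pipeThrough(List<String> records, String... command) throws Exception {
        Process proc = new ProcessBuilder(command).start();
        try (PrintWriter stdin = new PrintWriter(new OutputStreamWriter(proc.getOutputStream()))) {
            for (String r : records) stdin.println(r);   // send records, one per line
        }
        List<String> out = new ArrayList<>();
        try (BufferedReader stdout = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = stdout.readLine()) != null) out.add(line);
        }
        proc.waitFor();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // "cat" echoes stdin to stdout, like an identity streaming mapper.
        System.out.println(pipeThrough(List.of("k1\tv1", "k2\tv2"), "cat"));
    }
}
```

Every record crosses two pipes and is serialized to text and parsed back, which is why running the map function in a separate process costs noticeably more than keeping it inside the JVM.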
Hadoop streaming or pipes ..
Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Hadoop pipes and streaming ..
Hi guys,

Two quick questions:

1. Are there any performance gains from hadoop streaming or pipes? As far as I have read, they are meant to ease testing using your favorite language, which I think implies that everything is eventually translated to bytecode and executed.
Analysing Hadoop log files offline.
Please take a look at https://github.com/rajvish/hadoop-summary

These scripts take Hadoop job logs and can provide either summary information or detailed per-task information. I have tested them on both 0.20.* and 1.0. The output is a CSV file that can be analysed with other programs such as Excel. I would love to get feedback.

-regards
Raj
Re: Doubt from the book "Definitive Guide"
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia wrote:
> Only advantage I was thinking of was that in some cases reducers might be
> able to take advantage of data locality and avoid multiple HTTP calls, no?
> Data is anyways written, so the last merged file could go on HDFS instead of
> local disk. I am new to hadoop so just asking questions to understand the
> rationale behind using local disk for final output.

So basically it's a tradeoff here: you get more replicas to copy from, but you have 2 more copies to write. Considering that the data is very short-lived and that it doesn't need to be replicated (since if the machine fails the maps are replayed anyway), writing 2 replicas that are potentially never used would be hurtful.

Regarding locality, it might make sense on a small cluster, but the more nodes you add, the smaller the chance of having local replicas for each block of data you're looking for.

J-D
Re: Jobtracker history logs missing
Thanks Nitin. I believe the config key you mentioned controls the task attempt logs that go under ${hadoop.log.dir}/userlogs. The ones that I mentioned are the job history logs that go under ${hadoop.log.dir}/history and are specified by the key "hadoop.job.history.location". Are these cleaned up based on "mapred.userlog.retain.hours" too?

Also, this is what I am seeing in the history dir:

  Available Conf files - Mar 3rd - April 5th
  Available Job files  - Mar 3rd - April 3rd

There is no job file present after the 3rd of April, but conf files continue to be written.

Thanks,
Prashant

On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal <nitin.khandel...@germinait.com> wrote:
> Hi Prashant,
>
> The userlogs for a job are deleted after the time specified by the
> "mapred.userlog.retain.hours" property defined in mapred-site.xml
> (default is 24 hrs).
>
> Thanks,
> Nitin
>
> On 5 April 2012 14:26, Prashant Kommireddi wrote:
>> I am noticing something strange with JobTracker history logs on my cluster.
>> I see configuration files (*_conf.xml) under /logs/history/ but none of the
>> actual job logs. Anyone has ideas on what might be happening?
>>
>> Thanks,
>
> --
> Nitin Khandelwal
Re: getting NullPointerException while running Word count example
Hi Sujit,

I think it is a problem with the host name configuration. Could you please check whether you added the host names of the master and the slaves in the /etc/hosts file of all the nodes.

On Mon, Apr 2, 2012 at 8:00 PM, Sujit Dhamale wrote:
> Can some one please look in to the below issue?
> Thanks in advance
>
> On Wed, Mar 7, 2012 at 9:09 AM, Sujit Dhamale wrote:
>> Hadoop version: hadoop-0.20.203.0rc1.tar
>> Operating system: Ubuntu 11.10
>>
>> On Wed, Mar 7, 2012 at 12:19 AM, Harsh J wrote:
>>> Hi Sujit,
>>>
>>> Please also tell us which version/distribution of Hadoop is this?
>>>
>>> On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale wrote:
>>>> Hi,
>>>>
>>>> I am new to Hadoop. I installed Hadoop as per
>>>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>>>>
>>>> while running the Word count example i am getting NullPointerException.
>>>> can some one please look in to this issue? Thanks in advance!
>>>>
>>>> hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
>>>> Found 3 items
>>>> -rw-r--r-- 1 hduser supergroup  674566 2012-03-06 23:04 /user/hduser/data/pg20417.txt
>>>> -rw-r--r-- 1 hduser supergroup 1573150 2012-03-06 23:04 /user/hduser/data/pg4300.txt
>>>> -rw-r--r-- 1 hduser supergroup 1423801 2012-03-06 23:04 /user/hduser/data/pg5000.txt
>>>>
>>>> hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/data /user/hduser/gutenberg-outputd
>>>>
>>>> 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process : 3
>>>> 12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002
>>>> 12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
>>>> 12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 12/03/06 23:14:58 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_0, Status : FAILED
>>>> Error: java.lang.NullPointerException
>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
>>>>
>>>> 12/03/06 23:15:07 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_1, Status : FAILED
>>>> Error: java.lang.NullPointerException
>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
>>>>
>>>> 12/03/06 23:15:16 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_2, Status : FAILED
>>>> Error: java.lang.NullPointerException
>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
>>>>
>>>> 12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002
>>>> 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     Launched reduce tasks=4
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22084
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     Launched map tasks=3
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     Data-local map tasks=3
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     Failed reduce tasks=1
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16799
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_READ=740520
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     HDFS_BYTES_READ=3671863
>>>> 12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2278287
>>>> 12/03/06 23:15:31 INFO mapred.JobC
Re: Doubt from the book "Definitive Guide"
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi wrote:
> Hi Mohit,
>
> What would be the advantage? Reducers in most cases read data from all
> the mappers. In the case where mappers were to write to HDFS, a
> reducer would still require to read data from other datanodes across
> the cluster.

Only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? Data is anyways written, so the last merged file could go on HDFS instead of local disk. I am new to hadoop so just asking questions to understand the rationale behind using local disk for final output.

> Prashant
>
> On Apr 4, 2012, at 9:55 PM, Mohit Anchlia wrote:
>> On Wed, Apr 4, 2012 at 8:42 PM, Harsh J wrote:
>>> Hi Mohit,
>>>
>>> On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia wrote:
>>>> I am going through the chapter "How mapreduce works" and have some
>>>> confusion:
>>>>
>>>> 1) Below description of Mapper says that reducers get the output file using
>>>> HTTP call. But the description under "The Reduce Side" doesn't specifically
>>>> say if it's copied using HTTP. So first confusion: is the output copied
>>>> from mapper -> reducer or from reducer -> mapper? And second, is the call
>>>> http:// or hdfs://
>>>
>>> The flow is simple as this:
>>> 1. For a M+R job, map completes its task after writing all partitions
>>> down into the tasktracker's local filesystem (under mapred.local.dir
>>> directories).
>>> 2. Reducers fetch completion locations from events at the JobTracker, and
>>> query the TaskTracker there to provide the specific partition each
>>> needs, which is done over the TaskTracker's HTTP service (50060).
>>>
>>> So to clear things up - map doesn't send it to reduce, nor does reduce
>>> ask the actual map task. It is the tasktracker itself that makes the
>>> bridge here.
>>>
>>> Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
>>> be over Netty connections. This would be much faster and more
>>> reliable.
>>>
>>>> 2) My understanding was that mapper output gets written to HDFS, since I've
>>>> seen part-m-0 files in HDFS. If mapper output is written to HDFS then
>>>> shouldn't reducers simply read it from HDFS instead of making HTTP calls to
>>>> the tasktrackers' location?
>>>
>>> A map-only job usually writes out to HDFS directly (no sorting done,
>>> cause no reducer is involved). If the job is a map+reduce one, the
>>> default output is collected to the local filesystem for partitioning and
>>> sorting at the map end, and eventually grouping at the reduce end. Basically:
>>> data you want to send to the reducer from the mapper goes to the local FS
>>> for multiple actions to be performed on it; other data may go directly
>>> to HDFS.
>>>
>>> Reducers currently are scheduled pretty randomly, but yes, their
>>> scheduling can be improved for certain scenarios. However, if you are
>>> suggesting that map partitions ought to be written to HDFS itself (with
>>> replication or without), I don't see performance improving. Note that
>>> the partitions aren't merely written but need to be sorted as well (at
>>> either end). To do that would need the ability to spill frequently (cause
>>> we don't have infinite memory to do it all in RAM), and doing such a
>>> thing on HDFS would only mean a slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end of the shuffle) is stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once reducer threads read it they can continue to work locally.

>>> I hope this helps clear some things up for you.
>>>
>>> --
>>> Harsh J
Re: map task execution time
Yes, how can we use "hadoop job" to get MR job stats, especially constituent task finish times? On Thu, Apr 5, 2012 at 9:02 AM, Jay Vyas wrote: > (excuse the typo in the last email : I meant "I've been playing with Cinch" > , not "I've been with Cinch") > > On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas wrote: > > > How can "hadoop job" be used to read m/r statistics ? > > > > On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma >wrote: > > > >> Thanks Kai, I will try those. > >> > >> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt wrote: > >> > >> > Hi, > >> > > >> > Am 05.04.2012 um 00:20 schrieb bikash sharma: > >> > > >> > > Is it possible to get the execution time of the constituent > map/reduce > >> > > tasks of a MapReduce job (say sort) at the end of a job run? > >> > > Preferably, can we obtain this programatically? > >> > > >> > > >> > you can access the JobTracker's web UI and see the start and stop > >> > timestamps for every individual task. > >> > > >> > Since the JobTracker Java API is exposed, you can write your own > >> > application to fetch that data through your own code. > >> > > >> > Also, "hadoop job" on the command line can be used to read job > >> statistics. > >> > > >> > Kai > >> > > >> > > >> > -- > >> > Kai Voigt > >> > k...@123.org > >> > > >> > > >> > > >> > > >> > > >> > > > > > > > > -- > > Jay Vyas > > MMSB/UCHC > > > > > > -- > Jay Vyas > MMSB/UCHC >
Re: map task execution time
(excuse the typo in the last email : I meant "I've been playing with Cinch" , not "I've been with Cinch") On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas wrote: > How can "hadoop job" be used to read m/r statistics ? > > On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma wrote: > >> Thanks Kai, I will try those. >> >> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt wrote: >> >> > Hi, >> > >> > Am 05.04.2012 um 00:20 schrieb bikash sharma: >> > >> > > Is it possible to get the execution time of the constituent map/reduce >> > > tasks of a MapReduce job (say sort) at the end of a job run? >> > > Preferably, can we obtain this programatically? >> > >> > >> > you can access the JobTracker's web UI and see the start and stop >> > timestamps for every individual task. >> > >> > Since the JobTracker Java API is exposed, you can write your own >> > application to fetch that data through your own code. >> > >> > Also, "hadoop job" on the command line can be used to read job >> statistics. >> > >> > Kai >> > >> > >> > -- >> > Kai Voigt >> > k...@123.org >> > >> > >> > >> > >> > >> > > > > -- > Jay Vyas > MMSB/UCHC > -- Jay Vyas MMSB/UCHC
Re: map task execution time
How can "hadoop job" be used to read m/r statistics ? On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma wrote: > Thanks Kai, I will try those. > > On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt wrote: > > > Hi, > > > > Am 05.04.2012 um 00:20 schrieb bikash sharma: > > > > > Is it possible to get the execution time of the constituent map/reduce > > > tasks of a MapReduce job (say sort) at the end of a job run? > > > Preferably, can we obtain this programatically? > > > > > > you can access the JobTracker's web UI and see the start and stop > > timestamps for every individual task. > > > > Since the JobTracker Java API is exposed, you can write your own > > application to fetch that data through your own code. > > > > Also, "hadoop job" on the command line can be used to read job > statistics. > > > > Kai > > > > > > -- > > Kai Voigt > > k...@123.org > > > > > > > > > > > -- Jay Vyas MMSB/UCHC
Re: map task execution time
Thanks Kai, I will try those.

On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt wrote:
> Hi,
>
> On 05.04.2012 at 00:20, bikash sharma wrote:
>> Is it possible to get the execution time of the constituent map/reduce
>> tasks of a MapReduce job (say sort) at the end of a job run?
>> Preferably, can we obtain this programmatically?
>
> You can access the JobTracker's web UI and see the start and stop
> timestamps for every individual task.
>
> Since the JobTracker Java API is exposed, you can write your own
> application to fetch that data through your own code.
>
> Also, "hadoop job" on the command line can be used to read job statistics.
>
> Kai
>
> --
> Kai Voigt
> k...@123.org
Re: Jobtracker history logs missing
Hi Prashant,

The user logs for a job are deleted after the time specified by the "mapred.userlog.retain.hours" property defined in mapred-site.xml (the default is 24 hours).

Thanks,
Nitin

On 5 April 2012 14:26, Prashant Kommireddi wrote:
> I am noticing something strange with JobTracker history logs on my cluster.
> I see configuration files (*_conf.xml) under /logs/history/ but none of the
> actual job logs. Anyone has ideas on what might be happening?
>
> Thanks,

--
Nitin Khandelwal
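For reference, both kinds of retention discussed in this thread are driven by mapred-site.xml. A sketch of the relevant properties (the names exist in 0.20.x/1.x; the values and the history path below are only example assumptions, not recommendations):

```xml
<configuration>
  <!-- How long per-task attempt logs under ${hadoop.log.dir}/userlogs are kept. -->
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>24</value>
  </property>
  <!-- Where completed-job history (job file plus *_conf.xml) is written. -->
  <property>
    <name>hadoop.job.history.location</name>
    <value>file:///var/log/hadoop/history</value>
  </property>
</configuration>
```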
Re: cloudera vs apache hadoop migration stories ?
Hi Jay,

Probably makes sense to move this to the cdh-user list if you have any Cloudera-specific questions. But I just wanted to clarify: CDH doesn't make any API changes that aren't already upstream. So, in some places, CDH may be ahead of whatever Apache release you are comparing against, but it is always made up of patches from the Apache trunk. In the specific case of MultipleInputs, we did backport the new API implementation from Apache Hadoop 0.21+.

If you find something in CDH that you would like backported to upstream Apache Hadoop 1.0.x, please feel free to file a JIRA and assign it to me - I'm happy to look into it for you.

Thanks
Todd

On Wed, Apr 4, 2012 at 10:15 AM, Jay Vyas wrote:
> Seems like cloudera and standard apache-hadoop are really not cross
> compatible. Things like MultipleInputs and stuff that we are finding don't
> work the same. Any good (recent) war stories on the migration between the
> two?
>
> It's interesting to me that cloudera and amazon are that difficult to swap
> in/out in cloud.

--
Todd Lipcon
Software Engineer, Cloudera
Re: map task execution time
Hi,

On 05.04.2012 at 00:20, bikash sharma wrote:
> Is it possible to get the execution time of the constituent map/reduce
> tasks of a MapReduce job (say sort) at the end of a job run?
> Preferably, can we obtain this programmatically?

You can access the JobTracker's web UI and see the start and stop timestamps for every individual task.

Since the JobTracker Java API is exposed, you can write your own application to fetch that data through your own code.

Also, "hadoop job" on the command line can be used to read job statistics.

Kai

--
Kai Voigt
k...@123.org