Yahoo Hadoop Tutorial with new APIs?
Regards to all the list. Many people use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ (for example, http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining). The main issue is that this tutorial is written against the old APIs (Hadoop 0.18, I think). Is there a project to update this tutorial to the new APIs, i.e. to Hadoop 1.0.2 or YARN (Hadoop 0.23)? Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu
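For context, the gap Marcos describes is the move from the old org.apache.hadoop.mapred API, which the Yahoo tutorial uses, to the newer org.apache.hadoop.mapreduce API. A minimal sketch of a mapper written against the new API follows; it uses the stock word-count example rather than anything from the tutorial itself, and the class name is made up for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: extends the Mapper base class from org.apache.hadoop.mapreduce.
// The old API instead implemented the Mapper interface from org.apache.hadoop.mapred
// and emitted records through an OutputCollector/Reporter pair rather than a Context.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (token, 1) for every token in the line
        }
    }
}

The corresponding driver would use org.apache.hadoop.mapreduce.Job rather than the old JobConf/JobClient pair, which is the bulk of what an updated tutorial would need to change.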
Re: Yahoo Hadoop Tutorial with new APIs?
Hello Marcos, Yes, the Yahoo tutorials are pretty old, but they still explain the concepts of MapReduce and HDFS beautifully. The way the tutorials are divided into subsections, each building on the previous one, is awesome. I remember when I started I was buried in there for many days. The tutorials are lagging behind now from the new-API point of view. Let's have a documentation session one day; I would love to volunteer to update those tutorials if the people at Yahoo take input from the outside world :) Regards, Jagat
Re: Yahoo Hadoop Tutorial with new APIs?
On 04/04/2012 09:15 AM, Jagat Singh wrote:
> Yes, the Yahoo tutorials are pretty old, but they still explain the concepts of MapReduce and HDFS beautifully. The tutorials are lagging behind now from the new-API point of view.
Yes, and for that reason, and for its quality, this tutorial is read by many Hadoop newcomers, so I think it needs an update.
> Let's have a documentation session one day; I would love to volunteer to update those tutorials if the people at Yahoo take input from the outside world :)
I want to help with this too, so we need to talk with the Hadoop colleagues to do it. Regards and best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu
opensuse 12.1
What is the best way to install Hadoop on openSUSE 12.1 for a small two-node cluster? -SB
FW: opensuse 12.1
-Original Message- From: Barry, Sean F [mailto:sean.f.ba...@intel.com] Sent: Wednesday, April 04, 2012 9:10 AM To: common-user@hadoop.apache.org Subject: opensuse 12.1 What is the best way to install Hadoop on openSUSE 12.1 for a small two-node cluster? -SB
Re: opensuse 12.1
Lots of people seem to start with this. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ Raj
Re: how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
How many datanodes do you use for your job? On 4/3/12 8:11 PM, Jane Wayne jane.wayne2...@gmail.com wrote: I don't have the option of setting the map heap size to 2 GB, since my real environment is AWS EMR and the constraints are set. http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html is where I am currently reading up on the meaning of io.sort.factor and io.sort.mb. It seems io.sort.mb tunes the map tasks and io.sort.factor tunes the shuffle/reduce task. Am I correct to say, then, that io.sort.factor is not relevant here (yet, anyway), since I don't really make it to the reduce phase (except for only a very small data size)? In that link above, here is the description for io.sort.mb: "The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes." There's a paragraph above the table that says this value is simply the threshold that triggers a sort and spill to disk. Furthermore, it says, "If either buffer fills completely while the spill is in progress, the map thread will block", which is what I believe is happening in my case. This sentence concerns me: "Minimizing the number of spills to disk can decrease map time, but a larger buffer also decreases the memory available to the mapper." To minimize the number of spills, you need a larger buffer; however, this statement seems to suggest NOT minimizing the number of spills: a) you will not decrease map time, b) you will not decrease the memory available to the mapper. So, in your advice below, you say to increase, but I may actually want to decrease the value of io.sort.mb (if I understood the documentation correctly). It seems these three map tuning parameters, io.sort.mb, io.sort.record.percent, and io.sort.spill.percent, are a pain point trading off between speed and memory. To me, if you set them high, more serialized data + metadata are stored in memory before a spill (an I/O operation) is performed. You also get fewer merges (fewer I/O operations?), but the negatives are blocking map operations and higher memory requirements. If you set them low, there are more frequent spills (more I/O operations), but lower memory requirements. It just seems like no matter what you do, you are stuck: you may stall the mapper if the values are high, because of the amount of time required to spill an enormous amount of data; you may stall the mapper if the values are low, because of the number of I/O operations required (spill/merge). I must be understanding something wrong here, because everywhere I read, Hadoop is supposed to be #1 at sorting. But here, in dealing with the intermediary key-value pairs, in the process of sorting, mappers can stall for any number of reasons. Does anyone know a competitive dynamic Hadoop clustering service like AWS EMR? The reason I ask is that AWS EMR does not use HDFS (it uses S3), and therefore data locality is not possible. Also, I have read that the TCP protocol is not efficient for network transfers; if the S3 node and the task nodes are far apart, that distance will certainly exacerbate the slow speed. It seems there are a lot of factors working against me. Any help is appreciated. On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Jane, From my first look, properties that can help you could be: increase io.sort.factor to 100; increase io.sort.mb to 512 MB; increase the map task heap size to 2 GB. If the task still stalls, try providing less input for each mapper.
Regards, Bejoy KS. On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne jane.wayne2...@gmail.com wrote: I have a map reduce job that is generating a lot of intermediate key-value pairs. For example, when I am 1/3 complete with my map phase, I may have generated over 130,000,000 output records (which is about 9 gigabytes). Getting to the 1/3 complete mark is very fast (less than 10 minutes), but at the 1/3 complete mark, it seems to stall. When I look at the counter logs, I do not see any logging of spilling yet. However, on the web job UI, I see that FILE_BYTES_WRITTEN and Spilled Records keep increasing. Needless to say, I have to dig deeper to see what is going on. My question is, how do I fine-tune my map reduce job with the above properties, namely, the property of generating a lot of intermediate key-value pairs? It seems the I/O operations are negatively impacting the job speed. There are so many map- and reduce-side tuning properties (see Tom White, Hadoop: The Definitive Guide, 2nd edition, pp. 181-182) that I am a little unsure about just how to approach the tuning parameters. Since the slowdown is happening during the map phase/task, I assume I should narrow down on the map-side tuning properties. By the way, I am using the CPU-intensive c1.medium instances of Amazon Web Services' (AWS) Elastic MapReduce (EMR) on Hadoop 0.20. A compute node has 2 mappers, 1 reducer, and 384 MB of JVM memory per task. This instance type is documented
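For readers following the tuning discussion, the properties Bejoy and Jane mention are ordinary job configuration settings in the 0.20-era (old) API. The sketch below shows where they would be set when building a job; the class name and the specific values are purely illustrative and not a recommendation for EMR.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SortTuningSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SortTuningSketch.class);
        conf.setJobName("sort-tuning-sketch");

        // Map-side sort buffer size in MB; emitted records plus accounting data
        // are collected here before being sorted and spilled to local disk.
        conf.setInt("io.sort.mb", 256);
        // Fraction of the buffer that may fill before a background spill starts.
        conf.setFloat("io.sort.spill.percent", 0.80f);
        // Number of spill segments merged at once during on-disk merges.
        conf.setInt("io.sort.factor", 100);
        // Per-task JVM heap; on EMR c1.medium this is constrained by the platform.
        conf.set("mapred.child.java.opts", "-Xmx384m");

        // Identity mapper/reducer are the old-API defaults, so only the
        // output types and paths are needed to make this runnable.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The same properties can also be supplied at submission time (for example with -D through ToolRunner/GenericOptionsParser) instead of being hard-coded in the driver.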
Re: opensuse 12.1
Since openSUSE is an RPM-based distribution, you can try the Apache Bigtop project [1]: look for the RPM packages and give them a try. Note that the RPM specs differ a little between openSUSE and Red Hat-based distributions, but it can be a starting point. See the documentation for the project [2]. [1] http://incubator.apache.org/projects/bigtop.html [2] https://cwiki.apache.org/confluence/display/BIGTOP/Index Regards -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com
Re: Yahoo Hadoop Tutorial with new APIs?
I am dropping the cross posts and leaving this on common-user with the others BCCed. Marcos, That is a great idea to be able to update the tutorial, especially if the community is interested in helping to do so. We are looking into the best way to do this. The idea right now is to donate this to the Hadoop project so that the community can keep it up to date, but we need some time to jump through all of the corporate hoops to get this to happen. We have a lot going on right now, so if you don't see any progress on this please feel free to ping me and bug me about it. -- Bobby Evans
Re: Yahoo Hadoop Tutorial with new APIs?
Hi, any interest in joining this effort of mine? http://hadoopilluminated.com/ - I am also doing it only for community benefit. I have more chapters that I am putting out, but I want to keep the fun, informal style. Thanks, Mark
Re: Yahoo Hadoop Tutorial with new APIs?
Ok, Robert, I will be waiting for you then. There are many folks that use this tutorial, so I think this is a good effort in favor of the Hadoop community. It would be nice if Yahoo! donated this work, because I have some ideas behind it, for example releasing a Spanish version of the tutorial. Regards and best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu
Re: Yahoo Hadoop Tutorial with new APIs?
Nathan put together the steps on this blog: http://blog.milford.io/2012/01/kicking-the-tires-on-hadoop-0-23-pseudo-distributed-mode/ It fills out details that are missing from the official docs, such as

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value></value>
  <description>the local directories used by the nodemanager</description>
</property>

The official docs are at http://hadoop.apache.org/common/docs/r0.23.1/hadoop-yarn/hadoop-yarn-site/SingleCluster.html
Doubt from the book Definitive Guide
I am going through the chapter "How MapReduce Works" and have some confusion: 1) The description of the mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it is copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://? 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location? --- from the book --- Mapper: The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs. The Reduce Side: Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
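As a side note, the two knobs named in that excerpt are plain configuration properties. The sketch below shows where each one lives on a 0.20-era cluster; the values are purely illustrative, and tasktracker.http.threads is a daemon-side setting, so changing it inside a job has no effect.

import org.apache.hadoop.mapred.JobConf;

public class ShuffleKnobs {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Per-job: reduce-side copier threads fetching map output partitions in parallel (default 5).
        conf.setInt("mapred.reduce.parallel.copies", 10);
        // Cluster-side: HTTP worker threads on each tasktracker serving map output (default 40).
        // This belongs in mapred-site.xml on the tasktrackers; it is set here only for illustration.
        conf.setInt("tasktracker.http.threads", 80);
        System.out.println("parallel copies = " + conf.getInt("mapred.reduce.parallel.copies", 5));
        System.out.println("http threads    = " + conf.getInt("tasktracker.http.threads", 40));
    }
}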
Re: Doubt from the book Definitive Guide
Answers inline. On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
> 1) The description of the mapper says that reducers get the output file using an HTTP call, but "The Reduce Side" doesn't specifically say whether it is copied using HTTP. Is the output copied from mapper -> reducer or from reducer -> mapper? And is the call http:// or hdfs://?
Map output is written to the local FS, not HDFS.
> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?
Map output is sent to HDFS only when no reducer is used.
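To make that last answer concrete, here is a minimal sketch using the new API (the class name is made up): with zero reduce tasks the job is map-only, so the map output files (the part-m-* files Mohit mentions) are written straight to the HDFS output directory instead of to the tasktrackers' local disks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-sketch");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(Mapper.class);   // the identity mapper is enough for the illustration
        job.setNumReduceTasks(0);           // no reduce phase: map output is written directly to HDFS
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // expect part-m-* files here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With one or more reduce tasks, the same map output would instead be partitioned, sorted, and spilled to local disk under mapred.local.dir, which is what the book excerpt describes.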
Re: how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
Serge, I specify 15 instances, but only 14 end up being data/task nodes; 1 instance is reserved as the name node (job tracker). On Wed, Apr 4, 2012 at 1:17 PM, Serge Blazhievsky serge.blazhiyevs...@nice.com wrote: How many datanodes do you use for your job?
Re: Doubt from the book Definitive Guide
Hi Mohit, On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
> 1) Is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://?
The flow is as simple as this: 1. For a map+reduce job, a map task completes after writing all of its partitions down into the tasktracker's local filesystem (under the mapred.local.dir directories). 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is done over the TaskTracker's HTTP service (port 50060). So to clear things up: the map doesn't send it to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections, which should be much faster and more reliable.
> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?
A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send from the mapper to the reducer goes to the local FS so multiple actions can be performed on it; other data may go directly to HDFS. Reducers currently are scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with or without replication), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). Doing that requires the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown. I hope this helps clear some things up for you. -- Harsh J
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
> However, if you are suggesting that map partitions ought to be written to HDFS itself (with or without replication), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). Doing that requires the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.
Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it, they could continue to work locally.
Re: Doubt from the book Definitive Guide
Hi Mohit, What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still have to read data from other datanodes across the cluster. Prashant
On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
> In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it, they could continue to work locally.