Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz

Regards to all on the list.
Many people use the Hadoop Tutorial released by Yahoo at 
http://developer.yahoo.com/hadoop/tutorial/ (for example, 
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining).
The main issue is that this tutorial is written against the old API 
(Hadoop 0.18, I think).
Is there a project to update this tutorial to the new API, for Hadoop 
1.0.2 or YARN (Hadoop 0.23)?
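
To be concrete about what I mean by old versus new API: the old API lives in 
org.apache.hadoop.mapred and the new one in org.apache.hadoop.mapreduce. A 
rough sketch of a mapper written against the new API (the class and field 
names here are just illustrative), roughly what an updated tutorial would 
target:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// minimal new-API (org.apache.hadoop.mapreduce) mapper, for illustration only
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit (token, 1) via the new Context object
            }
        }
    }
}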


Best wishes

--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com



10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSITY OF INFORMATICS 
SCIENCES...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Jagat Singh
Hello Marcos

 Yes, the Yahoo tutorials are pretty old, but they still explain the concepts of 
MapReduce and HDFS beautifully. The way the tutorials are divided into 
subsections, each building on the previous one, is awesome. I remember when I 
started I was dug into them for many days. The tutorials are lagging now from 
the new API point of view.

 Let's have a documentation session one day; I would love to volunteer to 
update those tutorials if the people at Yahoo take input from the outside world :)

 Regards,

 Jagat



Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz



On 04/04/2012 09:15 AM, Jagat Singh wrote:

Hello Marcos

Yes, the Yahoo tutorials are pretty old, but they still explain the 
concepts of MapReduce and HDFS beautifully. The way the tutorials 
are divided into subsections, each building on the previous one, is 
awesome. I remember when I started I was dug into them for many 
days. The tutorials are lagging now from the new API point of view.
Yes, and for that very reason, because it is so well written, this tutorial 
is read by many Hadoop newcomers, so I think it needs an update.


Let's have a documentation session one day; I would love to 
volunteer to update those tutorials if the people at Yahoo take input 
from the outside world :)
I want to help with this too, so we need to talk with our Hadoop colleagues 
about how to organize it.

Regards and best wishes


Regards,

Jagat










--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




opensuse 12.1

2012-04-04 Thread Barry, Sean F
What is the best way to install Hadoop on openSUSE 12.1 for a 
small two-node cluster?

-SB


FW: opensuse 12.1

2012-04-04 Thread Barry, Sean F


-Original Message-
From: Barry, Sean F [mailto:sean.f.ba...@intel.com] 
Sent: Wednesday, April 04, 2012 9:10 AM
To: common-user@hadoop.apache.org
Subject: opensuse 12.1

What is the best way to install Hadoop on openSUSE 12.1 for a 
small two-node cluster?

-SB


Re: opensuse 12.1

2012-04-04 Thread Raj Vishwanathan
Lots of people seem to start with this.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 


Raj








Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

2012-04-04 Thread Serge Blazhievsky
How many datanodes do you use for your job?

On 4/3/12 8:11 PM, Jane Wayne jane.wayne2...@gmail.com wrote:

I don't have the option of setting the map heap size to 2 GB, since my
real environment is AWS EMR and the constraints are set.

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html is
where I am currently reading about the meaning of io.sort.factor
and io.sort.mb.

It seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
shuffle/reduce side. Am I correct to say, then, that io.sort.factor is
not relevant here (yet, anyway), since I don't really make it to the
reduce phase (except for a very small data size)?

In that link above, the description of io.sort.mb is: "The
cumulative size of the serialization and accounting buffers storing
records emitted from the map, in megabytes." A paragraph above
the table says this value is simply the threshold that triggers a sort
and spill to disk. Furthermore, it says, "If either buffer fills
completely while the spill is in progress, the map thread will block,"
which is what I believe is happening in my case.

This sentence concerns me: "Minimizing the number of spills to disk
can decrease map time, but a larger buffer also decreases the memory
available to the mapper." To minimize the number of spills, you need a
larger buffer; however, this statement seems to suggest NOT
minimizing the number of spills: a) you will not decrease map time, b)
you will not decrease the memory available to the mapper. So, in your
advice below, you say to increase io.sort.mb, but I may actually want to
decrease it (if I understood the documentation correctly).

It seems these three map tuning parameters, io.sort.mb,
io.sort.record.percent, and io.sort.spill.percent, are a pain point
trading off between speed and memory. To me, if you set them high,
more serialized data + metadata are stored in memory before a spill
(an I/O operation), and you also get fewer merges (fewer I/O
operations?), but the negatives are blocked map operations and higher
memory requirements. If you set them low, there are more frequent
spills (more I/O operations), but lower memory requirements. It just
seems like no matter what you do, you are stuck: you may stall the
mapper if the values are high, because of the time required
to spill an enormous amount of data, or you may stall the mapper if the
values are low, because of the number of I/O operations required
(spill/merge).
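
Just to spell the tradeoff out for myself, here is a small back-of-the-envelope
sketch, assuming the 0.20 defaults (io.sort.mb=100, io.sort.spill.percent=0.80,
io.sort.record.percent=0.05) and the 16 bytes of accounting data per record
that the tutorial page describes; I have not verified these numbers on EMR.

public class SpillThresholdSketch {
    public static void main(String[] args) {
        // assumed 0.20-era defaults; change these to match your own job config
        int ioSortMb = 100;           // io.sort.mb: total map-side sort buffer, in MB
        double spillPercent = 0.80;   // io.sort.spill.percent
        double recordPercent = 0.05;  // io.sort.record.percent: share kept for record metadata

        double bufferBytes = ioSortMb * 1024.0 * 1024.0;
        double metadataBytes = bufferBytes * recordPercent;  // accounting buffer
        double dataBytes = bufferBytes - metadataBytes;      // serialized key/value bytes

        // a background spill starts once either buffer crosses its spill threshold
        System.out.printf("data spill threshold:     ~%.1f MB%n",
                dataBytes * spillPercent / (1024.0 * 1024.0));
        // 16 bytes of accounting information per emitted record
        System.out.printf("metadata spill threshold: ~%d records%n",
                (long) (metadataBytes * spillPercent / 16));
    }
}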

I must be misunderstanding something here, because everywhere I
read, Hadoop is supposed to be #1 at sorting. But here, in dealing
with the intermediate key-value pairs, in the process of sorting,
mappers can stall for any number of reasons.

Does anyone know of a competitive dynamic Hadoop clustering service
like AWS EMR? The reason I ask is that AWS EMR does not use
HDFS (it uses S3), and therefore data locality is not possible. Also,
I have read that the TCP protocol is not efficient for network transfers;
if the S3 node and task nodes are far apart, the distance will certainly
exacerbate the slow transfer speed. It seems there are a lot of
factors working against me.

Any help is appreciated.

On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Jane,
   From my first look, the properties that could help you are:
 - Increase io.sort.factor to 100
 - Increase io.sort.mb to 512 MB
 - Increase the map task heap size to 2 GB

 If the task still stalls, try providing less input to each mapper.
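
 For example, something like this on the JobConf (a rough sketch using the old
 0.20 mapred API; the exact values are only suggestions and the heap size may be
 capped in your environment):

 import org.apache.hadoop.mapred.JobConf;

 public class TuningSketch {
     public static JobConf tunedConf() {
         JobConf conf = new JobConf(TuningSketch.class);
         conf.setInt("io.sort.factor", 100);               // streams merged at once during sorts
         conf.setInt("io.sort.mb", 512);                   // map-side sort buffer, in MB
         conf.set("mapred.child.java.opts", "-Xmx2048m");  // per-task JVM heap
         return conf;
     }
 }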

 Regards
 Bejoy KS

 On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne jane.wayne2...@gmail.com
wrote:

  I have a map reduce job that is generating a lot of intermediate key-value
  pairs. For example, when I am 1/3 complete with my map phase, I may have
  generated over 130,000,000 output records (which is about 9 gigabytes).
  Getting to the 1/3 complete mark is very fast (less than 10 minutes), but at
  the 1/3 complete mark, it seems to stall. When I look at the counter logs,
  I do not see any logging of spilling yet; however, on the web job UI, I see
  that FILE_BYTES_WRITTEN and Spilled Records keep increasing. Needless to
  say, I have to dig deeper to see what is going on.

  My question is, how do I fine-tune my map reduce job with the above
  properties, namely, the property of generating a lot of intermediate
  key-value pairs? It seems the I/O operations are negatively impacting the
  job speed. There are so many map- and reduce-side tuning properties (see
  Tom White, Hadoop, 2nd edition, pp. 181-182) that I am a little unsure about
  just how to approach the tuning parameters. Since the slowdown is
  happening during the map phase/task, I assume I should narrow down on the
  map-side tuning properties.

  By the way, I am using the CPU-intensive c1.medium instances of Amazon Web
  Services' (AWS) Elastic MapReduce (EMR) on Hadoop v0.20. A compute node
  has 2 mappers, 1 reducer, and 384 MB JVM memory per task. This instance
  type is documented 

Re: opensuse 12.1

2012-04-04 Thread Marcos Ortiz
Since openSUSE is an RPM-based distribution, you can try the Apache 
Bigtop project [1]: look for its RPM packages and give them a try.
Note that the RPM spec files differ a little between openSUSE and Red 
Hat-based distributions, but it can be a starting point.

See the documentation for the project [2].

[1] http://incubator.apache.org/projects/bigtop.html
[2] 
https://cwiki.apache.org/confluence/display/BIGTOP/Index%3bjsessionid=AA31645DFDAE1F3282D0159DB9B6AE9A


Regards





--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Robert Evans
I am dropping the cross posts and leaving this on common-user with the others 
BCCed.

Marcos,

That is a great idea to be able to update the tutorial, especially if the 
community is interested in helping to do so.  We are looking into the best way 
to do this.  The idea right now is to donate this to the Hadoop project so that 
the community can keep it up to date, but we need some time to jump through all 
of the corporate hoops to get this to happen.  We have a lot going on right 
now, so if you don't see any progress on this please feel free to ping me and 
bug me about it.

--
Bobby Evans





Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Mark Kerzner
Hi,

Any interest in joining with this effort of mine?
http://hadoopilluminated.com/ - I am also doing it purely for community benefit.
I have more chapters that I am putting out, but I want to keep the fun,
informal style.

Thanks,
Mark





Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz
Ok, Robert, I will be waiting for you then. There are many folks that 
use this tutorial, so I think this is a good effort in favor of the Hadoop 
community. It would be nice
if Yahoo! donated this work, because I have some further ideas for it, for 
example releasing a Spanish version of the tutorial.

Regards and best wishes



--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Edward Capriolo
Nathan put together the steps on this blog:

http://blog.milford.io/2012/01/kicking-the-tires-on-hadoop-0-23-pseudo-distributed-mode/

which fills in the missing details, such as

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value></value>
    <description>the local directories used by the nodemanager</description>
  </property>

in the official docs.

http://hadoop.apache.org/common/docs/r0.23.1/hadoop-yarn/hadoop-yarn-site/SingleCluster.html







Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
I am going through the chapter How MapReduce Works and have some
confusion:

1) The description of the Mapper below says that reducers get the output file
using an HTTP call, but the description under The Reduce Side doesn't
specifically say it's copied using HTTP. So, first confusion: is the output
copied from mapper to reducer, or from reducer to mapper? And second, is the
call http:// or hdfs://?

2) My understanding was that mapper output gets written to HDFS, since I've
seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
shouldn't reducers simply read it from HDFS instead of making HTTP calls to
the tasktrackers' locations?



- from the book ---
Mapper
The output file’s partitions are made available to the reducers over HTTP.
The number of worker threads used to serve the file partitions is
controlled by the tasktracker.http.threads property; this setting is per 
tasktracker, not per map task slot. The default of 40 may need increasing 
for large clusters running large jobs.

The Reduce Side
Let’s turn now to the reduce part of the process. The map output file is
sitting on the local disk of the tasktracker that ran the map task
(note that although map outputs always get written to the local disk of the
map tasktracker, reduce outputs may not be), but now it is needed by the
tasktracker
that is about to run the reduce task for the partition. Furthermore, the
reduce task needs the map output for its particular partition from several
map tasks across the cluster.
The map tasks may finish at different times, so the reduce task starts
copying their outputs as soon as each completes. This is known as the copy
phase of the reduce task.
The reduce task has a small number of copier threads so that it can fetch
map outputs in parallel.
The default is five threads, but this number can be changed by setting the
mapred.reduce.parallel.copies property.
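
Just so I am reading the excerpt right, my rough understanding of where those
two knobs would be set is something like this (the values are arbitrary
examples, and tasktracker.http.threads normally lives in mapred-site.xml on
each tasktracker rather than in the job):

import org.apache.hadoop.conf.Configuration;

public class ShuffleKnobsSketch {
    public static Configuration shuffleConf() {
        Configuration conf = new Configuration();
        // per job: copier threads a reduce task uses to fetch map output in parallel
        conf.setInt("mapred.reduce.parallel.copies", 10);
        // per tasktracker (cluster-side setting, shown here only for illustration):
        // worker threads serving map output partitions over HTTP
        conf.setInt("tasktracker.http.threads", 80);
        return conf;
    }
}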


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I am going through the chapter How MapReduce Works and have some
 confusion:

 1) The description of the Mapper below says that reducers get the output file
 using an HTTP call, but the description under The Reduce Side doesn't
 specifically say it's copied using HTTP. So, first confusion: is the output
 copied from mapper to reducer, or from reducer to mapper? And second, is the
 call http:// or hdfs://?


Map output is written to local FS, not HDFS.


 2) My understanding was that mapper output gets written to HDFS, since I've
 seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
 shouldn't reducers simply read it from HDFS instead of making HTTP calls to
 the tasktrackers' locations?

 Map output is sent to HDFS when no reducer is used.





Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

2012-04-04 Thread Jane Wayne
Serge, I specify 15 instances, but only 14 end up being data/task
nodes; 1 instance is reserved as the name node (job tracker).


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Harsh J
Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I am going through the chapter How MapReduce Works and have some
 confusion:

 1) The description of the Mapper below says that reducers get the output file
 using an HTTP call, but the description under The Reduce Side doesn't
 specifically say it's copied using HTTP. So, first confusion: is the output
 copied from mapper to reducer, or from reducer to mapper? And second, is the
 call http:// or hdfs://?

The flow is simple as this:
1. For M+R job, map completes its task after writing all partitions
down into the tasktracker's local filesystem (under mapred.local.dir
directories).
2. Reducers fetch completion locations from events at JobTracker, and
query the TaskTracker there to provide it the specific partition it
needs, which is done over the TaskTracker's HTTP service (50060).

So to clear things up - map doesn't send it to reduce, nor does reduce
ask the actual map task. It is the task tracker itself that makes the
bridge here.

Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would
be over Netty connections. This would be much faster and more
reliable.

 2) My understanding was that mapper output gets written to HDFS, since I've
 seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
 shouldn't reducers simply read it from HDFS instead of making HTTP calls to
 the tasktrackers' locations?

A map-only job usually writes out to HDFS directly (no sorting is done,
because no reducer is involved). If the job is a map+reduce one, the
map output is collected on the local filesystem for partitioning and
sorting at the map end, and eventually grouping at the reduce end. Basically:
data you want to send from mapper to reducer goes to the local FS so
multiple actions can be performed on it; other data may go directly
to HDFS.
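
For instance, a quick sketch of a map-only job (new API, identity mapper; the
class name is just illustrative) whose part-m-* files land directly on HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// with zero reduces there is no partition/sort/shuffle step at all, and the
// map output is written straight to HDFS as part-m-* files
public class MapOnlySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-only sketch");  // Job.getInstance(conf) on 0.23+
        job.setJarByClass(MapOnlySketch.class);
        job.setMapperClass(Mapper.class);            // identity mapper, for illustration
        job.setNumReduceTasks(0);                    // no reduce phase => no shuffle
        job.setOutputKeyClass(LongWritable.class);   // identity mapper emits (offset, line)
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}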

Reducers currently are scheduled pretty randomly, but yes, their
scheduling can be improved for certain scenarios. However, if you are
suggesting that map partitions ought to be written to HDFS itself (with
replication or without), I don't see performance improving. Note that
the partitions aren't merely written but need to be sorted as well (at
either end). Doing that requires the ability to spill frequently (because
we don't have infinite memory to do it all in RAM), and doing such a
thing on HDFS would only mean a slowdown.

I hope this helps clear some things up for you.

-- 
Harsh J


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
Thanks for clearing up my doubts. In this case I was merely suggesting that
if the mapper output (the final merged output, or the shuffle output) were
stored in HDFS, then reducers could just retrieve it from HDFS instead of
asking the tasktracker for it. Once the reducer threads read it, they could
continue to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Hi Mohit,

What would be the advantage? Reducers in most cases read data from all
the mappers. In the case where mappers were to write to HDFS, a
reducer would still need to read data from other datanodes across
the cluster.

Prashant
