Re: correct pattern for using setOutputValueGroupingComparator?

2009-01-06 Thread Devaraj Das



On 1/6/09 9:47 AM, Meng Mao meng...@gmail.com wrote:

 Unfortunately, my team is on 0.15 :(. We are looking to upgrade to 0.18 as
 soon as we upgrade our hardware (long story).
 From comparing the 0.15 and 0.19 mapreduce tutorials, and looking at the
 4545 patch, I don't see anything that seems majorly different about the
 MapReduce API?
 - There's a Partitioner that's used, but that seems optional?
 - I see that 0.19 still provides setOutputValueGroupingComparator; is the
 setGroupingComparatorClass in the patch from the 0.20 API?
 
Yes, setGroupingComparatorClass is defined in the new MapReduce API and does
the same thing.
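
For reference, a minimal sketch of the two equivalent calls, one per API. The
grouping comparator below is only a placeholder (it groups Text keys on the
part before the first tab) and is not taken from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class GroupingComparatorSetup {
  // Placeholder comparator: groups Text keys by the portion before the first tab.
  public static class FirstFieldGroupingComparator extends WritableComparator {
    protected FirstFieldGroupingComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return a.toString().split("\t")[0].compareTo(b.toString().split("\t")[0]);
    }
  }

  public static void main(String[] args) throws Exception {
    // Classic API (pre-0.20): JobConf
    JobConf jobConf = new JobConf(GroupingComparatorSetup.class);
    jobConf.setOutputValueGroupingComparator(FirstFieldGroupingComparator.class);

    // New API (0.20+): org.apache.hadoop.mapreduce.Job
    Job job = new Job(new Configuration());
    job.setGroupingComparatorClass(FirstFieldGroupingComparator.class);
  }
}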

 I have an associated question -- is it possible to use this
 GroupingComparator technique to perform essentially a one-to-many mapping?
 Let's say I have records like so:
 id_1  -   metadata
 id_2  -   metadata
 id_1  A  numbers
 id_2  B  numbers
 id_1  C  numbers
 
 Would it be possible for a key/value pair (id_1, -) -> metadata to map
 to both the groups for the keys (id_1, A) and (id_1, C)?  The comparator
 part seems easy to achieve, but I don't see how multiple copies of a record
 would get sent to multiple groups.  I know it's a bit unusual, but it would
 be useful for us to have this kind of wildcard behavior.
 
No, that's not possible without changing your app to generate that many
records. For example, in your map you could output multiple records
corresponding to the wildcard records.
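
To make that concrete, here is a rough sketch of such a map (written against
the classic org.apache.hadoop.mapred API with the generic signatures of
~0.18/0.19; 0.15's Mapper interface differs slightly). The tab-separated
record layout and the fixed set of group tags are illustrative assumptions,
not details from the thread:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WildcardExpandMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Hypothetical: assumes the set of group tags is known up front.
  private static final String[] GROUPS = {"A", "B", "C"};

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t");  // id, tag, payload (assumed layout)
    String id = fields[0], tag = fields[1], payload = fields[2];
    if ("-".equals(tag)) {
      // Metadata record: emit one copy per group so every group sees it.
      for (String group : GROUPS) {
        output.collect(new Text(id + "\t" + group), new Text(payload));
      }
    } else {
      // Ordinary record: emit under its own group key.
      output.collect(new Text(id + "\t" + tag), new Text(payload));
    }
  }
}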
 
 Meng
 
 
 
 On Mon, Jan 5, 2009 at 6:58 PM, Owen O'Malley omal...@apache.org wrote:
 
 This is exactly what setOutputValueGroupingComparator is for. Take a
 look at HADOOP-4545 for an example using the secondary sort. If you are
 using trunk or 0.20, look at
 src/examples/org/apache/hadoop/examples/SecondarySort.java. The checked-in
 example uses the new map/reduce API that was introduced in 0.20.
 
 -- Owen
 




IOExceptions cause Map tasks to fail!

2009-01-06 Thread Andrew Sha

Hi,

I got the following IOException, which caused map tasks to fail when I ran
Nutch Fetcher2.


$ bin/nutch fetch2 segments/20090106154644

java.io.IOException: Task process exit with nonzero status of 134.
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

Can anyone tell me why?

Any help will be appreciated!


Thanks,

Andrew Sha


TestDFSIO delivers bad values of throughput and average IO rate

2009-01-06 Thread tienduc_dinh

Hello,

I'm now using hadoop-0.18.0 and testing it on a cluster with 1 master and 4
slaves. In hadoop-site.xml the value of mapred.map.tasks is 10. Because
the values of "throughput" and "average IO rate" are similar, I'll just post
the throughput values from running the same command 3 times:

-  hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048
-nrFiles 1

+ with dfs.replication = 1: 33,60 / 31,48 / 30,95

+ with dfs.replication = 2: 26,40 / 20,99 / 21,70

I found something strange while reading the source code.

- The value of mapred.reduce.tasks is always set to 1:
job.setNumReduceTasks(1) in the function runIOTest(), and reduceFile = new
Path(WRITE_DIR, "part-0") in analyzeResult().

So I think, if we had mapred.reduce.tasks = 2, we would get 2 output paths on
the file system, part-0 and part-1, e.g.
/benchmarks/TestDFSIO/io_write/part-0

- And I don't understand the line double med = rate / 1000 / tasks.
Shouldn't it be double med = rate * tasks / 1000 ?
-- 
View this message in context: 
http://www.nabble.com/TestDFSIO-delivers-bad-values-of-%22throughput%22-and-%22average-IO-rate%22-tp21312088p21312088.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Combiner run specification and questions

2009-01-06 Thread Saptarshi Guha
I agree with the requirement that the key does not change. Of course,
the values can change.
I am primarily worried that the combiner might not be run at all - I
have 'successfully' integrated Hadoop and R, i.e. the user can provide
map/reduce functions written in R. However, R is not great with memory
management, and if I have N (N is huge) values for a given key K, then
R will baulk when it comes to processing them.
Hence the combiner. The combiner will process n values for K, leaving
ultimately only a few values for K in the reducer. If the combiner were
not to run, R would collapse under the load.


1) I am guaranteed a reducer.
So,
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce.



This zero-case possibility worries me. However, you mention that it
occurs when the

collector spills in the map


I have noticed this happening - what does 'spilling' mean?

Thank you
Saptarshi


On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:

The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run when  
the collector spills in the map and in some merge cases. If the  
combiner transforms the key, it is illegal to change its type, the  
partition to which it is assigned, or its ordering.


For example, if you emit a record (k,v) from your map and (k',v)  
from the combiner, your comparator is C(K,K) and your partitioner  
function is P(K), it must be the case that P(k) == P(k') and C(k,k')  
== 0. If either of these does not hold, the semantics to the reduce  
are broken. Clearly, if k is not transformed (as is true for most
combiners), this holds trivially.
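
To make the contract concrete, here is a minimal sketch of a combiner that
satisfies it trivially (classic org.apache.hadoop.mapred API, word-count-style
sum; not from this thread). The key passes through untouched, so P(k) == P(k')
and C(k,k') == 0 hold by construction:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// A combiner is just a Reducer that runs on map-side spills and during merges.
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // The key is emitted unchanged, so partitioning and ordering are preserved.
    output.collect(key, new IntWritable(sum));
  }
}

It would be registered with jobConf.setCombinerClass(SumCombiner.class).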


As was mentioned earlier, the purpose of the combiner is to compress  
data pulled across the network and spilled to disk. It should not  
affect the correctness or, in most cases, the output of the job. -C


On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:


Hello,
I would just like to confirm: when does the Combiner run (since it
might not be run at all, see below)? I read somewhere that it is run
if there is at least one reduce (which in my case I can be sure of).
I also read that the combiner is an optimization. However, it is also
a chance for a function to transform the key/value (keeping the class
the same, i.e. the combiner semantics are not changed) and deal with a
smaller set (this could be done in the reducer, but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for the reducer to expect its
input to come from a combiner? E.g. if there are only 10 values
corresponding to a key (as output by the mapper), will these 10 values
go straight to the reducer, or to the reducer via the combiner?

Here I am assuming my reduce operation does not need all the values
for a key to work (so that a combiner can be used), i.e. additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org
wrote:
The Combiner may be called 0, 1, or many times on each key between the
mapper and reducer. Combiners are just an application-specific
optimization that compresses the intermediate output. They should not
have side effects or transform the types. Unfortunately, since there
isn't a separate interface for Combiners, there isn't a great place to
document this requirement.

I've just filed HADOOP-4668 to improve the documentation.




--
Saptarshi Guha - saptarshi.g...@gmail.com




Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
-- Nathaniel Howe



Re: Problem loading hadoop-site.xml - dumping parameters

2009-01-06 Thread g00dn3ss
OK, I figured out my problem.  As expected, this was my silly mistake.  We
have a java program that runs on a machine outside of our hadoop cluster
that references the hadoop jar files.  This machine submits jobs to our
cluster but doesn't actually have hadoop installed.  I was configuring the
mapred.child.java.opts in the hadoop-site.xml used by the cluster.  However,
the value that the tasks actually use comes from the JobConf of the submitted
job.  So I am now manually setting the options I want in our external
program that submits the jobs.  It seems to be working as expected now.  I
thought my old setup was working with the older version of hadoop, but it
may be that our jobs just recently started hitting the 200m heap boundary.
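
For anyone hitting the same thing, a minimal sketch of that approach: set the
option on the JobConf in the submitting program rather than relying on the
cluster's hadoop-site.xml. The class name, job name, and cluster addresses
below are placeholders, not values from this thread:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ExternalSubmitter {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ExternalSubmitter.class);
    conf.setJobName("external-submit-example");          // placeholder

    // Point at the remote cluster (placeholder addresses).
    conf.set("fs.default.name", "hdfs://namenode:9000");
    conf.set("mapred.job.tracker", "jobtracker:9001");

    // Task JVM options travel with the submitted JobConf, not with the
    // hadoop-site.xml sitting on the cluster nodes.
    conf.set("mapred.child.java.opts", "-Xmx1024m");

    // ... set input/output paths, mapper, reducer, etc., then submit:
    JobClient.runJob(conf);
  }
}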

Not sure if this is also your problem, Saptarshi.  But hopefully it will
lead you in the right direction.

g00dn3ss



On Mon, Jan 5, 2009 at 2:26 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:

 Hello,
 I have set my HADOOP_CONF_DIR to the conf folder and it is still not
 loading. I have to set the options manually when I create my conf.
 Have you resolved this?

 Regards
 Saptarshi

 On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote:
  Hey all,
 
  I have a similar issue.  I am specifically having problems with the
 config
  option mapred.child.java.opts.  I set it to -Xmx1024m and it uses
 -Xmx200m
  regardless.  I am running Hadoop 0.18.2 and I'm pretty sure this option
 was
  working in the previous versions of Hadoop I was using.
 
  I am not explicitly setting HADOOP_CONF_DIR.  My site config is in
  ${HADOOP_HOME}/conf.  Just to test things further, I wrote a small map
 task
  to print out the ENV values and it has the correct value for HADOOP_HOME,
  HADOOP_LOG_DIR, HADOOP_OPTS, etc...  I also printed out the key/values in
  the JobConf passed to the mapper and it has my specified values for
  fs.default.name and mapred.job.tracker.  Other settings like
 dfs.name.dir,
  dfs.data.dir, and mapred.child.java.opts do not have my values.
 
  Any suggestions on where to look next?
 
  Thanks!
 
 
 
  On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu 
  amar...@yahoo-inc.com wrote:
 
  Saptarshi Guha wrote:
 
  Hello,
  I had previously emailed regarding a heap size issue and have discovered
  that hadoop-site.xml is not loading completely, i.e.
    Configuration defaults = new Configuration();
    JobConf jobConf = new JobConf(defaults, XYZ.class);
    System.out.println("1:" + jobConf.get("mapred.child.java.opts"));
    System.out.println("2:" + jobConf.get("mapred.map.tasks"));
    System.out.println("3:" + jobConf.get("mapred.reduce.tasks"));
    System.out.println("3:" + jobConf.get("mapred.tasktracker.reduce.tasks.maximum"));
 
  returns -Xmx200m, 2,1,2 respectively, even though the numbers in the
  hadoop-site.xml are very different.
 
  Is there a way for hadoop to dump the parameters read in from
  hadoop-site.xml and hadoop-default.xml?
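
Not an answer from the thread, but one way to dump everything a Configuration
has loaded is sketched below. It assumes a Hadoop version whose Configuration
implements Iterable over its entries (true in recent releases; older versions
may need a different approach):

import java.util.Map;
import org.apache.hadoop.mapred.JobConf;

public class DumpConf {
  public static void main(String[] args) {
    JobConf jobConf = new JobConf(DumpConf.class);
    // Prints every key/value pair loaded from hadoop-default.xml and
    // hadoop-site.xml, plus anything set programmatically.
    for (Map.Entry<String, String> entry : jobConf) {
      System.out.println(entry.getKey() + " = " + entry.getValue());
    }
  }
}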
 
 
 
  Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory?
 
 http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration
 
  -Amareshwari
 
 



 --
 Saptarshi Guha - saptarshi.g...@gmail.com



Re: I can run hadoop,

2009-01-06 Thread tienduc_dinh

I think you should start the services. Try start-dfs.sh and start-mapred.sh.


周辉 wrote:
 
  Hi,
    I want to run hadoop, but there is an error. Can you help me?
    The log is:
 
   2009-01-03 14:10:52,109 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = f2/192.168.1.102
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.19.0
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r
 713890;
 compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
 /
 2009-01-03 14:10:54,963 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 0 time(s).
 2009-01-03 14:10:55,965 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 1 time(s).
 2009-01-03 14:10:56,969 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 2 time(s).
 2009-01-03 14:10:57,972 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 3 time(s).
 2009-01-03 14:10:58,974 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 4 time(s).
 2009-01-03 14:10:59,976 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 5 time(s).
 2009-01-03 14:11:00,978 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 6 time(s).
 2009-01-03 14:11:01,981 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 7 time(s).
 2009-01-03 14:11:02,986 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 8 time(s).
 2009-01-03 14:11:04,002 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: /192.168.1.55:9000. Already tried 9 time(s).
 2009-01-03 14:11:04,061 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call
 to /192.168.1.55:9000 failed on local exception: No route to host
 at org.apache.hadoop.ipc.Client.call(Client.java:699)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
 at $Proxy4.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:205)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284)
 Caused by: java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
 at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
 at
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299)
 at
 org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:772)
 at org.apache.hadoop.ipc.Client.call(Client.java:685)
 ... 12 more
 
 2009-01-03 14:11:04,072 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down DataNode at f2/192.168.1.102
 /
 
 

-- 
View this message in context: 
http://www.nabble.com/I-can-run-hadoop%2C-tp21263335p21320091.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



RE: Concatenating PDF files

2009-01-06 Thread Zak, Richard [USA]
Thank you very much Tom, that seems to have done the trick!

conf.set("mapred.child.java.opts", "-Xmx7000m");
conf.setNumReduceTasks(0);

And I was able to churn through 4 directories each with 100 PDFs.  And yes,
from ps I could see that the processes were using the -Xmx7000m option.


Richard J. Zak

-Original Message-
From: Tom White [mailto:t...@cloudera.com] 
Sent: Monday, January 05, 2009 06:47
To: core-user@hadoop.apache.org
Subject: Re: Concatenating PDF files

Hi Richard,

Are you running out of memory after many PDFs have been processed by one
mapper, or during the first? The former would suggest that memory isn't
being released; the latter that the task VM doesn't have enough memory to
start with.

Are you setting the memory available to map tasks by setting
mapred.child.java.opts? You can try to see how much memory the processes are
using by logging into a machine when the job is running and running 'top' or
'ps'.

It won't help the memory problems, but it sounds like you could run with
zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2 XL
instances can run more than two tasks per node (they have 4 virtual cores,
see http://aws.amazon.com/ec2/instance-types/). And you should configure
them to take advantage of multiple disks -
https://issues.apache.org/jira/browse/HADOOP-4745.

Tom

On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] zak_rich...@bah.com
wrote:
 All, I have a project that I am working on involving PDF files in HDFS.
 There are X number of directories and each directory contains Y number 
 of PDFs, and per directory all the PDFs are to be concatenated.  At 
 the moment I am running a test with 5 directories and 15 PDFs in each 
 directory.  I am also using iText to handle the PDFs, and I wrote a 
 wrapper class to take PDFs and add them to an internal PDF that grows. 
 I am running this on Amazon's EC2 using Extra Large instances, which 
 have a total of 15 GB RAM.  Each Java process, two per Instance, has 
 7GB maximum (-Xmx7000m).  There is one Master Instance and 4 Slave 
 instances.  I am able to confirm that the Slave processes are 
 connected to the Master and have been working.  I am using Hadoop 0.19.0.

 The problem is that I run out of memory when the concatenation class 
 reads in a PDF.  I have tried both the iText library version 2.1.4 and 
 the Faceless PDF library, and both have the error in the middle of 
 concatenating the documents.  I looked into Multivalent, but that one 
 just uses Strings to determine paths and it opens the files directly, 
 while I am using a wrapper class to interact with items in HDFS, so 
 Multivalent is out.

 Since the PDFs aren't enormous (17 MB or less) and each Instance has
 tons of memory, why am I running out of memory?

 The mapper works like this.  It gets a text file with a list of 
 directories, and per directory it reads in the contents and adds them 
 to the concatenation class.  The reducer pretty much does nothing.  Is 
 this the best way to do this, or is there a better way?

 Thank you!

 Richard J. Zak




Re: Combiner run specification and questions

2009-01-06 Thread Chris Douglas
I agree with the requirement that the key does not change. Of  
course, the values can change.


Yes. Key attributes not relevant to ordering are also mutable. If your  
key is a (t1, t2) tuple ordered by t1, then t2 can change without  
affecting the expected semantics.


I am primarily worried that the combiner might not be run at all - I  
have 'successfully' integrated Hadoop and R i.e the user can provide  
map/reduce functions written in R. However, R is not great with  
memory management, and if I have N (N is huge) values for a given  
key K, then R will baulk when it comes to processing this.


That sounds really cool. The cases where the combiner is not run are  
currently obscure and future exceptions are unlikely to affect this  
particular case. For example, it doesn't make sense to run the  
combiner for a single record (i.e. there is nothing to combine), for a  
partition with no common keys, etc. So as long as the correctness of  
the computation doesn't rely on a transformation performed in the  
combiner, it should be OK. In general, it shouldn't be run - and in the
future may not be run - where there is little or no compression for it
to effect, which doesn't make this case any worse.


However, this restriction limits the scalability of your solution. It  
might be necessary to work around R's limitations by breaking up large  
computations into intermediate steps, possibly by explicitly  
instantiating and running the combiner in the reduce.
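
A rough sketch of that last idea: pre-aggregate inside the reduce in bounded
chunks so that R (or any memory-limited runtime) never sees all N values at
once. The chunk size and the combine() helper are purely illustrative and not
part of any Hadoop API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ChunkingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private static final int CHUNK = 1000;   // illustrative batch size

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<String> buffer = new ArrayList<String>(CHUNK);
    String partial = null;
    while (values.hasNext()) {
      buffer.add(values.next().toString());
      if (buffer.size() == CHUNK) {
        partial = combine(partial, buffer);   // hand the runtime a bounded batch
        buffer.clear();
      }
    }
    if (!buffer.isEmpty()) {
      partial = combine(partial, buffer);
    }
    output.collect(key, new Text(partial == null ? "" : partial));
  }

  // Hypothetical helper: in a real job this would call the user-supplied
  // (associative) combine function, e.g. the R code, on a bounded batch.
  private String combine(String partial, List<String> batch) {
    long count = partial == null ? 0 : Long.parseLong(partial);
    count += batch.size();                   // placeholder aggregation: count values
    return Long.toString(count);
  }
}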



1) I am guaranteed a reducer.
So,
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce.



This zero-case possibility worries me. However, you mention that it
occurs when the

collector spills in the map


I have noticed this happening - what does 'spilling' mean?


Records emitted from the map are serialized into a buffer, which is  
periodically written to disk when it is (sufficiently) full. Each of  
these batch writes is a spill. In casual usage, it refers to any  
time when records need to be written to disk. The merge of  
intermediate files into the final map output and merging in-memory  
segments to disk in the reduce are two examples. -C
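
For what it's worth, the map-side buffer that triggers those spills is tunable.
A minimal sketch of the knobs (property names as in 0.18/0.19; the values shown
are the stock defaults, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SpillTuning.class);

    // Size in MB of the in-memory buffer that map output is serialized into.
    conf.set("io.sort.mb", "100");

    // Fraction of that buffer that may fill before a background spill to
    // disk (and hence a combiner run, if one is set) is started.
    conf.set("io.sort.spill.percent", "0.80");
  }
}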



Thank you
Saptarshi


On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:

The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run  
when the collector spills in the map and in some merge cases. If  
the combiner transforms the key, it is illegal to change its type,  
the partition to which it is assigned, or its ordering.


For example, if you emit a record (k,v) from your map and (k',v)  
from the combiner, your comparator is C(K,K) and your partitioner  
function is P(K), it must be the case that P(k) == P(k') and  
C(k,k') == 0. If either of these does not hold, the semantics to  
the reduce are broken. Clearly, if k is not transformed (as is true
for most combiners), this holds trivially.


As was mentioned earlier, the purpose of the combiner is to  
compress data pulled across the network and spilled to disk. It  
should not affect the correctness or, in most cases, the output of  
the job. -C


On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:


Hello,
I would just like to confirm: when does the Combiner run (since it
might not be run at all, see below)? I read somewhere that it is run
if there is at least one reduce (which in my case I can be sure of).
I also read that the combiner is an optimization. However, it is also
a chance for a function to transform the key/value (keeping the class
the same, i.e. the combiner semantics are not changed) and deal with a
smaller set (this could be done in the reducer, but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for the reducer to expect its
input to come from a combiner? E.g. if there are only 10 values
corresponding to a key (as output by the mapper), will these 10 values
go straight to the reducer, or to the reducer via the combiner?

Here I am assuming my reduce operation does not need all the values
for a key to work (so that a combiner can be used), i.e. additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley  
omal...@apache.org wrote:
The Combiner may be called 0, 1, or many times on each key between the
mapper and reducer. Combiners are just an application-specific
optimization that compresses the intermediate output. They should not
have side effects or transform the types. Unfortunately, since there
isn't a separate interface for Combiners, there isn't a great place to
document this requirement.

I've just filed HADOOP-4668 to improve the documentation.




--
Saptarshi Guha - saptarshi.g...@gmail.com




Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
-- Nathaniel Howe





Re: TestDFSIO delivers bad values of throughput and average IO rate

2009-01-06 Thread Konstantin Shvachko

Hi tienduc_dinh,

Just a bit of a background, which should help to answer your questions.
TestDFSIO mappers perform one operation (read or write) each, measure
the time taken by the operation and output the following three values:
(I am intentionally omitting some other output stuff.)
- size(i)
- time(i)
- rate(i) = size(i) / time(i)
where i is the index of the map task, 0 <= i < N, and N is the -nrFiles
value, which equals the number of maps.

Then the reduce sums those values and writes them into part-0.
That is, you get three fields in it:
size = size(0) + ... + size(N-1)
time = time(0) + ... + time(N-1)
rate = rate(0) + ... + rate(N-1)

Then we calculate
throughput = size / time
averageIORate = rate / N

So answering your questions
- There should be only one reduce task, otherwise you will have to
manually sum corresponding values in part-0 and part-1.
- The value of the :rate field after the reduce equals the sum of the
individual rates of each operation. So if you want an average you should
divide it by the number of tasks rather than multiply.
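
To make the arithmetic concrete, a small sketch (not the actual TestDFSIO
source; the numbers are made up) of how the summed fields in part-0 turn into
the two reported metrics:

public class DfsIoMetricsSketch {
  public static void main(String[] args) {
    int nrFiles = 1;          // -nrFiles, i.e. the number of map tasks N
    double sizeMB = 2048.0;   // size = size(0) + ... + size(N-1)
    double timeSec = 61.0;    // time = time(0) + ... + time(N-1)
    double rateSum = 33.6;    // rate = rate(0) + ... + rate(N-1)

    double throughput = sizeMB / timeSec;       // size / time
    double averageIORate = rateSum / nrFiles;   // rate / N (divide, not multiply)

    System.out.printf("throughput      = %.2f MB/s%n", throughput);
    System.out.printf("average IO rate = %.2f MB/s%n", averageIORate);
  }
}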

Now, in your case you create only one file -nrFiles 1, which means
you run only one map task.
Setting mapred.map.tasks to 10 in hadoop-site.xml defines the default
number of tasks per job. See here
http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.map.tasks
In case of TestDFSIO it will be overridden by -nrFiles.

Hope this answers your questions.
Thanks,
--Konstantin



tienduc_dinh wrote:

Hello,

I'm now using hadoop-0.18.0 and testing it on a cluster with 1 master and 4
slaves. In hadoop-site.xml the value of mapred.map.tasks is 10. Because
the values of "throughput" and "average IO rate" are similar, I'll just post
the throughput values from running the same command 3 times:

-  hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048
-nrFiles 1

+ with dfs.replication = 1: 33,60 / 31,48 / 30,95

+ with dfs.replication = 2: 26,40 / 20,99 / 21,70

I found something strange while reading the source code.

- The value of mapred.reduce.tasks is always set to 1:
job.setNumReduceTasks(1) in the function runIOTest(), and reduceFile = new
Path(WRITE_DIR, "part-0") in analyzeResult().

So I think, if we had mapred.reduce.tasks = 2, we would get 2 output paths on
the file system, part-0 and part-1, e.g.
/benchmarks/TestDFSIO/io_write/part-0

- And I don't understand the line double med = rate / 1000 / tasks.
Shouldn't it be double med = rate * tasks / 1000 ?


Re: Unusual Failure of jobs

2009-01-06 Thread Karl Anderson


On 22-Dec-08, at 10:24 AM, Nathan Marz wrote:

I have been experiencing some unusual behavior from Hadoop recently.  
When trying to run a job, some of the tasks fail with:


java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)


Not all the tasks fail, but enough tasks fail such that the job  
fails. Unfortunately, there are no further logs for these tasks.  
Trying to retrieve the logs produces:


HTTP ERROR: 410

Failed to retrieve stdout log for task:  
attempt_200811101232_0218_m_01_0


RequestURI=/tasklog


It seems like the tasktracker isn't able to even start the tasks on  
those machines. Has anyone seen anything like this before?


I see this on jobs that also get the "too many open files" task
errors, or on subsequent jobs.  I've always assumed that it's another
manifestation of the same problem.  Once I start getting these errors,  
I keep getting them until I shut down the cluster, although I don't  
always get enough to cause a job to fail.  I haven't bothered  
restarting individual boxes or services.


I haven't been able to reproduce it consistently, but it seems to  
happen when I have many small input files; a job with one large input  
file broke after I split the input up.  I'm using Streaming.


Karl Anderson
k...@monkey.org
http://monkey.org/~kra