Re: Measuring Shuffle time for MR job

2012-08-27 Thread Raj Vishwanathan
You can extract the shuffle time from the job log.

Take a look at 

https://github.com/rajvish/hadoop-summary 
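
If you want to pull the numbers out by hand, the per-attempt timestamps are in the job history file. A rough sketch, assuming the 0.20-era history format (verify the field names against your own history file, e.g. under $HADOOP_LOG_DIR/history or the job output's _logs/history directory): each finished reduce attempt is logged with SHUFFLE_FINISHED, SORT_FINISHED and FINISH_TIME timestamps in milliseconds, and its START_TIME is on the attempt's launch line, so the shuffle, sort and reduce phase durations fall out by subtraction.

  # job_201208270001_0001 is a placeholder job id; substitute your own
  grep -o 'TASK_ATTEMPT_ID="[^"]*"\|START_TIME="[0-9]*"\|SHUFFLE_FINISHED="[0-9]*"\|SORT_FINISHED="[0-9]*"\|FINISH_TIME="[0-9]*"' \
      $HADOOP_LOG_DIR/history/job_201208270001_0001_*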


Raj




 From: Bertrand Dechoux decho...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Monday, August 27, 2012 12:57 AM
Subject: Re: Measuring Shuffle time for MR job
 
Shuffle time is considered part of the reduce step. Without a reduce phase,
there is no need for shuffling.
One way to measure it would be to use the full reduce time with a
'/dev/null' reducer.

I am not aware of any built-in way to measure it directly.

Regards

Bertrand

On Mon, Aug 27, 2012 at 8:18 AM, praveenesh kumar praveen...@gmail.comwrote:

 Is there a way to know the total shuffle time of a map-reduce job - I mean
 some command or output that can tell me that?

 I want to measure total map, total shuffle and total reduce time for my MR
 job -- how can I achieve this? I am using Hadoop 0.20.205.


 Regards,
 Praveenesh




-- 
Bertrand Dechoux




Re: doubt about reduce tasks and block writes

2012-08-26 Thread Raj Vishwanathan
Harsh

I did leave an escape route open with a bit about corner cases :-)

Anyway, I agree that HDFS has no notion of block 0. I just meant that if
dfs.replication is 1 then, under normal circumstances :-), no blocks
of the output file will be written to node A.

Raj


- Original Message -
 From: Harsh J ha...@cloudera.com
 To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com
 Cc: 
 Sent: Saturday, August 25, 2012 4:02 AM
 Subject: Re: doubt about reduce tasks and block writes
 
 Raj's almost right. In times of high load or space fillup on a local
 DN, the NameNode may decide to instead pick a non-local DN for
 replica-writing. In this way, the Node A may get a copy 0 of a
 replica from a task. This is per the default block placement policy.
 
 P.s. Note that HDFS hardly makes any distinction between replicas,
 hence there is no hard concept of a copy 0 or copy 1
 block; at the
 NN level, it treats all DNs in the pipeline equally, and the same goes for replicas.
 
 On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan rajv...@yahoo.com 
 wrote:
  But since node A has no TT running, it will not run map or reduce tasks. 
 When the reducer node writes the output file, the first block will be written 
 on 
 the local node and never on node A.
 
  So, to answer the question, Node A will contain copies of blocks of all 
 output files. It won't contain the copy 0 of any output file.
 
 
  I am reasonably sure about this , but there could be corner cases in case 
 of node failure and such like! I need to look into the code.
 
 
  Raj
 
  From: Marc Sturlese marc.sturl...@gmail.com
 To: hadoop-u...@lucene.apache.org
 Sent: Friday, August 24, 2012 1:09 PM
 Subject: doubt about reduce tasks and block writes
 
 Hey there,
 I have a doubt about reduce tasks and block writes. Does a reduce task
 always
 first write to HDFS on the node where it is placed? (and then these
 blocks would be replicated to other nodes)
 In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and 
 one
 (node A) just run DN, when running MR jobs, map tasks would never read 
 from
 node A? This would be because maps have data locality and if the reduce
 tasks write first to the node where they live, one replica of the block
 would always be in a node that has a TT. Node A would just contain 
 blocks
 created from replication by the framework as no reduce task would write
 there directly. Is this correct?
 Thanks in advance
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
 
 
 
 
 
 
 -- 
 Harsh J



Re: doubt about reduce tasks and block writes

2012-08-24 Thread Raj Vishwanathan
But since node A has no TT running, it will not run map or reduce tasks. When
the reducer node writes the output file, the first block will be written on the
local node and never on node A.

So, to answer the question, Node A will contain copies of blocks of all output
files. It won't contain the copy 0 of any output file.


I am reasonably sure about this, but there could be corner cases in case of
node failure and such like! I need to look into the code.


Raj

 From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org 
Sent: Friday, August 24, 2012 1:09 PM
Subject: doubt about reduce tasks and block writes
 
Hey there,
I have a doubt about reduce tasks and block writes. Does a reduce task always
first write to HDFS on the node where it is placed? (and then these
blocks would be replicated to other nodes)
In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and one
(node A) just run DN, when running MR jobs, map tasks would never read from
node A? This would be because maps have data locality and if the reduce
tasks write first to the node where they live, one replica of the block
would always be in a node that has a TT. Node A would just contain blocks
created from replication by the framework as no reduce task would write
there directly. Is this correct?
Thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.





Re: Number of Maps running more than expected

2012-08-16 Thread Raj Vishwanathan
You probably have speculative execution on. Extra map and reduce tasks are run
in case some of them are slow or fail.
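
A quick way to test that theory, as a sketch (property names are the 0.20-era MR1 ones, the jar and driver names are placeholders, and the -D flags assume the driver uses ToolRunner): rerun the job with speculation off and compare the map count.

  hadoop jar wordcount.jar WordCount \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.reduce.tasks.speculative.execution=false \
    /input /output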

Raj


Sent from my iPad
Please excuse the typos. 

On Aug 16, 2012, at 11:36 AM, in.abdul in.ab...@gmail.com wrote:

 Hi Gaurav,
   The number of maps does not depend on the number of blocks. It really depends on
 the number of input splits. If you have 100GB of data and it yields 10 input splits,
 then you will see only 10 maps.
 
 Please correct me if I am wrong.
 
 Thanks and regards,
 Syed abdul kather
 On Aug 16, 2012 7:44 PM, Gaurav Dasgupta [via Lucene] 
 ml-node+s472066n4001631...@n3.nabble.com wrote:
 
 Hi users,
 
 I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all
 the 12 nodes and 1 node running the Job Tracker).
 In order to perform a WordCount benchmark test, I did the following:
 
   - Executed RandomTextWriter first to create 100 GB of data (note that I
   have changed only the test.randomtextwrite.total_bytes parameter; the rest
   are kept at their defaults).
   - Next, executed the WordCount program for that 100 GB dataset.
 
 The Block Size in hdfs-site.xml is set as 128 MB. Now, according to my
 calculation, total number of Maps to be executed by the wordcount job
 should be 100 GB / 128 MB or 102400 MB / 128 MB = 800.
 But when I am executing the job, it is running a total number of 900 Maps,
 i.e., 100 extra.
 So, why this extra number of Maps? Although, my job is completing
 successfully without any error.
 
  Again, if I don't execute the RandomTextWriter job to create data for
  my wordcount, but rather put my own 100 GB text file in HDFS and run
  WordCount, I then see that the number of Maps is equivalent to my
  calculation, i.e., 800.
 
 Can anyone tell me why this odd behaviour of Hadoop regarding the number
 of Maps for WordCount only when the dataset is generated by
 RandomTextWriter? And what is the purpose of these extra number of Maps?
 
 Regards,
 Gaurav Dasgupta
 
 
 
 
 
 
 
 -
 THANKS AND REGARDS,
 SYED ABDUL KATHER
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Number-of-Maps-running-more-than-expected-tp4001631p4001683.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Multinode cluster only recognizes 1 node

2012-08-01 Thread Raj Vishwanathan
Sean 

Can you paste the name node and jobtracker logs.

It could be something as simple as disabling the firewall.
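
A couple of quick checks, as a sketch: make sure no firewall is blocking the slave, and confirm that the slave's daemons have actually registered with the master.

  sudo iptables -L -n                 # or: sudo service iptables status
  hadoop dfsadmin -report             # should list both datanodes
  hadoop job -list-active-trackers    # should list both tasktrackers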

Raj




 From: Barry, Sean F sean.f.ba...@intel.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Sent: Tuesday, July 31, 2012 1:23 PM
Subject: RE: Multinode cluster only recognizes 1 node
 
After a good number of hours trying to tinker with my setup, I am still
only seeing results for 1 node, not 2.

This is the tutorial I followed:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

After executing ./start-all.sh

starting namenode, logging to 
/usr/local/hadoop/logs/hadoop-hduser-namenode-master.out
slave: starting datanode, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-slave.out
master: starting datanode, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-master.out
master: starting secondarynamenode, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-master.out
starting jobtracker, logging to 
/usr/local/hadoop/logs/hadoop-hduser-jobtracker-master.out
slave: starting tasktracker, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-slave.out
master: starting tasktracker, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-master.out





-Original Message-
From: syed kather [mailto:in.ab...@gmail.com] 
Sent: Thursday, July 26, 2012 6:06 PM
To: common-user@hadoop.apache.org
Subject: Re: Multinode cluster only recognizes 1 node

Can you paste the output when you execute start-all.sh in the terminal?
When you ssh to the slave, is it working fine?
On Jul 27, 2012 4:50 AM, Barry, Sean F sean.f.ba...@intel.com wrote:

 Hi,

 I just set up a 2 node POC cluster and I am currently having an issue 
 with it. I ran a wordcount MR test on my cluster to see if it was 
 working and noticed that the Web ui at localhost:50030 showed that I 
 only have 1 live node. I followed the tutorial step by step and I 
 cannot seem to figure out my problem. When I ran start-all.sh all of 
 the daemons on my master node and my slave node start up perfectly 
 fine. If you have any suggestions please let me know.

 -Sean





Re: Merge Reducers Output

2012-07-31 Thread Raj Vishwanathan
Is there a requirement for the final reduce file to be sorted? If not, wouldn't
a map-only job (plus a combiner) and a merge-only job provide the answer?
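
A minimal sketch of the merge half of that idea (myjob.jar and MyDriver are placeholders, and the -D flag assumes the driver uses ToolRunner): run the job with zero reduces so the map output goes straight to HDFS, then merge the part files on the way out of HDFS.

  hadoop jar myjob.jar MyDriver -D mapred.reduce.tasks=0 /input /job-output
  hadoop fs -getmerge /job-output /local/path/merged-output.txt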

Raj




 From: Michael Segel michael_se...@hotmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, July 31, 2012 5:24 AM
Subject: Re: Merge Reducers Output
 
You really don't want to run a single reducer unless you know that you don't 
have a lot of mappers. 

As long as the output data types and structure are the same as the input, you 
can run your code as the combiner, and then run it again as the reducer. 
Problem solved with one or two lines of code. 
If your input and output don't match, then you can use the existing code as a 
combiner, and then write a new reducer. It could as easily be an identity 
reducer too. (Don't know the exact problem.) 

So here's a silly question. Why wouldn't you want to run a combiner? 


On Jul 31, 2012, at 12:08 AM, Jay Vyas jayunit...@gmail.com wrote:

  It's not clear to me that you need custom input formats
 
 1) Getmerge might work or
 
 2) Simply run a SINGLE reducer job (have mappers output static final int
 key=1, or specify numReducers=1).
 
 In this case, only one reducer will be called, and it will read through all
 the values.
 
 On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS bejoy.had...@gmail.com wrote:
 
 Hi
 
  Why not use 'hadoop fs -getmerge outputFolderInHdfs
  targetFileNameInLfs' while copying files out of hdfs for the end users to
 consume. This will merge all the files in 'outputFolderInHdfs'  into one
 file and put it in lfs.
 
 Regards
 Bejoy KS
 
 Sent from handheld, please excuse typos.
 
 -Original Message-
 From: Michael Segel michael_se...@hotmail.com
 Date: Mon, 30 Jul 2012 21:08:22
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: Merge Reducers Output
 
 Why not use a combiner?
 
 On Jul 30, 2012, at 7:59 PM, Mike S wrote:
 
  Like I have asked several times, I need to merge my reducers' output files.
  Imagine I have many reducers which will generate 200 files. To
  merge them together, I have written another map reduce job where each
  mapper reads a complete file into memory and outputs it, and
  then a single reducer merges them together. To do so, I had to
  write a custom FileInputFormat reader that reads the complete file into
  memory and another custom FileOutputFormat to append each
  reducer item's bytes together. This is how my mapper and reducer look:
 
  // Identity map: passes each whole-file BytesWritable through with a NullWritable key.
  public static class MapClass extends Mapper<NullWritable,
  BytesWritable, NullWritable, BytesWritable>
       {
               @Override
               public void map(NullWritable key, BytesWritable value,
  Context context) throws IOException, InterruptedException
               {
                       context.write(key, value);
               }
       }

       // Single reducer: appends every file's bytes into one output.
       public static class Reduce extends Reducer<NullWritable,
  BytesWritable, NullWritable, BytesWritable>
       {
               @Override
               public void reduce(NullWritable key,
  Iterable<BytesWritable> values,
  Context context) throws IOException, InterruptedException
               {
                       for (BytesWritable value : values)
                       {
                               context.write(NullWritable.get(), value);
                       }
               }
       }
 
  I still have to have one reducer and that is a bottleneck. Please
  note that I must do this merging because the users of my MR job are outside
  my hadoop environment and need the result as one file.
 
 Is there better way to merge reducers output files?
 
 
 
 
 
 -- 
 Jay Vyas
 MMSB/UCHC





Re: Datanode error

2012-07-20 Thread Raj Vishwanathan
Could also be due to network issues. The number of available sockets could be too low, or the number
of threads could be too low.

Raj




 From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org 
Sent: Friday, July 20, 2012 9:06 AM
Subject: Re: Datanode error
 
Pablo,

These all seem to be timeouts from clients when they wish to read a
block and drops from clients when they try to write a block. I
wouldn't think of them as critical errors. Aside of being worried that
a DN is logging these, are you noticing any usability issue in your
cluster? If not, I'd simply blame this on stuff like speculative
tasks, region servers, general HDFS client misbehavior, etc.

Please do also share if you're seeing an issue that you think is
related to these log messages.

On Fri, Jul 20, 2012 at 6:37 PM, Pablo Musa pa...@psafe.com wrote:
 Hey guys,
 I have a cluster with 11 nodes (1 NN and 10 DNs) which is running and 
 working.
 However my datanodes keep having the same errors, over and over.

 I googled the problems and tried different flags (ex: 
 -XX:MaxDirectMemorySize=2G)
 and different configs (xceivers=8192) but could not solve it.

 Does anyone know what is the problem and how can I solve it? (the stacktrace 
 is at the end)

 I am running:
 Java 1.7
 Hadoop 0.20.2
 Hbase 0.90.6
 Zoo 3.3.5

 % top - shows low load average (6% most of the time up to 60%), already 
 considering the number of cpus
 % vmstat - shows no swap at all
 % sar - shows 75% idle cpu in the worst case

 Hope you guys can help me.
 Thanks in advance,
 Pablo

 2012-07-20 00:03:44,455 INFO 
 org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
 /DN01:50010, dest:
 /DN01:43516, bytes: 396288, op: HDFS_READ, cliID: 
 DFSClient_hb_rs_DN01,60020,1342734302945_1342734303427, offset: 54956544, 
 srvID: DS-798921853-DN01-50010-1328651609047, blockid: 
 blk_914960691839012728_14061688, duration:
 480061254006
 2012-07-20 00:03:44,455 WARN 
 org.apache.hadoop.hdfs.server.datanode.DataNode: 
 DatanodeRegistration(DN01:50010, 
 storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, 
 ipcPort=50020):Got exception while serving blk_914960691839012728_14061688 
 to /DN01:
 java.net.SocketTimeoutException: 480000 millis timeout while waiting for 
 channel to be ready for write. ch : 
 java.nio.channels.SocketChannel[connected local=/DN01:50010 
 remote=/DN01:43516]
         at 
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
         at 
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
         at 
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
         at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
         at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)

 2012-07-20 00:03:44,455 ERROR 
 org.apache.hadoop.hdfs.server.datanode.DataNode: 
 DatanodeRegistration(DN01:50010, 
 storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, 
 ipcPort=50020):DataXceiver
 java.net.SocketTimeoutException: 480000 millis timeout while waiting for 
 channel to be ready for write. ch : 
 java.nio.channels.SocketChannel[connected local=/DN01:50010 
 remote=/DN01:43516]
         at 
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
         at 
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
         at 
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
         at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
         at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)

 2012-07-20 00:12:11,949 INFO 
 org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification 
 succeeded for blk_4602445008578088178_5707787
 2012-07-20 00:12:11,962 INFO 
 org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock 
 blk_-8916344806514717841_14081066 received exception 
 java.net.SocketTimeoutException: 63000 millis timeout while waiting for 
 channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
 local=/DN01:36634 remote=/DN03:50010]
 2012-07-20 00:12:11,962 ERROR 
 org.apache.hadoop.hdfs.server.datanode.DataNode: 
 DatanodeRegistration(DN01:50010, 
 storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, 
 ipcPort=50020):DataXceiver
 java.net.SocketTimeoutException: 63000 millis timeout while waiting for 
 channel to be ready for read. ch : 

Re: Error: Too Many Fetch Failures

2012-06-19 Thread Raj Vishwanathan


You probably have a very low somaxconn parameter (the default on CentOS is
128, if I remember correctly). You can check the value under
/proc/sys/net/core/somaxconn

Can you also check the value of ulimit -n? It could be too low.
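
Quick checks, as a sketch (if raising them helps, make the changes permanent in /etc/sysctl.conf and /etc/security/limits.conf):

  cat /proc/sys/net/core/somaxconn        # listen backlog limit, often 128 by default
  ulimit -n                               # open-file limit for the current user/shell
  sudo sysctl -w net.core.somaxconn=1024  # bump the backlog for a quick test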

Raj




 From: Ellis H. Wilson III el...@cse.psu.edu
To: common-user@hadoop.apache.org 
Sent: Tuesday, June 19, 2012 12:32 PM
Subject: Re: Error: Too Many Fetch Failures
 
On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote:
 
 Replies/more questions inline.
 
 
 I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet 
 and each having solely a single hard disk.  I am getting the following 
 error repeatably for the TeraSort benchmark.  TeraGen runs without error, 
 but TeraSort runs predictably until this error pops up between 64% and 70% 
 completion.  This doesn't occur for every execution of the benchmark, as 
 about one out of four times that I run the benchmark it does run to 
 completion (TeraValidate included).
 
 
 How many containers are you running per node?

Per my attached config files, I specify that 
yarn.nodemanager.resource.memory-mb = 3072, and the default /seems/ to be set 
at 1024MB for maps and reducers, so I have 3 containers running per node.  I 
have verified that this indeed is the case in the web client.  Three of these 
1GB slots in the cluster appear to be occupied by something else during the 
execution of TeraSort, so I specify that TeraGen create .5TB using 441 maps 
(3waves * (50nodes * 3containerslots - 3occupiedslots)), and TeraSort to use 
147 reducers.  This seems to give me the guarantees I had with Hadoop 1.0 that 
each node gets an equal number of reducers, and my job doesn't drag on due to 
straggler reducers.

 Clearly maps are getting killed because of fetch failures. Can you look at 
 the logs of the NodeManager where this particular map task ran. That may 
 have logs related to why reducers are not able to fetch map-outputs. It is 
 possible that because you have only one disk per node, some of these nodes 
 have bad or unfunctional disks and thereby causing fetch failures.

I will rerun and report the exact error messages from the NodeManagers.  Can 
you give me more exacting advice on collecting logs of this sort, for as I 
mentioned I'm new to doing so with the new version of Hadoop? I have been 
looking in /tmp/logs and hadoop/logs, but perhaps there is somewhere else to 
look as well?

Last, I am certain this is not related to failing disks, as this exact error 
occurs at much higher frequencies when I run Hadoop on a NAS box, which is the 
core of my research at the moment.  Nevertheless, I posted to this list 
instead of Dev as this was on vanilla CentOS-5.5 machines using just the HDDs 
within each, and therefore should be a highly typical setup.  In particular, I 
see these errors coming from numerous nodes all at once, and the subset of 
nodes giving the problems are not repeatable from one run to the next, though 
the resulting error is.

 If that is the case, either you can offline these nodes or bump up 
 mapreduce.reduce.shuffle.maxfetchfailures to tolerate these failures, the 
 default is 10. There are other some tweaks which I can tell if you can find 
 more details from your logs.

I'd prefer to not bump up maxfetchfailures, and would rather simply fix the 
issue that is causing the fetch to fail in the beginning.  This isn't a large 
cluster, having only 50 nodes, nor are the links (1gig) or storage 
capabilities (1 sata drive) great or strange relative to any normal 
installation.  I have to assume here that I've mis-configured something :(.

Best,

ellis




Re: Map works well, but Reduce failed

2012-06-15 Thread Raj Vishwanathan
Most probably you have a network problem. Check your hostname and IP address
mapping.
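
A few things worth checking on every node, as a sketch:

  hostname -f                     # should match what the daemons bind to and advertise
  getent hosts $(hostname -f)     # forward lookup should return the real IP, not loopback
  cat /etc/hosts                  # watch for 127.0.1.1-style lines mapping the hostname to loopback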




 From: Yongwei Xing jdxyw2...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Thursday, June 14, 2012 10:15 AM
Subject: Map works well, but Reduce failed
 
Hi all

I run a simple sort program, however, I meet such error like below.

12/06/15 01:13:17 WARN mapred.JobClient: Error reading task outputServer
returned HTTP response code: 403 for URL:
http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_01_1&filter=stdout
12/06/15 01:13:18 WARN mapred.JobClient: Error reading task outputServer
returned HTTP response code: 403 for URL:
http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_01_1&filter=stderr
12/06/15 01:13:20 INFO mapred.JobClient:  map 50% reduce 0%
12/06/15 01:13:23 INFO mapred.JobClient:  map 100% reduce 0%
12/06/15 01:14:19 INFO mapred.JobClient: Task Id :
attempt_201206150102_0002_m_00_2, Status : FAILED
Too many fetch-failures
12/06/15 01:14:20 WARN mapred.JobClient: Error reading task outputServer
returned HTTP response code: 403 for URL:
http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_00_2&filter=stdout

Does anyone know what's the reason and how to resolve it?

Best Regards,

-- 
Welcome to my ET Blog http://www.jdxyw.com




Re: Map/Reduce Tasks Fails

2012-05-22 Thread Raj Vishwanathan






 From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, May 22, 2012 7:13 AM
Subject: Re: Map/Reduce Tasks Fails
 
Sandeep,

Is the same DN 10.0.25.149 reported across all failures? And do you
notice any machine patterns when observing the failed tasks (i.e. are
they clumped on any one or a few particular TTs repeatedly)?

On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
sandeepreddy.3...@gmail.com wrote:
 Hi,
  We have a 5-node cdh3u4 cluster running. When I try to do teragen/terasort,
  some of the map tasks are Failed/Killed and the logs show a similar error on
  all machines.

 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
 Exception in createBlockOutputStream 10.0.25.149:50010
 java.net.SocketTimeoutException: 69000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
 remote=/10.0.25.149:50010]
 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
 Abandoning block blk_7260720956806950576_1825
 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
 Excluding datanode 10.0.25.149:50010
 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
 died.  Exiting attempt_201205211504_0007_m_16_1.



  Are these kinds of errors common? At least 1 map task is failing due to the
  above reason on all the machines. We are using 24 mappers for teragen.
  For us it took 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers
  and 17 failed / 8 killed task attempts.

  It took 24 min 10 sec for 5 GB of data with 24 mappers and 9 killed task attempts.
  The cluster works well for small datasets.



-- 
Harsh J




Re: Map/Reduce Tasks Fails

2012-05-22 Thread Raj Vishwanathan
What kind of storage is attached to the data nodes ? This kind of error can 
happen when the CPU is really busy with I/O or interrupts.

Can you run top or dstat on some of the data nodes to see how the system is 
performing?

Raj




 From: Sandeep Reddy P sandeepreddy.3...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, May 22, 2012 7:23 AM
Subject: Re: Map/Reduce Tasks Fails
 
Task Trackers (columns: # running tasks / Max Map Tasks / Max Reduce Tasks /
Task Failures / Directory Failures / Node Health Status / Seconds Since Node
Last Healthy / Total (Succeeded) Tasks Since Start / Total (Succeeded) Tasks
Last Day / Total (Succeeded) Tasks Last Hour / Seconds since heartbeat):

tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (http://hadoop2.liaisondevqa.local:50060/)
  hadoop2.liaisondevqa.local: 0 / 6 / 2 / 22 / 0 / N/A / 0 / 93 (60) / 59 (28) / 64 (38) / 0
tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (http://hadoop4.liaisondevqa.local:50060/)
  hadoop4.liaisondevqa.local: 0 / 6 / 2 / 19 / 0 / N/A / 0 / 91 (59) / 65 (33) / 36 (33) / 0
tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (http://hadoop5.liaisondevqa.local:50060/)
  hadoop5.liaisondevqa.local: 1 / 6 / 2 / 21 / 0 / N/A / 0 / 83 (47) / 69 (35) / 45 (19) / 0
tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (http://hadoop3.liaisondevqa.local:50060/)
  hadoop3.liaisondevqa.local: 0 / 6 / 2 / 18 / 0 / N/A / 0 / 87 (55) / 55 (28) / 57 (34) / 0

Highest Failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22 failures




Re: High load on datanode startup

2012-05-10 Thread Raj Vishwanathan
Darrell

Are the new dn, nn and mapred directories on the same physical disk? Nothing on
NFS, correct?

Could you be having some hardware issue? Any clue in /var/log/messages or dmesg?

A non-responsive system indicates a CPU that is really busy either doing
something or waiting for something, and the fact that it happens only on some
nodes indicates a local problem.

Raj




 From: Darrell Taylor darrell.tay...@gmail.com
To: common-user@hadoop.apache.org 
Cc: Raj Vishwanathan rajv...@yahoo.com 
Sent: Thursday, May 10, 2012 3:57 AM
Subject: Re: High load on datanode startup
 
On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon t...@cloudera.com wrote:

 That's real weird..

 If you can reproduce this after a reboot, I'd recommend letting the DN
  run for a minute, and then capturing a jstack <pid of dn> as well as
  the output of top -H -p <pid of dn> -b -n 5 and send it to the list.


What I did after the reboot this morning was to move the my dn, nn, and
mapred directories out of the the way, create a new one, formatted it, and
restarted the node, it's now happy.

I'll try moving the directories back later and do the jstack as you suggest.



 What JVM/JDK are you using? What OS version?


root@pl446:/# dpkg --get-selections | grep java
java-common                                     install
libjaxp1.3-java                                 install
libjaxp1.3-java-gcj                             install
libmysql-java                                   install
libxerces2-java                                 install
libxerces2-java-gcj                             install
sun-java6-bin                                   install
sun-java6-javadb                                install
sun-java6-jdk                                   install
sun-java6-jre                                   install

root@pl446:/# java -version
java version 1.6.0_26
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

root@pl446:/# cat /etc/issue
Debian GNU/Linux 6.0 \n \l




 -Todd


 On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
 darrell.tay...@gmail.com wrote:
  On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan rajv...@yahoo.com
 wrote:
 
   The picture is either too small or too pixelated for my eyes :-)
 
 
  There should be a zoom option in the top right of the page that allows
 you
  to view it full size
 
 
 
  Can you login to the box and send the output of top? If the system is
  unresponsive, it has to be something more than an unbalanced hdfs
 cluster,
  methinks.
 
 
  Sorry, I'm unable to login to the box, it's completely unresponsive.
 
 
 
  Raj
 
 
 
  
   From: Darrell Taylor darrell.tay...@gmail.com
  To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com
 
  Sent: Wednesday, May 9, 2012 2:40 PM
  Subject: Re: High load on datanode startup
  
  On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan rajv...@yahoo.com
  wrote:
  
   When you say 'load', what do you mean? CPU load or something else?
  
  
  I mean in the unix sense of load average, i.e. top would show a load of
  (currently) 376.
  
  Looking at Ganglia stats for the box it's not CPU load as such, the
 graphs
  shows actual CPU usage as 30%, but the number of running processes is
  simply growing in a linear manner - screen shot of ganglia page here :
  
  
 
 https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
  
  
  
  
   Raj
  
  
  
   
From: Darrell Taylor darrell.tay...@gmail.com
   To: common-user@hadoop.apache.org
   Sent: Wednesday, May 9, 2012 9:52 AM
   Subject: High load on datanode startup
   
   Hi,
   
   I wonder if someone could give some pointers with a problem I'm
 having?
   
   I have a 7 machine cluster setup for testing and we have been
 pouring
  data
    into it for a week without issue, have learnt several things along
 the
  way
   and solved all the problems up to now by searching online, but now
 I'm
   stuck.  One of the data nodes decided to have a load of 70+ this
  morning,
   stopping datanode and tasktracker brought it back to normal, but
 every
   time
   I start the datanode again the load shoots through the roof, and
 all I
  get
   in the logs is :
   
   STARTUP_MSG: Starting DataNode
   
   
   STARTUP_MSG:   host = pl464/10.20.16.64
   
   
   STARTUP_MSG:   args = []
   
   
   STARTUP_MSG:   version = 0.20.2-cdh3u3
   
   
   STARTUP_MSG:   build =
  
  
 
 file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
   -/
   
   
   2012-05-09 16:12:05,925 INFO
   org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
   already
   set up for Hadoop, not re-installing.
   
   2012-05-09 16:12:06,139 INFO
   org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
   already
   set up for Hadoop

Re: High load on datanode startup

2012-05-09 Thread Raj Vishwanathan
When you say 'load', what do you mean? CPU load or something else?

Raj




 From: Darrell Taylor darrell.tay...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Wednesday, May 9, 2012 9:52 AM
Subject: High load on datanode startup
 
Hi,

I wonder if someone could give some pointers with a problem I'm having?

I have a 7 machine cluster setup for testing and we have been pouring data
into it for a week without issue, have learnt several things along the way
and solved all the problems up to now by searching online, but now I'm
stuck.  One of the data nodes decided to have a load of 70+ this morning,
stopping datanode and tasktracker brought it back to normal, but every time
I start the datanode again the load shoots through the roof, and all I get
in the logs is :

STARTUP_MSG: Starting DataNode


STARTUP_MSG:   host = pl464/10.20.16.64


STARTUP_MSG:   args = []


STARTUP_MSG:   version = 0.20.2-cdh3u3


STARTUP_MSG:   build =
file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
-/


2012-05-09 16:12:05,925 INFO
org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already
set up for Hadoop, not re-installing.

2012-05-09 16:12:06,139 INFO
org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already
set up for Hadoop, not re-installing.

Nothing else.

The load seems to max out only 1 of the CPUs, but the machine becomes
*very* unresponsive

Anybody got any pointers of things I can try?

Thanks
Darrell.




Re: High load on datanode startup

2012-05-09 Thread Raj Vishwanathan
The picture is either too small or too pixelated for my eyes :-)

Can you login to the box and send the output of top? If the system is 
unresponsive, it has to be something more than an unbalanced hdfs cluster, 
methinks.

Raj




 From: Darrell Taylor darrell.tay...@gmail.com
To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com 
Sent: Wednesday, May 9, 2012 2:40 PM
Subject: Re: High load on datanode startup
 
On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan rajv...@yahoo.com wrote:

 When you say 'load', what do you mean? CPU load or something else?


I mean in the unix sense of load average, i.e. top would show a load of
(currently) 376.

Looking at Ganglia stats for the box it's not CPU load as such, the graphs
shows actual CPU usage as 30%, but the number of running processes is
simply growing in a linear manner - screen shot of ganglia page here :

https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink




 Raj



 
  From: Darrell Taylor darrell.tay...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Wednesday, May 9, 2012 9:52 AM
 Subject: High load on datanode startup
 
 Hi,
 
 I wonder if someone could give some pointers with a problem I'm having?
 
 I have a 7 machine cluster setup for testing and we have been pouring data
  into it for a week without issue, have learnt several things along the way
 and solved all the problems up to now by searching online, but now I'm
 stuck.  One of the data nodes decided to have a load of 70+ this morning,
 stopping datanode and tasktracker brought it back to normal, but every
 time
 I start the datanode again the load shoots through the roof, and all I get
 in the logs is :
 
 STARTUP_MSG: Starting DataNode
 
 
 STARTUP_MSG:   host = pl464/10.20.16.64
 
 
 STARTUP_MSG:   args = []
 
 
 STARTUP_MSG:   version = 0.20.2-cdh3u3
 
 
 STARTUP_MSG:   build =

 file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
 -/
 
 
 2012-05-09 16:12:05,925 INFO
 org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
 already
 set up for Hadoop, not re-installing.
 
 2012-05-09 16:12:06,139 INFO
 org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
 already
 set up for Hadoop, not re-installing.
 
 Nothing else.
 
 The load seems to max out only 1 of the CPUs, but the machine becomes
 *very* unresponsive
 
 Anybody got any pointers of things I can try?
 
 Thanks
 Darrell.
 
 
 





Re: Reduce Hangs at 66%

2012-05-03 Thread Raj Vishwanathan
Keith

What is the output of ulimit -n? Your value for the number of open files is
probably too low.
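
To check and raise it, a sketch (the limit has to be raised for the user that actually runs the TaskTracker/DataNode daemons; 'mapred' and 'hdfs' below are assumptions, and the daemons need a restart afterwards):

  ulimit -n                                                                # current soft limit
  echo 'mapred  -  nofile  32768' | sudo tee -a /etc/security/limits.conf
  echo 'hdfs    -  nofile  32768' | sudo tee -a /etc/security/limits.conf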

Raj





 From: Keith Thompson kthom...@binghamton.edu
To: common-user@hadoop.apache.org 
Sent: Thursday, May 3, 2012 4:33 PM
Subject: Re: Reduce Hangs at 66%
 
I am not sure about ulimits, but I can answer the rest. It's a Cloudera
distribution of Hadoop 0.20.2. The HDFS has 9 TB free. In the reduce step,
I am taking keys in the form of (gridID, date), each with a value of 1. The
reduce step just sums the 1's as the final output value for the key (It's
counting how many people were in the gridID on a certain day).

I have been running other more complicated jobs with no problem, so I'm not
sure why this one is being peculiar. This is the code I used to execute the
program from the command line (the source is a file on the hdfs):

hadoop jar <jarfile> <driver> <source> /thompson/outputDensity/density1

The program then executes the map and gets to 66% of the reduce, then stops
responding. The cause of the error seems to be:

Error from attempt_201202240659_6432_r_00_1: java.io.IOException:
 The temporary job-output directory
 hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary
 doesn't exist!

I don't understand what the _temporary is. I am assuming it's something
Hadoop creates automatically.



On Thu, May 3, 2012 at 5:02 AM, Michel Segel michael_se...@hotmail.comwrote:

 Well...
 Lots of information but still missing some of the basics...

 Which release and version?
 What are your ulimits set to?
 How much free disk space do you have?
 What are you attempting to do?

 Stuff like that.



 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On May 2, 2012, at 4:49 PM, Keith Thompson kthom...@binghamton.edu
 wrote:

  I am running a task which gets to 66% of the Reduce step and then hangs
  indefinitely. Here is the log file (I apologize if I am putting too much
  here but I am not exactly sure what is relevant):
 
  2012-05-02 16:42:52,975 INFO org.apache.hadoop.mapred.JobTracker:
  Adding task (REDUCE) 'attempt_201202240659_6433_r_00_0' to tip
  task_201202240659_6433_r_00, for tracker
  'tracker_analytix7:localhost.localdomain/127.0.0.1:56515'
  2012-05-02 16:42:53,584 INFO org.apache.hadoop.mapred.JobInProgress:
  Task 'attempt_201202240659_6433_m_01_0' has completed
  task_201202240659_6433_m_01 successfully.
  2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.TaskInProgress:
  Error from attempt_201202240659_6432_r_00_0: Task
  attempt_201202240659_6432_r_00_0 failed to report status for 1800
  seconds. Killing!
  2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker:
  Removing task 'attempt_201202240659_6432_r_00_0'
  2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker:
  Adding task (TASK_CLEANUP) 'attempt_201202240659_6432_r_00_0' to
  tip task_201202240659_6432_r_00, for tracker
  'tracker_analytix4:localhost.localdomain/127.0.0.1:44204'
  2012-05-02 17:00:48,763 INFO org.apache.hadoop.mapred.JobTracker:
  Removing task 'attempt_201202240659_6432_r_00_0'
  2012-05-02 17:00:48,957 INFO org.apache.hadoop.mapred.JobTracker:
  Adding task (REDUCE) 'attempt_201202240659_6432_r_00_1' to tip
  task_201202240659_6432_r_00, for tracker
  'tracker_analytix5:localhost.localdomain/127.0.0.1:59117'
  2012-05-02 17:00:56,559 INFO org.apache.hadoop.mapred.TaskInProgress:
  Error from attempt_201202240659_6432_r_00_1: java.io.IOException:
  The temporary job-output directory
  hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary
  doesn't exist!
     at
 org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
     at
 org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)
     at
 org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
     at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
     at org.apache.hadoop.mapred.Child.main(Child.java:262)
 
  2012-05-02 17:00:59,903 INFO org.apache.hadoop.mapred.JobTracker:
  Removing task 'attempt_201202240659_6432_r_00_1'
  2012-05-02 17:00:59,906 INFO org.apache.hadoop.mapred.JobTracker:
  Adding task (REDUCE) 'attempt_201202240659_6432_r_00_2' to tip
  task_201202240659_6432_r_00, for tracker
  'tracker_analytix3:localhost.localdomain/127.0.0.1:39980'
  2012-05-02 17:01:07,200 INFO org.apache.hadoop.mapred.TaskInProgress:
  Error from attempt_201202240659_6432_r_00_2: java.io.IOException:
  The temporary job-output directory
  

Re: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Raj Vishwanathan
Sunil

You could use identity mappers, a single identity reducer, and no
output compression.
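
A rough streaming sketch of that shape, assuming the extracted text already sits in HDFS under /data/extracted (a hypothetical path; the streaming jar location also varies by install). Here 'cat' plays the identity role for both map and reduce; note the shuffle will leave the final file sorted by line.

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=1 \
    -D mapred.output.compress=false \
    -input /data/extracted \
    -output /data/merged \
    -mapper cat \
    -reducer cat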

Raj




 From: Sunil S Nandihalli sunil.nandiha...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, April 24, 2012 7:01 AM
Subject: Re: hadoop streaming and a directory containing large number of .tgz 
files
 
Sorry for reforwarding this email. I was not sure if it actually got
through since I just got the confirmation regarding my membership to the
mailing list.
Thanks,
Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli 
sunil.nandiha...@gmail.com wrote:

 Hi Everybody,
  I am a newbie to hadoop. I have about 40K .tgz files each of
 approximately 3MB . I would like to process this as if it were a single
 large file formed by
  cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt
 how can I achieve this using hadoop-streaming or some-other similar
 library..


 thanks,
 Sunil.





Re: Hive Thrift help

2012-04-16 Thread Raj Vishwanathan
To verify that a server is running on a port (10,000 in this case) and to
ensure that there are no firewall issues,
run
telnet servername 10000

The connection should succeed.

Raj




 From: Edward Capriolo edlinuxg...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Monday, April 16, 2012 2:55 PM
Subject: Re: Hive Thrift help
 
You can NOT connect to Hive Thrift with a web browser to confirm its status.
Thrift is Thrift, not HTTP. But you are right to say HiveServer does not produce
any output by default.

If

netstat -nl | grep 10000

shows the port listening, it is up.


On Mon, Apr 16, 2012 at 5:18 PM, Rahul Jain rja...@gmail.com wrote:
 I am assuming you read thru:

 https://cwiki.apache.org/Hive/hiveserver.html

 The server comes up on port 10,000 by default, did you verify that it is
 actually listening on the port ?  You can also connect to hive server using
 web browser to confirm its status.

 -Rahul

 On Mon, Apr 16, 2012 at 1:53 PM, Michael Wang 
 michael.w...@meredith.comwrote:

 we need to connect to HIVE from Microstrategy reports, and it requires the
 Hive Thrift server. But I
 tried to start it, and it just hangs as below.
 # hive --service hiveserver
 Starting Hive Thrift Server
 Any ideas?
 Thanks,
 Michael

 This electronic message, including any attachments, may contain
 proprietary, confidential or privileged information for the sole use of the
 intended recipient(s). You are hereby notified that any unauthorized
 disclosure, copying, distribution, or use of this message is prohibited. If
 you have received this message in error, please immediately notify the
 sender by reply e-mail and delete it.





Re: Is TeraGen's generated data deterministic?

2012-04-14 Thread Raj Vishwanathan
David

Since the data generation and the sorting are different Hadoop jobs, you can
generate the data once and sort the same data as many times as you want.

I don't think TeraGen is deterministic (or rather, the keys are random but
the text is deterministic, if I remember correctly).
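
So for repeatable runs the usual pattern is to generate one input and reuse it; a sketch, assuming the 0.20-era examples jar name (adjust the jar path and row count for your install):

  hadoop jar $HADOOP_HOME/hadoop-*-examples.jar teragen 10000000 /bench/teragen-input
  hadoop jar $HADOOP_HOME/hadoop-*-examples.jar terasort /bench/teragen-input /bench/terasort-run1
  hadoop jar $HADOOP_HOME/hadoop-*-examples.jar terasort /bench/teragen-input /bench/terasort-run2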



Raj




 From: David Erickson halcyon1...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Saturday, April 14, 2012 1:53 PM
Subject: Is TeraGen's generated data deterministic?
 
Hi we are doing some benchmarking of some of our infrastructure and
are using TeraGen/TeraSort to do the benchmarking.  I am wondering if
the data generated by TeraGen is deterministic, in that if I repeat
the same experiment multiple times with the same configuration options
if it will continue to generate and sort the exact same data?  And if
not, is there an easy mod to make this happen?

Thanks!
David




Re: Map Reduce Job Help

2012-04-11 Thread Raj Vishwanathan
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 
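
The linked tutorial's approach boils down to a mapper.py and a reducer.py that you write yourself, wired together with Hadoop streaming; a sketch of the invocation (the HDFS paths are placeholders and the streaming jar location varies by install):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/you/emails \
    -output /user/you/email-index \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py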





 From: hellooperator silversl...@gmail.com
To: core-u...@hadoop.apache.org 
Sent: Wednesday, April 11, 2012 11:15 AM
Subject: Map Reduce Job Help
 

Hello,

I'm just starting out with Hadoop and writing some Map Reduce jobs. I was
looking for help on writing an MR job in Python that takes some
emails, puts them into HDFS, and lets me search the text or attachments of
the emails.

Thank you!
-- 
View this message in context: 
http://old.nabble.com/Map-Reduce-Job-Help-tp33670645p33670645.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: opensuse 12.1

2012-04-04 Thread Raj Vishwanathan
Lots of people seem to start with this.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 


Raj




 From: Barry, Sean F sean.f.ba...@intel.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Sent: Wednesday, April 4, 2012 9:12 AM
Subject: FW: opensuse 12.1
 


-Original Message-
From: Barry, Sean F [mailto:sean.f.ba...@intel.com] 
Sent: Wednesday, April 04, 2012 9:10 AM
To: common-user@hadoop.apache.org
Subject: opensuse 12.1

                What is the best way to install hadoop on openSUSE 12.1 for a 
small two-node cluster?

-SB




Re: data distribution in HDFS

2012-04-02 Thread Raj Vishwanathan
Stijn,

The first block of the data is always stored on the local node. Assuming that
you have a replication factor of 3, the node that generates the data will get
about 10GB of data and the other 20GB will be distributed among the other nodes;
with 4 other datanodes that works out to roughly 5GB each, which matches the
4.2-4.8GB vs 9.4GB split you are seeing.

Raj 






 From: Stijn De Weirdt stijn.dewei...@ugent.be
To: common-user@hadoop.apache.org 
Sent: Monday, April 2, 2012 9:54 AM
Subject: data distribution in HDFS
 
hi all,

i've just started to play around with hdfs+mapred. i'm currently playing with 
teragen/sort/validate to see if i understand it all.

the test setup involves 5 nodes that are all tasktracker and datanode, plus one 
node that is also jobtracker and namenode on top of that (this one node is 
running both the namenode hadoop process and the datanode process).

when i do the teragen run, the data is not distributed equally over all 
nodes. the node that is also namenode gets a bigger portion of all the data 
(as seen by df on the nodes and by using dfsadmin -report).
i also get this distribution when i ran the TestDFSIO write test (50 files of 
1GB)


i use basic command line  teragen $((100*1000*1000)) /benchmarks/teragen, so i 
expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's 
actually quite a bit more.)
4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so 
this one datanode is seen as 2 nodes.

when i do ls on the filesystem, i see that teragen created 250MB files, the 
current hdfs blocksize is 64MB.

is there a reason why one datanode is preferred over the others?
it is annoying since the terasort output behaves the same, and i can't use the 
full hdfs space for testing that way. also, since more IO comes to this one 
node, the performance isn't really balanced.

many thanks,

stijn




Re: data distribution in HDFS

2012-04-02 Thread Raj Vishwanathan
AFAIK there is no way to disable this feature . This is an optimization. It 
happens because in your case the node generating the data is also a data node.

Raj




 From: Stijn De Weirdt stijn.dewei...@ugent.be
To: common-user@hadoop.apache.org 
Sent: Monday, April 2, 2012 12:18 PM
Subject: Re: data distribution in HDFS
 
thanks serge.


is there a way to disable this feature (ie place first block always on 
local node)?
and is this because the local node is a datanode? or is there always a 
local node with datatransfers?

many thanks,

stijn

  Local node is the node you are copying the data from

 If lets say you are using -copyFromLocal option


 Regards
 Serge

 On 4/2/12 11:53 AM, Stijn De Weirdtstijn.dewei...@ugent.be  wrote:

 hi raj,

 what is a local node? is it relative to the tasks that are started?


 stijn

 On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
 Stijn,

  The first block of the data is always stored on the local node.
 Assuming that you had a replication factor of 3, the node that generates
 the data will get about 10GB of data and the other 20GB will be
 distributed among other nodes.

 Raj





 
 From: Stijn De Weirdtstijn.dewei...@ugent.be
 To: common-user@hadoop.apache.org
 Sent: Monday, April 2, 2012 9:54 AM
 Subject: data distribution in HDFS

 hi all,

 i'm just started to play around with hdfs+mapred. i'm currently
 playing with teragen/sort/validate to see if i understand all.

 the test setup involves 5 nodes that all are tasktracker and datanode
 (and one node that is also jobtracker and namenode on top of that.
 (this one node is running both the namenode hadoop process as the
 datanode process)

 when i do the in teragen run, the data is not distributed equally over
 all nodes. the node that is also namenode, get's a bigger portion of
 all the data. (as seen by df on the nodes and by using dsfadmin -report)
 i also get this distribution when i ran the TestDFSIO write test (50
 files of 1GB)


 i use basic command line  teragen $((100*1000*1000))
 /benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add
 the volumes in use by hdfs, it's actually quite a bit more.)
 4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in
 use. so this one datanode is seen as 2 nodes.

 when i do ls on the filesystem, i see that teragen created 250MB
 files, the current hdfs blocksize is 64MB.

 is there a reason why one datanode is preferred over the others.
 it is annoying since the terasort output behaves the same, and i can't
 use the full hdfs space for testing that way. also, since more IO comes
 to this one node, the performance isn't really balanced.

 many thanks,

 stijn











Re: How to modify hadoop-wordcount example to display File-wise results.

2012-03-29 Thread Raj Vishwanathan
Aaron 

You can get the details of how much data each mapper processed, and on which node
(IP address, actually!), from the job logs.

Raj





 From: Ajay Srivastava ajay.srivast...@guavus.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Cc: core-u...@hadoop.apache.org core-u...@hadoop.apache.org 
Sent: Thursday, March 29, 2012 5:57 PM
Subject: Re: How to modify hadoop-wordcount example to display File-wise 
results.
 
Hi Aaron,
I guess that it can be done by using counters.
You can define a counter for each node in your cluster and then, in the map method,
increment a node-specific counter by checking the hostname or IP address.
It's not a very good solution, as you will need to modify your code whenever a
node is added to or removed from the cluster, and there will be as many if conditions in
the code as there are nodes. You can try this out if you do not find a cleaner
solution. I wish this counter had been part of the predefined
counters.


Regards,
Ajay Srivastava


On 30-Mar-2012, at 12:49 AM, aaron_v wrote:

 
  Hi people, I am new to Nabble and Hadoop. I was having a look at the wordcount
  program. Can someone please let me know how to find which data gets mapped
  to which node? That is, I have a master node 0 and 4 other nodes 1-4,
  and I ran the wordcount successfully. But I would like to print, for each
  node, how much data it got from the input data file. Any suggestions?
 
 us latha wrote:
 
 Hi,
 
  Inside the Map method, I performed the following change for Example: WordCount
  v1.0 (http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0) at
  http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
 --
 String filename = new String();
 ...
 filename =  ((FileSplit) reporter.getInputSplit()).getPath().toString();
 while (tokenizer.hasMoreTokens()) {
             word.set(tokenizer.nextToken() + " " + filename);
 
 
 Worked great!! Thanks to everyone!
 
 Regards,
 Srilatha
 
 
 On Sat, Oct 18, 2008 at 6:24 PM, Latha usla...@gmail.com wrote:
 
 Hi All,
 
 Thankyou for your valuable inputs in suggesting me the possible solutions
 of creating an index file with following format.
 word1 filename count
 word2 filename count.
 
 However, following is not working for me. Please help me to resolve the
 same.
 
 --
  public static class Map extends MapReduceBase implements
  Mapper<LongWritable, Text, Text, Text> {
           private Text word = new Text();
           private Text filename = new Text();
           public void map(LongWritable key, Text value,
  OutputCollector<Text, Text> output, Reporter reporter) throws
  IOException {
          filename.set( ((FileSplit)
 reporter.getInputSplit()).getPath().toString());
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
               word.set(tokenizer.nextToken());
               output.collect(word, filename);
              }
          }
  }
 
   public static class Reduce extends MapReduceBase implements
  Reducer<Text,
  Text, Text, Text> {
       public void reduce(Text key, Iterator<Text> values,
  OutputCollector<Text, Text> output, Reporter reporter) throws
  IOException {
          int sum = 0;
          Text filename = new Text();   // must be initialized before set() is called
          while (values.hasNext()) {
              sum++;
              filename.set(values.next().toString());
          }
        String file = filename.toString() + " " + (new
  IntWritable(sum)).toString();
        filename = new Text(file);
        output.collect(key, filename);
        }
   }
 
 --
 08/10/18 05:38:25 INFO mapred.JobClient: Task Id :
 task_200810170342_0010_m_00_2, Status : FAILED
 java.io.IOException: Type mismatch in value from map: expected
 org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
        at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
        at org.myorg.WordCount$Map.map(WordCount.java:23)
        at org.myorg.WordCount$Map.map(WordCount.java:13)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
 
 
 Thanks
 Srilatha
 
 
 
 On Mon, Oct 6, 2008 at 11:38 AM, Owen O'Malley omal...@apache.org
 wrote:
 
 On Sun, Oct 5, 2008 at 12:46 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 What you need to do is snag access to the filename in the configure
 method
 of the mapper.
 
 
 You can also do it in the map method with:
 
 ((FileSplit) reporter.getInputSplit()).getPath()
 
 
 Then instead of outputting just the word as the key, output a pair
 containing the word and the file name as the key.  Everything
 downstream
 should remain the same.
 
 
 If you want to have each file handled by a single reduce, I'd suggest:
 
 class FileWordPair implements Writable {
 

Re: Separating mapper intermediate files

2012-03-27 Thread Raj Vishwanathan
Aayush

You can use the following. Just play around with the pattern

 <property>
  <name>keep.task.files.pattern</name>
  <value>.*_m_123456_0</value>
  <description>Keep all files from tasks whose task names match the given
               regular expression. Defaults to none.</description>
  </property>
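
With that set, the kept map outputs and spills stay under the task attempt directories in mapred.local.dir. The exact layout varies a bit by version and security settings, so treat this as a sketch of where to look:

  find /path/to/mapred/local -path '*jobcache*' \( -name 'file.out' -o -name 'spill*.out' \) -ls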


Raj




 From: aayush aayushgupta...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, March 27, 2012 5:18 AM
Subject: Re: Separating mapper intermediate files
 
Thanks Harsh.

I set the mapred.local.dir as you suggested. It creates 4 folders in it for
jobtracker, tasktracker, tt_private etc. I could not see an attempt directory.
Can you let me know exactly where to look in this directory structure?

Furthermore, it seems that all the intermediate spill and map output files are
cleaned up when the mapper finishes. I want to see those intermediate files
and don't want them cleaned up. How can I achieve that?

Thanks a lot

On Mar 27, 2012, at 1:16 AM, Harsh J-2 [via Hadoop 
Common]ml-node+s472056n3860389...@n3.nabble.com wrote:

 Hello Aayush, 
 
 Three things that'd help clear your confusion: 
 1. dfs.data.dir controls where HDFS blocks are to be stored. Set this 
 to a partition1 path. 
 2. mapred.local.dir controls where intermediate task data go to. Set 
 this to a partition2 path. 
 
  Furthermore, can someone also tell me how to save intermediate mapper 
  files(spill outputs) and where are they saved. 
 
 Intermediate outputs are handled by the framework itself (There is no 
 user/manual work involved), and are saved inside attempt directories 
 under mapred.local.dir. 
 
 On Tue, Mar 27, 2012 at 4:46 AM, aayush [hidden email] wrote: 
  I am a newbie to Hadoop and map reduce. I am running a single node hadoop 
  setup. I have created 2 partitions on my HDD. I want the mapper 
  intermediate 
  files (i.e. the spill files and the mapper output) to be sent to a file 
  system on Partition1 whereas everything else including HDFS should be run 
  on 
  partition2. I am struggling to find the appropriate parametes in the conf 
  files. I understand that there is hadoop.tmp.dir and mapred.local.dir but 
  am 
  not sure how to use what. I would really appreciate if someone could tell 
  me 
  exactly which parameters to modify to achieve the goal. 
 
 -- 
 Harsh J 
 
 


--
View this message in context: 
http://hadoop-common.472056.n3.nabble.com/Separating-mapper-intermediate-files-tp3859787p3861159.html
Sent from the Users mailing list archive at Nabble.com.




Re: Hadoop pain points?

2012-03-02 Thread Raj Vishwanathan
Lol!

Raj




 From: Mike Spreitzer mspre...@us.ibm.com
To: common-user@hadoop.apache.org 
Sent: Friday, March 2, 2012 8:31 AM
Subject: Re: Hadoop pain points?
 
Interesting question.  Do you want to be asking those who use Hadoop --- 
or those who find it too painful to use?

Regards,
Mike



From:   Kunaal kunalbha...@alumni.cmu.edu
To:    common-user@hadoop.apache.org
Date:   03/02/2012 11:23 AM
Subject:        Hadoop pain points?
Sent by:        kunaalbha...@gmail.com



I am doing a general poll on what are the most prevalent pain points that
people run into with Hadoop? These could be performance related (memory
usage, IO latencies), usage related or anything really.

The goal is to look for what areas this platform could benefit the most in
the near future.

Any feedback is much appreciated.

Thanks,
Kunal.





Re: Adding nodes

2012-03-01 Thread Raj Vishwanathan
The master and slave files, if I remember correctly, are used to start the 
correct daemons on the correct nodes from the master node.


Raj



 From: Joey Echeverria j...@cloudera.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Sent: Thursday, March 1, 2012 4:57 PM
Subject: Re: Adding nodes
 
Not quite. Datanodes get the namenode host from fs.default.name in 
core-site.xml. Task trackers find the job tracker from the mapred.job.tracker 
setting in mapred-site.xml. 

Sent from my iPhone

On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:
 
 You only have to refresh nodes if you're making use of an allows file.
 
 Thanks does it mean that when tasktracker/datanode starts up it
 communicates with namenode using master file?
 
 Sent from my iPhone
 
 On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:
 
 Is this the right procedure to add nodes? I took some from hadoop wiki
 FAQ:
 
 http://wiki.apache.org/hadoop/FAQ
 
 1. Update conf/slaves
 2. on the slave nodes start datanode and tasktracker
 3. hadoop balancer
 
 Do I also need to run dfsadmin -refreshnodes?
 




Re: Adding nodes

2012-03-01 Thread Raj Vishwanathan
What Joey said is correct for both the Apache and Cloudera distros. The DN/TT
daemons will connect to the NN/JT using the config files. The master and slave
files are used for starting the correct daemons.




 From: anil gupta anilg...@buffalo.edu
To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com 
Sent: Thursday, March 1, 2012 5:42 PM
Subject: Re: Adding nodes
 
Whatever Joey said is correct for Cloudera's distribution. I am not confident
about the same for other distributions, as I haven't tried them.

Thanks,
Anil

On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan rajv...@yahoo.com wrote:

 The master and slave files, if I remember correctly, are used to start the
 correct daemons on the correct nodes from the master node.


 Raj


 
  From: Joey Echeverria j...@cloudera.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Thursday, March 1, 2012 4:57 PM
 Subject: Re: Adding nodes
 
  Not quite. Datanodes get the namenode host from fs.default.name in
 core-site.xml. Task trackers find the job tracker from the
 mapred.job.tracker setting in mapred-site.xml.
 
 Sent from my iPhone
 
 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:
 
  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com
 wrote:
 
  You only have to refresh nodes if you're making use of an allows file.
 
  Thanks does it mean that when tasktracker/datanode starts up it
  communicates with namenode using master file?
 
  Sent from my iPhone
 
  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  Is this the right procedure to add nodes? I took some from hadoop wiki
  FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
   1. Update conf/slaves
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?
 
 
 
 




-- 
Thanks  Regards,
Anil Gupta
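
To double-check which master addresses a newly added slave will actually use,
the two properties Joey mentioned can be read straight from the configuration
on that node. A small sketch (the class name is illustrative and nothing
distro-specific is assumed; run it with the cluster's conf directory on the
classpath):

import org.apache.hadoop.mapred.JobConf;

public class ShowMasters {                         // name is illustrative
  public static void main(String[] args) {
    // JobConf loads core-site.xml and mapred-site.xml from the classpath.
    JobConf conf = new JobConf();
    System.out.println("NameNode   (fs.default.name)   : " + conf.get("fs.default.name"));
    System.out.println("JobTracker (mapred.job.tracker): " + conf.get("mapred.job.tracker"));
  }
}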




Re: Error in Formatting NameNode

2012-02-12 Thread Raj Vishwanathan
Manish

If you read the error message, it says connection refused. Big clue :-) 

You probably have a firewall configured. 

Raj

Sent from my iPad
Please excuse the typos. 

On Feb 12, 2012, at 1:41 AM, Manish Maheshwari mylogi...@gmail.com wrote:

 Thanks,
 
 I tried with hadoop-1.0.0 and JRE6 and things are looking good. I was able
 to format the namenode and bring up the NameNode 'calvin-PC:47110' and
 Hadoop Map/Reduce Administration webpages.
 
 Further i tried the example of TestDFSIO but get the below error of
 connection refused.
 
 -bash-4.1$ cd share/hadoop
 -bash-4.1$ ../../bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO –write
 –nrFiles 1 –filesize 10
 Warning: $HADOOP_HOME is deprecated.
 
 TestDFSIO.0.0.4
 12/02/12 15:05:08 INFO fs.TestDFSIO: nrFiles = 1
 12/02/12 15:05:08 INFO fs.TestDFSIO: fileSize (MB) = 1
 12/02/12 15:05:08 INFO fs.TestDFSIO: bufferSize = 100
 12/02/12 15:05:08 INFO fs.TestDFSIO: creating control file: 1 mega bytes, 1
 files
 12/02/12 15:05:08 INFO fs.TestDFSIO: created control files for: 1 files
 12/02/12 15:05:11 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 0 time(s).
 12/02/12 15:05:13 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 1 time(s).
 12/02/12 15:05:15 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 2 time(s).
 12/02/12 15:05:17 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 3 time(s).
 12/02/12 15:05:19 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 4 time(s).
 12/02/12 15:05:21 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 5 time(s).
 12/02/12 15:05:23 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 6 time(s).
 12/02/12 15:05:25 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 7 time(s).
 12/02/12 15:05:27 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 8 time(s).
 12/02/12 15:05:29 INFO ipc.Client: Retrying connect to server: calvin-PC/
 127.0.0.1:8021. Already tried 9 time(s).
 java.net.ConnectException: Call to calvin-PC/127.0.0.1:8021 failed on
 connection exception: java.net.ConnectException: Connection refused: no
 further information
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
at org.apache.hadoop.ipc.Client.call(Client.java:1071)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown
 Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at
 org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:457)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1260)
at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:257)
at org.apache.hadoop.fs.TestDFSIO.readTest(TestDFSIO.java:295)
at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:459)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at
 org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by: java.net.ConnectException: Connection refused: no further
 information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
at
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
at
 

Re: Combining MultithreadedMapper threadpool size map.tasks.maximum

2012-02-10 Thread Raj Vishwanathan



Here is what I understand 

The RecordReader for the MTMapper takes the input split and cycles the records 
among the available threads. It also ensures that the map outputs are 
synchronized. 

So what Bejoy says is what will happen for the wordcount program. 

Raj
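
As a concrete illustration of the above, this is roughly how a word count job
would be wired up with the new-API MultithreadedMapper: the real map logic
lives in an ordinary (thread-safe) Mapper, MultithreadedMapper wraps it, and
each call to map() still receives a whole record (one line of text), which is
what gets handed out to the thread pool. A minimal sketch only, with
illustrative class names:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MTWordCount {                         // names are illustrative

  // Ordinary word count mapper; it must be thread safe, since several
  // threads of the same task will call map() concurrently.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        context.write(new Text(tokenizer.nextToken()), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "multithreaded wordcount");
    job.setJarByClass(MTWordCount.class);

    // MultithreadedMapper is the mapper the framework runs; it delegates
    // each record to one of N threads running TokenMapper.map().
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, TokenMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 2);

    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}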




 From: bejoy.had...@gmail.com bejoy.had...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Friday, February 10, 2012 11:15 AM
Subject: Re: Combining MultithreadedMapper threadpool size  map.tasks.maximum
 
Hi Rob
       I'd try to answer this. From my understanding if you are using 
Multithreaded mapper on word count example with TextInputFormat and imagine 
you have 2 threads and 2 lines in your input split . RecordReader would read 
Line 1 and give it to map thread 1 and line 2 to map thread 2. So kind of 
identical process as defined would be happening with these two lines in 
parallel. This would be the default behavior.
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Rob Stewart robstewar...@gmail.com
Date: Fri, 10 Feb 2012 18:39:44 
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Combining MultithreadedMapper threadpool size  map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote:
 Hello again,

 On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The (k,v) to the map is (null, "foo bar lambda
 beta"). The canonical Hadoop program would tokenize this line of text
 and output (foo, 1) and so on. How would the MultithreadedMapper know
 how to further divide this line of text into, say: [(null, "foo
 bar"), (null, "lambda beta")] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

 In MultithreadedMapper, the IO work is still single threaded, while
 the map() calling post-read is multithreaded. But yes you could use a
 mix of CombineFileInputFormat and some custom logic to have multiple
 local splits per map task, and divide readers of them among your
 threads. But why do all this when thats what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the
input value into chunks, one chunk for each thread. There is only one
example in the Hadoop 0.23 trunk that offers an example:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per
map task at all. Again, going back to the word count example. Given a
line of text as input to a map, which comprises of 6 words. I
specificy .setNumberOfThreads( 2 ), so ideally, I'd want 3 words
analysed by one thread, and the 3 to the other. Is what what would
happen? i.e. - I'm unsure whether the multithreadedmapper class does
the splitting of inputs to map tasks...

Regards,




Re: reference document which properties are set in which configuration file

2012-02-10 Thread Raj Vishwanathan
Harsh, All

This was one of the first questions that I asked. It is sometimes not clear
whether some parameters are site related or job related, or whether they belong
to the NN, JT, DN or TT.

If I get some time during the weekend, I will try and put this into a document
and see if it helps.

Raj




 From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org 
Sent: Friday, February 10, 2012 8:31 PM
Subject: Re: reference document which properties are set in which 
configuration file
 
As a thumb rule, all properties starting with mapred.* or mapreduce.*
go to mapred-site.xml, all properties starting with dfs.* go to
hdfs-site.xml, and the rest may be put in core-site.xml to be safe.

In case you notice MR or HDFS specific properties being outside of
this naming convention, please do report a JIRA so we can deprecate
the old name and rename it with a more appropriate prefix.

On Sat, Feb 11, 2012 at 9:27 AM, Praveen Sripati
praveensrip...@gmail.com wrote:
 The mapred.task.tracker.http.address will go in the mapred-site.xml file.

 In the Hadoop installation directory check the core-default.xml,
 hdfs-default,xml and mapred-default.xml files to know about the different
 properties. Some of the properties which might be in the code may not be
 mentioned in the xml files and will be defaulted.

 Praveen

 On Tue, Feb 7, 2012 at 3:30 PM, Kleegrewe, Christian 
 christian.kleegr...@siemens.com wrote:

 Dear all,

 while configuring our hadoop cluster I wonder whether there exists a
 reference document that contains information about which configuration
 property has to be specified in which properties file. Especially I do not
 know where the mapred.task.tracker.http.address has to be set. Is it in the
 mapre-site.xml or in the hdfs-site.xml?

 any hint will be appreciated

 thanks

 Christian


 8--
 Siemens AG
 Corporate Technology
 Corporate Research and Technologies
 CT T DE IT3
 Otto-Hahn-Ring 6
 81739 München, Deutschland
 Tel.: +49 89 636-42722
 Fax: +49 89 636-41423
 mailto:christian.kleegr...@siemens.com

 Siemens Aktiengesellschaft: Chairman of the Supervisory Board: Gerhard
 Cromme; Managing Board: Peter Löscher, Chairman; Roland Busch, Brigitte
 Ederer, Klaus Helmrich, Joe Kaeser, Barbara Kux, Hermann Requardt,
 Siegfried Russwurm, Peter Y. Solmssen, Michael Süß; Registered offices:
 Berlin and Munich, Germany; Registration courts: Berlin Charlottenburg,
 HRB 12300, Munich, HRB 6684; WEEE Reg. No. DE 23691322






-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about




Re: Can I write to an compressed file which is located in hdfs?

2012-02-07 Thread Raj Vishwanathan
Hi

Here is a piece of code that does the reverse of what you want; it takes a 
bunch of compressed files ( gzip, in this case ) and converts them to text.

You can tweak the code to do the reverse

http://pastebin.com/mBHVHtrm 



Raj
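
Going the other direction (writing a compressed file into hdfs from a local
stream), the pattern Bejoy describes below (get the Hadoop conf, pick a codec,
write through a CompressionOutputStream) looks roughly like this. A minimal
sketch only; the class name and output path are illustrative. Note that it
creates a new .gz file each time, since appending to an existing compressed
file in hdfs is not supported, as discussed in this thread.

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipHdfsWriter {                      // name is illustrative
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec =
        ReflectionUtils.newInstance(GzipCodec.class, conf);

    Path out = new Path("/logs/2012-02-07/app.log.gz");   // illustrative path
    OutputStream os = codec.createOutputStream(fs.create(out));
    try {
      // Copy everything from stdin into the compressed hdfs file.
      IOUtils.copyBytes(System.in, os, 4096, false);
    } finally {
      os.close();
      fs.close();
    }
  }
}

Run it on a node that has the cluster configuration on its classpath, piping a
log file into it.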






 From: Xiaobin She xiaobin...@gmail.com
To: bejoy.had...@gmail.com 
Cc: common-user@hadoop.apache.org; David Sinclair 
dsincl...@chariotsolutions.com 
Sent: Tuesday, February 7, 2012 1:11 AM
Subject: Re: Can I write to an compressed file which is located in hdfs?
 
thank you Bejoy, I will look at that book.

Thanks again!



2012/2/7 bejoy.had...@gmail.com

 **
 Hi
 AFAIK I don't think it is possible to append into a compressed file.

 If you have files in hdfs on a dir and you need to compress the same (like
 files for an hour) you can use MapReduce to do that by setting
 mapred.output.compress = true and
 mapred.output.compression.codec='theCodecYouPrefer'
 You'd get the blocks compressed in the output dir.

 You can use the API to read from standard input like
 -get hadoop conf
 -register the required compression codec
 -write to CompressionOutputStream.

 You should get a well detailed explanation on the same from the book
 'Hadoop - The definitive guide' by Tom White.
 Regards
 Bejoy K S

 From handheld, Please excuse typos.
 --
 *From: * Xiaobin She xiaobin...@gmail.com
 *Date: *Tue, 7 Feb 2012 14:24:01 +0800
 *To: *common-user@hadoop.apache.org; bejoy.had...@gmail.com; David
 Sinclairdsincl...@chariotsolutions.com
 *Subject: *Re: Can I write to an compressed file which is located in hdfs?

 hi Bejoy and David,

 thank you for you help.

 So I can't directly write logs or append logs to a compressed file in
 hdfs, right?

 Can I compress a file which is already in hdfs and has not been
 compressed?

 If I can , how can I do that?

 Thanks!



 2012/2/6 bejoy.had...@gmail.com

 Hi
       I agree with David on the point, you can achieve step 1 of my
 previous response with flume. ie load real time inflow of data in
 compressed format into hdfs. You can specify a time interval or data size
 in flume collector that determines when to flush data on to hdfs.

 Regards
 Bejoy K S

 From handheld, Please excuse typos.

 -Original Message-
 From: David Sinclair dsincl...@chariotsolutions.com
 Date: Mon, 6 Feb 2012 09:06:00
 To: common-user@hadoop.apache.org
 Cc: bejoy.had...@gmail.com
 Subject: Re: Can I write to an compressed file which is located in hdfs?

 Hi,

 You may want to have a look at the Flume project from Cloudera. I use it
 for writing data into HDFS.

 https://ccp.cloudera.com/display/SUPPORT/Downloads

 dave

 2012/2/6 Xiaobin She xiaobin...@gmail.com

  hi Bejoy ,
 
  thank you for your reply.
 
  actually I have set up a test cluster which has one namenode/jobtracker
  and two datanode/tasktracker nodes, and I have run a test on this cluster.
 
  I fetch the log file of one of our modules from the log collector
 machines
  by rsync, and then I use hive command line tool to load this log file
 into
  the hive warehouse which  simply copy the file from the local
 filesystem to
  hdfs.
 
  And I have run some analysis on these data with hive, all this run well.
 
  But now I want to avoid the fetch section which use rsync, and write the
  logs into hdfs files directly from the servers which generate these
 logs.
 
  And it seems easy to do this job if the file locate in the hdfs is not
  compressed.
 
  But how do I write or append logs to a file that is compressed and
 located
  in hdfs?
 
  Is this possible?
 
  Or is this an bad practice?
 
  Thanks!
 
 
 
  2012/2/6 bejoy.had...@gmail.com
 
   Hi
       If you have log files enough to become at least one block size in
 an
   hour. You can go ahead as
   - run a scheduled job every hour that compresses the log files for
 that
   hour and stores them on to hdfs (can use LZO or even Snappy to
 compress)
   - if your hive does more frequent analysis on this data store it as
   PARTITIONED BY (Date,Hour) . While loading into hdfs also follow a
   directory - sub dir structure. Once data is in hdfs issue a Alter
 Table
  Add
   Partition statement on corresponding hive table.
   -in Hive DDL use the appropriate Input format (Hive has some ApacheLog
   Input Format already)
  
  
   Regards
   Bejoy K S
  
   From handheld, Please excuse typos.
  
   -Original Message-
   From: Xiaobin She xiaobin...@gmail.com
   Date: Mon, 6 Feb 2012 16:41:50
   To: common-user@hadoop.apache.org; 佘晓彬xiaobin...@gmail.com
   Reply-To: common-user@hadoop.apache.org
   Subject: Re: Can I write to an compressed file which is located in
 hdfs?
  
   sorry, this sentence is wrong,
  
   I can't compress these logs every hour and then put them into hdfs.
  
   it should be
  
   I can compress these logs every hour and then put them into hdfs.
  
  
  
  
   2012/2/6 Xiaobin She xiaobin...@gmail.com
  
   
hi all,
   
I'm testing hadoop and hive, 

Re: Problem in reduce phase(critical)

2012-02-03 Thread Raj Vishwanathan
It says in the logs

org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection 
refused 


You have a network problem. Do you have a firewall enabled?

Raj

and please do not spam the list with the same message. 




 From: hadoop hive hadooph...@gmail.com
To: u...@hive.apache.org; common-user@hadoop.apache.org 
Sent: Friday, February 3, 2012 6:02 AM
Subject: Problem in reduce phase(critical)
 
On Fri, Feb 3, 2012 at 4:56 PM, hadoop hive hadooph...@gmail.com wrote:

 hey folks,

 i m getting this error while running mapreduce and these comes up in
 reduce phase..


 2012-02-03 16:41:19,780 WARN org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0 copy failed: 
 attempt_201201271626_5282_m_07_2 from hadoopdata3
 2012-02-03 16:41:19,954 WARN org.apache.hadoop.mapred.ReduceTask: 
 java.net.ConnectException: Connection refused
     at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
     at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
     at 
 sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1345)
     at java.security.AccessController.doPrivileged(Native Method)
     at 
 sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1339)
     at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:993)
     at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
     at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
     at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
     at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
 Caused by: java.net.ConnectException: Connection refused
     at java.net.PlainSocketImpl.socketConnect(Native Method)
     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
     at java.net.Socket.connect(Socket.java:519)
     at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
     at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
     at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
     at sun.net.www.http.HttpClient.init(HttpClient.java:233)
     at sun.net.www.http.HttpClient.New(HttpClient.java:306)
     at sun.net.www.http.HttpClient.New(HttpClient.java:323)
     at 
 sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:837)
     at 
 sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:778)
     at 
 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:703)
     at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1026)
     ... 4 more

 2012-02-03 16:41:19,995 INFO org.apache.hadoop.mapred.ReduceTask: Task 
 attempt_201201271626_5282_r_00_0: Failed fetch #4 from 
 attempt_201201271626_5282_m_07_2
 2012-02-03 16:41:19,995 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
 fetch map-output from attempt_201201271626_5282_m_07_2 even after 
 MAX_FETCH_RETRIES_PER_MAP retries...  reporting to the JobTracker
 2012-02-03 16:41:19,995 WARN org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0 adding host hadoopdata3 to penalty box, 
 next contact in 150 seconds
 2012-02-03 16:41:19,995 INFO org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0: Got 1 map-outputs from previous 
 failures
 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0 Need another 1 map output(s) where 0 is 
 already in progress
 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0 Scheduled 0 outputs (1 slow hosts and0 
 dup hosts)
 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: 
 Penalized(slow) Hosts:
 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: 
 hadoopdata3 Will be considered after: 99 seconds.
 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0 Need another 1 map output(s) where 0 is 
 already in progress
 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: 
 attempt_201201271626_5282_r_00_0 Scheduled 0 outputs (1 slow hosts and0 
 dup hosts)
 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: 
 Penalized(slow) Hosts:
 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: 
 hadoopdata3 Will be considered after: 39 seconds.






Re: SimpleKMeansCLustering - Failed to set permissions of path to 0700

2011-10-17 Thread Raj Vishwanathan
Can you run any map/reduce jobs, such as word count?
Raj

Sent from my iPad
Please excuse the typos. 

On Oct 17, 2011, at 5:18 PM, robpd robpodol...@yahoo.co.uk wrote:

 Hi
 
 I am new to Mahout and Hadoop. I'm currently trying to get the
 SimpleKMeansClustering example from the Mahout in Action book to work. I am
 running the whole thing from under cygwin from a sh script (in which I
 explicitly add the necessary jars to the classpath).
 
 Unfortunately I get...
 
 Exception in thread main java.io.IOException: Failed to set permissions of
 path: file:/tmp/hadoop-Rob/mapred/staging/Rob1823346078/.staging to 0700
 at
 org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:525)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:499)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:318)
 at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183)
 at
 org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:797)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Unknown Source)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:362)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:310)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:237)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
 at kmeans.SimpleKMeansClustering.main(Unknown Source)
 
 This is presumably because, although I am running under cygwin, Windows will
 not allow a change of privilege like this? I have done the following to
 address the problem, but without any success...
 
 a) Ensure I started Hadoop prior to running the program (start-all.sh)
 
 b) Edited the Hadoop conf file hdfs-site.xml to switch off the file
 permissions in HDFS
 
 property
 namedfs.permissions/name
 valuefalse/value
 /property
 
 c) I issued a hadoop fs -chmod +rwx -R /tmp to ensure that everyone was
 allowed to write to anything under tmp
 
 I'd be very grateful for some help here if you have some ideas. Sorry if I am
 being green about things. It does seem like there's lots to learn.
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SimpleKMeansCLustering-Failed-to-set-permissions-of-path-to-0700-tp3429867p3429867.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: performance normal?

2011-10-08 Thread Raj Vishwanathan
Really horrible performance

Sent from my iPad
Please excuse the typos. 

On Oct 8, 2011, at 12:12 AM, tom uno ltom...@gmail.com wrote:

 release 0.21.0
 Production System
 20 nodes
 read 100 megabits per second
 write 10 megabits per second
 performance normal?


Re: Maintaining map reduce job logs - The best practices

2011-09-23 Thread Raj Vishwanathan
Bejoy

You can find the job-specific logs in two places. The first one is in the hdfs
output directory. The second place is under $HADOOP_HOME/logs/history
($HADOOP_HOME/logs/history/done).

Both these places have the config file and the job logs for each submitted job.


Sent from my iPad
Please excuse the typos. 

On Sep 23, 2011, at 12:52 AM, Bejoy KS bejoy.had...@gmail.com wrote:

 Hi All
 I do have a query here on maintaining Hadoop map-reduce logs.
 By default the logs appear on the respective task tracker nodes, which you can
 easily drill down to from the job tracker web UI in case of any failure (which
 is what I have been doing till now). Now I need to get to the next level and
 manage the logs corresponding to individual jobs. In my log I'm dumping some key
 parameters with respect to my business which could be used for business-level
 debugging/analysis later if required. For this
 purpose, I need a central log file corresponding to a job (not many files,
 i.e. one per task tracker, because as the cluster grows the number of log files
 corresponding to a job also increases). A single point of reference makes
 things handy for analysis by any business folks.
     I think it would be a generic requirement of any enterprise application
 to manage and archive the logs of each job execution. Hence there would
 definitely be best practices and standards identified and maintained by
 most of the core Hadoop enterprise users. Could you please help me out by
 sharing some of the better options for log management for Hadoop map-reduce
 logs? It would greatly help me choose the best practice that suits my
 environment and application needs.
 
 Thank You
 
 Regards
 Bejoy.K.S