Re: Measuring Shuffle time for MR job
You can extract the shuffle time from the job log. Take a look at https://github.com/rajvish/hadoop-summary Raj

From: Bertrand Dechoux decho...@gmail.com To: common-user@hadoop.apache.org Sent: Monday, August 27, 2012 12:57 AM Subject: Re: Measuring Shuffle time for MR job
Shuffle time is considered as part of the reduce step. Without reduce, there is no need for shuffling. One way to measure it would be using the full reduce time with a '/dev/null' reducer. I am not aware of any direct way to measure it. Regards Bertrand

On Mon, Aug 27, 2012 at 8:18 AM, praveenesh kumar praveen...@gmail.com wrote:
Is there a way to know the total shuffle time of a map-reduce job - I mean some command or output that can tell that? I want to measure total map, total shuffle and total reduce time for my MR job -- how can I achieve it? I am using hadoop 0.20.205 Regards, Praveenesh
-- Bertrand Dechoux
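A minimal sketch of Bertrand's '/dev/null' reducer idea, written against the org.apache.hadoop.mapreduce API; the class name and the Text/Text types are illustrative assumptions, not code from the thread:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Consumes every shuffled record and emits nothing, so the reduce-phase time
// of this job approximates shuffle + merge time alone.
public class DevNullReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text ignored : values) {
            // drain the iterator; write no output
        }
    }
}

Comparing the real job's reduce time against a run with this null reducer then gives a rough map / shuffle / reduce split.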
Re: doubt about reduce tasks and block writes
Harsh I did leave an escape route open with a bit about corner cases :-) Anyway, I agree that HDFS has no notion of block 0. I just meant that if dfs.replication is 1 then, under normal circumstances :-), no blocks of the output file will be written to node A. Raj

- Original Message - From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com Cc: Sent: Saturday, August 25, 2012 4:02 AM Subject: Re: doubt about reduce tasks and block writes
Raj's almost right. In times of high load or space fillup on a local DN, the NameNode may decide to instead pick a non-local DN for replica-writing. In this way, the Node A may get a copy 0 of a replica from a task. This is per the default block placement policy. P.s. Note that HDFS hardly makes any differences between replicas, hence there is no hard-concept of a copy 0 or copy 1 block; at the NN level, it treats all DNs in the pipeline equally and the same for replicas.

On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan rajv...@yahoo.com wrote:
But since node A has no TT running, it will not run map or reduce tasks. When the reducer node writes the output file, the first block will be written on the local node and never on node A. So, to answer the question, Node A will contain copies of blocks of all output files. It won't contain the copy 0 of any output file. I am reasonably sure about this, but there could be corner cases in case of node failure and such like! I need to look into the code. Raj

From: Marc Sturlese marc.sturl...@gmail.com To: hadoop-u...@lucene.apache.org Sent: Friday, August 24, 2012 1:09 PM Subject: doubt about reduce tasks and block writes
Hey there, I have a doubt about reduce tasks and block writes. Does a reduce task always first write to HDFS on the node where it is placed? (and then these blocks would be replicated to other nodes) If yes: if I have a cluster of 5 nodes, 4 of them running DN and TT and one (node A) just running DN, then when running MR jobs, map tasks would never read from node A? This would be because maps have data locality and, if the reduce tasks write first to the node where they live, one replica of the block would always be in a node that has a TT. Node A would just contain blocks created from replication by the framework, as no reduce task would write there directly. Is this correct? Thanks in advance
-- View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
-- Harsh J
Re: Number of Maps running more than expected
You probably have speculative execution on. Extra map and reduce tasks are run in case some of them fail or run slowly. Raj Sent from my iPad Please excuse the typos.

On Aug 16, 2012, at 11:36 AM, in.abdul in.ab...@gmail.com wrote:
Hi Gaurav, The number of maps does not depend on the number of blocks. It really depends on the number of input splits. If you have 100 GB of data and 10 splits, then you will see only 10 maps. Please correct me if I am wrong. Thanks and regards, Syed abdul kather

On Aug 16, 2012 7:44 PM, Gaurav Dasgupta [via Lucene] ml-node+s472066n4001631...@n3.nabble.com wrote:
Hi users, I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all the 12 nodes and 1 node running the Job Tracker). In order to perform a WordCount benchmark test, I did the following:
- Executed RandomTextWriter first to create 100 GB data (note that I have changed the test.randomtextwrite.total_bytes parameter only; the rest are kept default).
- Next, executed the WordCount program for that 100 GB dataset.
The Block Size in hdfs-site.xml is set as 128 MB. Now, according to my calculation, the total number of Maps to be executed by the wordcount job should be 100 GB / 128 MB, or 102400 MB / 128 MB = 800. But when I am executing the job, it is running a total of 900 Maps, i.e., 100 extra. So, why this extra number of Maps? My job is completing successfully without any error, though. Again, if I don't execute the RandomTextWriter job to create data for my wordcount, but rather put my own 100 GB text file in HDFS and run WordCount, I then see that the number of Maps is equivalent to my calculation, i.e., 800. Can anyone tell me why this odd behaviour of Hadoop regarding the number of Maps for WordCount only when the dataset is generated by RandomTextWriter? And what is the purpose of these extra Maps? Regards, Gaurav Dasgupta
- THANKS AND REGARDS, SYED ABDUL KATHER
-- View this message in context: http://lucene.472066.n3.nabble.com/Number-of-Maps-running-more-than-expected-tp4001631p4001683.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
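If speculative execution is indeed the cause, it can be switched off per job. A sketch using the old org.apache.hadoop.mapred API of that era; the driver class name is a made-up placeholder:

import org.apache.hadoop.mapred.JobConf;

// In the job driver, before submitting the job:
JobConf conf = new JobConf(WordCountDriver.class); // hypothetical driver class
conf.setMapSpeculativeExecution(false);            // no speculative (backup) map attempts
conf.setReduceSpeculativeExecution(false);         // no speculative reduce attempts

With speculation off, the number of map attempts should match the split count exactly; the trade-off is that one genuinely slow node can then stretch out the job's tail.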
Re: Multinode cluster only recognizes 1 node
Sean Can you paste the namenode and jobtracker logs? It could be something as simple as disabling the firewall. Raj

From: Barry, Sean F sean.f.ba...@intel.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Tuesday, July 31, 2012 1:23 PM Subject: RE: Multinode cluster only recognizes 1 node
After a good number of hours trying to tinker with my setup I am still only seeing results for 1 node, not 2. This is the tutorial I followed: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ After executing ./start-all.sh:
starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-master.out
slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-slave.out
master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-master.out
master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-master.out
starting jobtracker, logging to /usr/local/hadoop/logs/hadoop-hduser-jobtracker-master.out
slave: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-slave.out
master: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-master.out

-Original Message- From: syed kather [mailto:in.ab...@gmail.com] Sent: Thursday, July 26, 2012 6:06 PM To: common-user@hadoop.apache.org Subject: Re: Multinode cluster only recognizes 1 node
Can you paste the output when you execute start-all.sh? When you ssh to the slave, is it working fine?

On Jul 27, 2012 4:50 AM, Barry, Sean F sean.f.ba...@intel.com wrote:
Hi, I just set up a 2 node POC cluster and I am currently having an issue with it. I ran a wordcount MR test on my cluster to see if it was working and noticed that the Web UI at localhost:50030 showed that I only have 1 live node. I followed the tutorial step by step and I cannot seem to figure out my problem. When I ran start-all.sh all of the daemons on my master node and my slave node started up perfectly fine. If you have any suggestions please let me know. -Sean
Re: Merge Reducers Output
Is there a requirement for the final reduce file to be sorted? If not, wouldn't a map-only job (plus a combiner) and a merge-only job provide the answer? Raj

From: Michael Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org Sent: Tuesday, July 31, 2012 5:24 AM Subject: Re: Merge Reducers Output
You really don't want to run a single reducer unless you know that you don't have a lot of mappers. As long as the output data types and structure are the same as the input, you can run your code as the combiner, and then run it again as the reducer. Problem solved with one or two lines of code. If your input and output don't match, then you can use the existing code as a combiner, and then write a new reducer. It could as easily be an identity reducer too. (Don't know the exact problem.) So here's a silly question. Why wouldn't you want to run a combiner?

On Jul 31, 2012, at 12:08 AM, Jay Vyas jayunit...@gmail.com wrote:
It's not clear to me that you need custom input formats. 1) getmerge might work, or 2) simply run a SINGLE-reducer job (have mappers output static final int key=1, or specify numReducers=1). In this case, only one reducer will be called, and it will read through all the values.

On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS bejoy.had...@gmail.com wrote:
Hi Why not use 'hadoop fs -getmerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume? This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos.

-Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output
Why not use a combiner?

On Jul 30, 2012, at 7:59 PM, Mike S wrote:
Like asked several times before, I need to merge my reducers' output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper reads a complete file in full into memory and outputs that, and then only one reducer has to merge them together. To do so, I had to write a custom FileInputFormat reader that reads the complete file into memory, and then another custom FileOutputFormat to append each reducer item's bytes together. This is how my mapper and reducer look:

public static class MapClass extends Mapper<NullWritable, BytesWritable, NullWritable, BytesWritable> {
    @Override
    public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }
}

public static class Reduce extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable> {
    @Override
    public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        for (BytesWritable value : values) {
            context.write(NullWritable.get(), value);
        }
    }
}

I still have to have one reducer and that is a bottleneck. Please note that I must do this merging, as the users of my MR job are outside my hadoop environment and need the result as one file. Is there a better way to merge reducers' output files?
-- Jay Vyas MMSB/UCHC
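The -getmerge route also exists programmatically, which is handy when the copy must happen from Java code. A sketch only; the paths are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
FileSystem local = FileSystem.getLocal(conf);
// Concatenate every part file under the HDFS job output directory into one local file.
FileUtil.copyMerge(hdfs, new Path("/user/out"), local, new Path("/tmp/merged.out"),
        false /* keep the source files */, conf, null /* no separator string */);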
Re: Datanode error
Could also be due to network issues; the number of sockets or threads available could be too low. Raj

From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org Sent: Friday, July 20, 2012 9:06 AM Subject: Re: Datanode error
Pablo, These all seem to be timeouts from clients when they wish to read a block and drops from clients when they try to write a block. I wouldn't think of them as critical errors. Aside of being worried that a DN is logging these, are you noticing any usability issue in your cluster? If not, I'd simply blame this on stuff like speculative tasks, region servers, general HDFS client misbehavior, etc. Please do also share if you're seeing an issue that you think is related to these log messages.

On Fri, Jul 20, 2012 at 6:37 PM, Pablo Musa pa...@psafe.com wrote:
Hey guys, I have a cluster with 11 nodes (1 NN and 10 DNs) which is running and working. However my datanodes keep having the same errors, over and over. I googled the problems and tried different flags (ex: -XX:MaxDirectMemorySize=2G) and different configs (xceivers=8192) but could not solve it. Does anyone know what the problem is and how I can solve it? (the stacktrace is at the end) I am running: Java 1.7, Hadoop 0.20.2, Hbase 0.90.6, Zoo 3.3.5.
% top - shows low load average (6% most of the time, up to 60%), already considering the number of cpus
% vmstat - shows no swap at all
% sar - shows 75% idle cpu in the worst case
Hope you guys can help me. Thanks in advance, Pablo

2012-07-20 00:03:44,455 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /DN01:50010, dest: /DN01:43516, bytes: 396288, op: HDFS_READ, cliID: DFSClient_hb_rs_DN01,60020,1342734302945_1342734303427, offset: 54956544, srvID: DS-798921853-DN01-50010-1328651609047, blockid: blk_914960691839012728_14061688, duration: 480061254006
2012-07-20 00:03:44,455 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(DN01:50010, storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, ipcPort=50020):Got exception while serving blk_914960691839012728_14061688 to /DN01:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/DN01:50010 remote=/DN01:43516]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
2012-07-20 00:03:44,455 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(DN01:50010, storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/DN01:50010 remote=/DN01:43516]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
2012-07-20 00:12:11,949 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_4602445008578088178_5707787
2012-07-20 00:12:11,962 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8916344806514717841_14081066 received exception java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DN01:36634 remote=/DN03:50010]
2012-07-20 00:12:11,962 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(DN01:50010, storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch :
Re: Error: Too Many Fetch Failures
You probably have a very low somaxconn parameter (default CentOS has it at 128, if I remember correctly). You can check the value under /proc/sys/net/core/somaxconn. Can you also check the value of ulimit -n? It could be low. Raj

From: Ellis H. Wilson III el...@cse.psu.edu To: common-user@hadoop.apache.org Sent: Tuesday, June 19, 2012 12:32 PM Subject: Re: Error: Too Many Fetch Failures
On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote: Replies/more questions inline.
I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each having solely a single hard disk. I am getting the following error repeatably for the TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until this error pops up between 64% and 70% completion. This doesn't occur for every execution of the benchmark, as about one out of four times that I run the benchmark it does run to completion (TeraValidate included).
How many containers are you running per node?
Per my attached config files, I specify that yarn.nodemanager.resource.memory-mb = 3072, and the default /seems/ to be set at 1024 MB for maps and reducers, so I have 3 containers running per node. I have verified that this indeed is the case in the web client. Three of these 1 GB slots in the cluster appear to be occupied by something else during the execution of TeraSort, so I specify that TeraGen create .5 TB using 441 maps (3 waves * (50 nodes * 3 container slots - 3 occupied slots)), and TeraSort to use 147 reducers. This seems to give me the guarantees I had with Hadoop 1.0 that each node gets an equal number of reducers, and my job doesn't drag on due to straggler reducers.
Clearly maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? That may have logs related to why reducers are not able to fetch map-outputs. It is possible that because you have only one disk per node, some of these nodes have bad or nonfunctional disks, thereby causing fetch failures.
I will rerun and report the exact error messages from the NodeManagers. Can you give me more exacting advice on collecting logs of this sort, for as I mentioned I'm new to doing so with the new version of Hadoop? I have been looking in /tmp/logs and hadoop/logs, but perhaps there is somewhere else to look as well? Last, I am certain this is not related to failing disks, as this exact error occurs at much higher frequencies when I run Hadoop on a NAS box, which is the core of my research at the moment. Nevertheless, I posted to this list instead of Dev as this was on vanilla CentOS-5.5 machines using just the HDDs within each, and therefore should be a highly typical setup. In particular, I see these errors coming from numerous nodes all at once, and the subset of nodes giving the problems is not repeatable from one run to the next, though the resulting error is.
If that is the case, either you can offline these nodes or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate these failures; the default is 10. There are some other tweaks which I can tell you about if you can find more details from your logs.
I'd prefer not to bump up maxfetchfailures, and would rather simply fix the issue that is causing the fetch to fail in the first place. This isn't a large cluster, having only 50 nodes, nor are the links (1gig) or storage capabilities (1 sata drive) great or strange relative to any normal installation. I have to assume here that I've mis-configured something :(. Best, ellis
Re: Map works well, but Reduce failed
Most probably you have a network problem. Check your hostname and IP address mapping. Raj

From: Yongwei Xing jdxyw2...@gmail.com To: common-user@hadoop.apache.org Sent: Thursday, June 14, 2012 10:15 AM Subject: Map works well, but Reduce failed
Hi all I run a simple sort program; however, I get an error like the one below.
12/06/15 01:13:17 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_000001_1&filter=stdout
12/06/15 01:13:18 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_000001_1&filter=stderr
12/06/15 01:13:20 INFO mapred.JobClient: map 50% reduce 0%
12/06/15 01:13:23 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 01:14:19 INFO mapred.JobClient: Task Id : attempt_201206150102_0002_m_000000_2, Status : FAILED Too many fetch-failures
12/06/15 01:14:20 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_000000_2&filter=stdout
Does anyone know the reason and how to resolve it? Best Regards, -- Welcome to my ET Blog http://www.jdxyw.com
Re: Map/Reduce Tasks Fails
From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org Sent: Tuesday, May 22, 2012 7:13 AM Subject: Re: Map/Reduce Tasks Fails
Sandeep, Is the same DN 10.0.25.149 reported across all failures? And do you notice any machine patterns when observing the failed tasks (i.e. are they clumped on any one or a few particular TTs repeatedly)?

On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote:
Hi, We have a 5-node cdh3u4 cluster running. When I try to do teragen/terasort, some of the map tasks are Failed/Killed and the logs show a similar error on all machines:
2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.0.25.149:50010 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835 remote=/10.0.25.149:50010]
2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7260720956806950576_1825
2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.0.25.149:50010
2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201205211504_0007_m_000016_1.
Are these kinds of errors common? At least 1 map task is failing due to the above reason on all the machines. We are using 24 mappers for teragen. For us it took 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers and 17 failed / 8 killed task attempts, and 24 min 10 sec for 5 GB of data with 24 mappers and 9 killed task attempts. The cluster works well for small datasets.
-- Harsh J
Re: Map/Reduce Tasks Fails
What kind of storage is attached to the data nodes? This kind of error can happen when the CPU is really busy with I/O or interrupts. Can you run top or dstat on some of the data nodes to see how the system is performing? Raj

From: Sandeep Reddy P sandeepreddy.3...@gmail.com To: common-user@hadoop.apache.org Sent: Tuesday, May 22, 2012 7:23 AM Subject: Re: Map/Reduce Tasks Fails
Task Trackers:
tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (host hadoop2.liaisondevqa.local): 0 running tasks, max map tasks 6, max reduce tasks 2, task failures 22, directory failures 0, node health status N/A, 0 seconds since node last healthy, 93 total tasks since start (60 succeeded), 59 last day (28 succeeded), 64 last hour (38 succeeded), 0 seconds since heartbeat
tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (host hadoop4.liaisondevqa.local): 0 running tasks, max map tasks 6, max reduce tasks 2, task failures 19, directory failures 0, node health status N/A, 0 seconds since node last healthy, 91 total tasks since start (59 succeeded), 65 last day (33 succeeded), 36 last hour (33 succeeded), 0 seconds since heartbeat
tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (host hadoop5.liaisondevqa.local): 1 running task, max map tasks 6, max reduce tasks 2, task failures 21, directory failures 0, node health status N/A, 0 seconds since node last healthy, 83 total tasks since start (47 succeeded), 69 last day (35 succeeded), 45 last hour (19 succeeded), 0 seconds since heartbeat
tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (host hadoop3.liaisondevqa.local): 0 running tasks, max map tasks 6, max reduce tasks 2, task failures 18, directory failures 0, node health status N/A, 0 seconds since node last healthy, 87 total tasks since start (55 succeeded), 55 last day (28 succeeded), 57 last hour (34 succeeded), 0 seconds since heartbeat
Highest Failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22 failures
Re: High load on datanode startup
Darrell Are the new dn, nn, and mapred directories on the same physical disk? Nothing on NFS, correct? Could you be having some hardware issue? Any clue in /var/log/messages or dmesg? A non-responsive system indicates a CPU that is really busy either doing something or waiting for something, and the fact that it happens only on some nodes indicates a local problem. Raj

From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Cc: Raj Vishwanathan rajv...@yahoo.com Sent: Thursday, May 10, 2012 3:57 AM Subject: Re: High load on datanode startup
On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon t...@cloudera.com wrote:
That's real weird. If you can reproduce this after a reboot, I'd recommend letting the DN run for a minute, and then capturing a jstack <pid of dn> as well as the output of top -H -p <pid of dn> -b -n 5, and send it to the list.
What I did after the reboot this morning was to move my dn, nn, and mapred directories out of the way, create a new one, format it, and restart the node; it's now happy. I'll try moving the directories back later and do the jstack as you suggest.
What JVM/JDK are you using? What OS version?
root@pl446:/# dpkg --get-selections | grep java
java-common install
libjaxp1.3-java install
libjaxp1.3-java-gcj install
libmysql-java install
libxerces2-java install
libxerces2-java-gcj install
sun-java6-bin install
sun-java6-javadb install
sun-java6-jdk install
sun-java6-jre install
root@pl446:/# java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
root@pl446:/# cat /etc/issue
Debian GNU/Linux 6.0 \n \l
-Todd

On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor darrell.tay...@gmail.com wrote:
On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan rajv...@yahoo.com wrote:
The picture is either too small or too pixelated for my eyes :-)
There should be a zoom option in the top right of the page that allows you to view it full size.
Can you login to the box and send the output of top? If the system is unresponsive, it has to be something more than an unbalanced hdfs cluster, methinks.
Sorry, I'm unable to login to the box, it's completely unresponsive.
Raj

From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com Sent: Wednesday, May 9, 2012 2:40 PM Subject: Re: High load on datanode startup
On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan rajv...@yahoo.com wrote:
When you say 'load', what do you mean? CPU load or something else?
I mean in the unix sense of load average, i.e. top would show a load of (currently) 376. Looking at Ganglia stats for the box it's not CPU load as such; the graphs show actual CPU usage as 30%, but the number of running processes is simply growing in a linear manner - screen shot of ganglia page here: https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
Raj

From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Sent: Wednesday, May 9, 2012 9:52 AM Subject: High load on datanode startup
Hi, I wonder if someone could give some pointers with a problem I'm having? I have a 7 machine cluster setup for testing and we have been pouring data into it for a week without issue, have learnt several things along the way and solved all the problems up to now by searching online, but now I'm stuck. One of the data nodes decided to have a load of 70+ this morning; stopping datanode and tasktracker brought it back to normal, but every time I start the datanode again the load shoots through the roof, and all I get in the logs is:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = pl464/10.20.16.64
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.2-cdh3u3
STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze -/
2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Re: High load on datanode startup
When you say 'load', what do you mean? CPU load or something else? Raj

From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Sent: Wednesday, May 9, 2012 9:52 AM Subject: High load on datanode startup
Hi, I wonder if someone could give some pointers with a problem I'm having? I have a 7 machine cluster setup for testing and we have been pouring data into it for a week without issue, have learnt several things along the way and solved all the problems up to now by searching online, but now I'm stuck. One of the data nodes decided to have a load of 70+ this morning; stopping datanode and tasktracker brought it back to normal, but every time I start the datanode again the load shoots through the roof, and all I get in the logs is:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = pl464/10.20.16.64
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.2-cdh3u3
STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze -/
2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Nothing else. The load seems to max out only 1 of the CPUs, but the machine becomes *very* unresponsive. Anybody got any pointers of things I can try? Thanks Darrell.
Re: High load on datanode startup
The picture is either too small or too pixelated for my eyes :-) Can you login to the box and send the output of top? If the system is unresponsive, it has to be something more than an unbalanced hdfs cluster, methinks. Raj

From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com Sent: Wednesday, May 9, 2012 2:40 PM Subject: Re: High load on datanode startup
On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan rajv...@yahoo.com wrote:
When you say 'load', what do you mean? CPU load or something else?
I mean in the unix sense of load average, i.e. top would show a load of (currently) 376. Looking at Ganglia stats for the box it's not CPU load as such; the graphs show actual CPU usage as 30%, but the number of running processes is simply growing in a linear manner - screen shot of ganglia page here: https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
Raj

From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Sent: Wednesday, May 9, 2012 9:52 AM Subject: High load on datanode startup
Hi, I wonder if someone could give some pointers with a problem I'm having? I have a 7 machine cluster setup for testing and we have been pouring data into it for a week without issue, have learnt several things along the way and solved all the problems up to now by searching online, but now I'm stuck. One of the data nodes decided to have a load of 70+ this morning; stopping datanode and tasktracker brought it back to normal, but every time I start the datanode again the load shoots through the roof, and all I get in the logs is:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = pl464/10.20.16.64
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.2-cdh3u3
STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze -/
2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Nothing else. The load seems to max out only 1 of the CPUs, but the machine becomes *very* unresponsive. Anybody got any pointers of things I can try? Thanks Darrell.
Re: Reduce Hangs at 66%
Keith What is the output of ulimit -n? Your value for the number of open files is probably too low. Raj

From: Keith Thompson kthom...@binghamton.edu To: common-user@hadoop.apache.org Sent: Thursday, May 3, 2012 4:33 PM Subject: Re: Reduce Hangs at 66%
I am not sure about ulimits, but I can answer the rest. It's a Cloudera distribution of Hadoop 0.20.2. The HDFS has 9 TB free. In the reduce step, I am taking keys in the form of (gridID, date), each with a value of 1. The reduce step just sums the 1's as the final output value for the key (it's counting how many people were in the gridID on a certain day). I have been running other more complicated jobs with no problem, so I'm not sure why this one is being peculiar. This is the command I used to execute the program from the command line (the source is a file on the hdfs):
hadoop jar <jarfile> <driver> <source> /thompson/outputDensity/density1
The program then executes the map and gets to 66% of the reduce, then stops responding. The cause of the error seems to be:
Error from attempt_201202240659_6432_r_000000_1: java.io.IOException: The temporary job-output directory hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary doesn't exist!
I don't understand what the _temporary is. I am assuming it's something Hadoop creates automatically.

On Thu, May 3, 2012 at 5:02 AM, Michel Segel michael_se...@hotmail.com wrote:
Well... Lots of information but still missing some of the basics... Which release and version? What are your ulimits set to? How much free disk space do you have? What are you attempting to do? Stuff like that. Sent from a remote device. Please excuse any typos... Mike Segel

On May 2, 2012, at 4:49 PM, Keith Thompson kthom...@binghamton.edu wrote:
I am running a task which gets to 66% of the Reduce step and then hangs indefinitely. Here is the log file (I apologize if I am putting too much here but I am not exactly sure what is relevant):
2012-05-02 16:42:52,975 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6433_r_000000_0' to tip task_201202240659_6433_r_000000, for tracker 'tracker_analytix7:localhost.localdomain/127.0.0.1:56515'
2012-05-02 16:42:53,584 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201202240659_6433_m_000001_0' has completed task_201202240659_6433_m_000001 successfully.
2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_000000_0: Task attempt_201202240659_6432_r_000000_0 failed to report status for 1800 seconds. Killing!
2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_000000_0'
2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201202240659_6432_r_000000_0' to tip task_201202240659_6432_r_000000, for tracker 'tracker_analytix4:localhost.localdomain/127.0.0.1:44204'
2012-05-02 17:00:48,763 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_000000_0'
2012-05-02 17:00:48,957 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6432_r_000000_1' to tip task_201202240659_6432_r_000000, for tracker 'tracker_analytix5:localhost.localdomain/127.0.0.1:59117'
2012-05-02 17:00:56,559 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_000000_1: java.io.IOException: The temporary job-output directory hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
2012-05-02 17:00:59,903 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_000000_1'
2012-05-02 17:00:59,906 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6432_r_000000_2' to tip task_201202240659_6432_r_000000, for tracker 'tracker_analytix3:localhost.localdomain/127.0.0.1:39980'
2012-05-02 17:01:07,200 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_000000_2: java.io.IOException: The temporary job-output directory
Re: hadoop streaming and a directory containing large number of .tgz files
Sunil You could use identity mappers, a single identity reducer, and no output compression. Raj

From: Sunil S Nandihalli sunil.nandiha...@gmail.com To: common-user@hadoop.apache.org Sent: Tuesday, April 24, 2012 7:01 AM Subject: Re: hadoop streaming and a directory containing large number of .tgz files
Sorry for re-forwarding this email. I was not sure if it actually got through, since I just got the confirmation regarding my membership of the mailing list. Thanks, Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli sunil.nandiha...@gmail.com wrote:
Hi Everybody, I am a newbie to hadoop. I have about 40K .tgz files, each of approximately 3 MB. I would like to process this as if it were a single large file formed by
cat list-of-files | parallel 'tar -Oxvf {} | sed 1d' > output.txt
How can I achieve this using hadoop-streaming or some other similar library? Thanks, Sunil.
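A sketch of Raj's suggestion in the old org.apache.hadoop.mapred API; the driver class name is hypothetical, and the input format that actually extracts records from the .tgz members is deliberately left out, since that part is not settled in the thread:

import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

JobConf conf = new JobConf(TgzConcat.class);     // hypothetical driver class
conf.setMapperClass(IdentityMapper.class);       // pass records through unchanged
conf.setReducerClass(IdentityReducer.class);     // single identity reducer concatenates everything
conf.setNumReduceTasks(1);                       // one reducer => one output file
FileOutputFormat.setCompressOutput(conf, false); // keep the merged output uncompressed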
Re: Hive Thrift help
To verify that a server is running on a port (10,000 in this case) and to ensure that there are no firewall issues, run telnet servername 10000. The connection should succeed. Raj

From: Edward Capriolo edlinuxg...@gmail.com To: common-user@hadoop.apache.org Sent: Monday, April 16, 2012 2:55 PM Subject: Re: Hive Thrift help
You can NOT connect to Hive Thrift with a web browser to confirm its status. Thrift is Thrift, not HTTP. But you are right to say HiveServer does not produce any output by default. If netstat -nl | grep 10000 shows the port, it is up.

On Mon, Apr 16, 2012 at 5:18 PM, Rahul Jain rja...@gmail.com wrote:
I am assuming you read thru: https://cwiki.apache.org/Hive/hiveserver.html The server comes up on port 10,000 by default; did you verify that it is actually listening on the port? You can also connect to the hive server using a web browser to confirm its status. -Rahul

On Mon, Apr 16, 2012 at 1:53 PM, Michael Wang michael.w...@meredith.com wrote:
We need to connect to HIVE from Microstrategy reports, and it requires the Hive Thrift server. But I tried to start it, and it just hangs as below.
# hive --service hiveserver
Starting Hive Thrift Server
Any ideas? Thanks, Michael
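For hosts without telnet installed, a tiny Java probe does the same check; "servername" and the port are the values assumed in the thread:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) throws IOException {
        Socket s = new Socket();
        // Throws ConnectException (or times out) if nothing is listening on servername:10000.
        s.connect(new InetSocketAddress("servername", 10000), 5000);
        System.out.println("HiveServer port is open");
        s.close();
    }
}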
Re: Is TeraGen's generated data deterministic?
David Since the data generation and sorting are different Hadoop jobs, you can generate the data once and sort the same data as many times as you want. I don't think TeraGen is deterministic (or rather, the keys are random but the text is deterministic, if I remember correctly). Raj

From: David Erickson halcyon1...@gmail.com To: common-user@hadoop.apache.org Sent: Saturday, April 14, 2012 1:53 PM Subject: Is TeraGen's generated data deterministic?
Hi we are doing some benchmarking of some of our infrastructure and are using TeraGen/TeraSort to do the benchmarking. I am wondering whether the data generated by TeraGen is deterministic, in that if I repeat the same experiment multiple times with the same configuration options it will continue to generate and sort the exact same data? And if not, is there an easy mod to make this happen? Thanks! David
Re: Map Reduce Job Help
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

From: hellooperator silversl...@gmail.com To: core-u...@hadoop.apache.org Sent: Wednesday, April 11, 2012 11:15 AM Subject: Map Reduce Job Help
Hello, I'm just starting out with Hadoop and writing some Map Reduce jobs. I was looking for help on writing an MR job in Python that allows me to take some emails and put them into HDFS, so I can search on the text or attachments of the emails. Thank you!
-- View this message in context: http://old.nabble.com/Map-Reduce-Job-Help-tp33670645p33670645.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: opensuse 12.1
Lots of people seem to start with this: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ Raj

From: Barry, Sean F sean.f.ba...@intel.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Wednesday, April 4, 2012 9:12 AM Subject: FW: opensuse 12.1

-Original Message- From: Barry, Sean F [mailto:sean.f.ba...@intel.com] Sent: Wednesday, April 04, 2012 9:10 AM To: common-user@hadoop.apache.org Subject: opensuse 12.1
What is the best way to install Hadoop on openSUSE 12.1 for a small two-node cluster? -SB
Re: data distribution in HDFS
Stijn, The first block of the data is always stored on the local node. Assuming that you have a replication factor of 3, the node that generates the data will get about 10GB of data and the other 20GB will be distributed among the other nodes. Raj

From: Stijn De Weirdt stijn.dewei...@ugent.be To: common-user@hadoop.apache.org Sent: Monday, April 2, 2012 9:54 AM Subject: data distribution in HDFS
hi all, i've just started to play around with hdfs+mapred. i'm currently playing with teragen/sort/validate to see if i understand it all. the test setup involves 5 nodes that all are tasktracker and datanode, and one node that is also jobtracker and namenode on top of that (this one node is running both the namenode hadoop process and the datanode process). when i do the teragen run, the data is not distributed equally over all nodes. the node that is also namenode gets a bigger portion of all the data (as seen by df on the nodes and by using dfsadmin -report). i also get this distribution when i run the TestDFSIO write test (50 files of 1GB). i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's actually quite a bit more.) 4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so this one datanode is seen as 2 nodes. when i do ls on the filesystem, i see that teragen created 250MB files; the current hdfs blocksize is 64MB. is there a reason why one datanode is preferred over the others? it is annoying, since the terasort output behaves the same and i can't use the full hdfs space for testing that way. also, since more IO comes to this one node, the performance isn't really balanced. many thanks, stijn
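A quick sanity check on the volumes, assuming dfs.replication is at its default of 3 (the thread never states the value): 100M rows x 100 bytes = 10GB of logical data, and 3 replicas make roughly 3 x 10GB = 30GB of raw blocks. That is much closer to the reported usage of 4 x ~4.5GB + 9.4GB ≈ 27GB than the 10GB figure, which would explain the "quite a bit more" observation independently of the first-replica placement skew Raj describes.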
Re: data distribution in HDFS
AFAIK there is no way to disable this feature. This is an optimization. It happens because, in your case, the node generating the data is also a data node. Raj

From: Stijn De Weirdt stijn.dewei...@ugent.be To: common-user@hadoop.apache.org Sent: Monday, April 2, 2012 12:18 PM Subject: Re: data distribution in HDFS
thanks serge. is there a way to disable this feature (i.e. place the first block always on the local node)? and is this because the local node is a datanode? or is there always a local node with data transfers? many thanks, stijn

The local node is the node you are copying data from, if, let's say, you are using the -copyFromLocal option. Regards Serge

On 4/2/12 11:53 AM, Stijn De Weirdt stijn.dewei...@ugent.be wrote:
hi raj, what is a local node? is it relative to the tasks that are started? stijn

On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
Stijn, The first block of the data is always stored on the local node. Assuming that you have a replication factor of 3, the node that generates the data will get about 10GB of data and the other 20GB will be distributed among the other nodes. Raj

From: Stijn De Weirdt stijn.dewei...@ugent.be To: common-user@hadoop.apache.org Sent: Monday, April 2, 2012 9:54 AM Subject: data distribution in HDFS
hi all, i've just started to play around with hdfs+mapred. i'm currently playing with teragen/sort/validate to see if i understand it all. the test setup involves 5 nodes that all are tasktracker and datanode, and one node that is also jobtracker and namenode on top of that (this one node is running both the namenode hadoop process and the datanode process). when i do the teragen run, the data is not distributed equally over all nodes. the node that is also namenode gets a bigger portion of all the data (as seen by df on the nodes and by using dfsadmin -report). i also get this distribution when i run the TestDFSIO write test (50 files of 1GB). i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's actually quite a bit more.) 4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so this one datanode is seen as 2 nodes. when i do ls on the filesystem, i see that teragen created 250MB files; the current hdfs blocksize is 64MB. is there a reason why one datanode is preferred over the others? it is annoying, since the terasort output behaves the same and i can't use the full hdfs space for testing that way. also, since more IO comes to this one node, the performance isn't really balanced. many thanks, stijn
Re: How to modify hadoop-wordcount example to display File-wise results.
Aaron You can get the details of how much data each mapper processed, and on which node (IP address, actually!), from the job logs. Raj

From: Ajay Srivastava ajay.srivast...@guavus.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Cc: core-u...@hadoop.apache.org core-u...@hadoop.apache.org Sent: Thursday, March 29, 2012 5:57 PM Subject: Re: How to modify hadoop-wordcount example to display File-wise results.
Hi Aaron, I guess that it can be done by using counters. You can define a counter for each node in your cluster and then, in the map method, increment a node-specific counter by checking either the hostname or the IP address. It's not a very good solution, as you will need to modify your code whenever a node is added to or removed from the cluster, and there will be as many if conditions in the code as there are nodes. You can try this out if you do not find a cleaner solution. I wish that this counter were part of the predefined counters. Regards, Ajay Srivastava

On 30-Mar-2012, at 12:49 AM, aaron_v wrote:
Hi people, I am new to Nabble and Hadoop. I was having a look at the wordcount program. Can someone please let me know how to find which data gets mapped to which node? In the sense, I have a master node 0 and 4 other nodes 1-4, and I ran the wordcount successfully. But I would like to print, for each node, how much data it got from the input data file. Any suggestions?

us latha wrote:
Hi, Inside the map method, I performed the following change to Example: WordCount v1.0 (http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0) at http://hadoop.apache.org/core/docs/current/mapred_tutorial.html --
String filename = new String();
...
filename = ((FileSplit) reporter.getInputSplit()).getPath().toString();
while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken() + " " + filename);
Worked great!! Thanks to everyone! Regards, Srilatha

On Sat, Oct 18, 2008 at 6:24 PM, Latha usla...@gmail.com wrote:
Hi All, Thank you for your valuable inputs suggesting possible solutions for creating an index file with the following format. word1 filename count word2 filename count. However, the following is not working for me. Please help me to resolve the same.
--
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private Text filename = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        filename.set(((FileSplit) reporter.getInputSplit()).getPath().toString());
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, filename);
        }
    }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        int sum = 0;
        Text filename = new Text();
        while (values.hasNext()) {
            sum++;
            filename.set(values.next().toString());
        }
        String file = filename.toString() + " " + (new IntWritable(sum)).toString();
        filename = new Text(file);
        output.collect(key, filename);
    }
}
--
08/10/18 05:38:25 INFO mapred.JobClient: Task Id : task_200810170342_0010_m_000000_2, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
    at org.myorg.WordCount$Map.map(WordCount.java:23)
    at org.myorg.WordCount$Map.map(WordCount.java:13)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
Thanks Srilatha

On Mon, Oct 6, 2008 at 11:38 AM, Owen O'Malley omal...@apache.org wrote:
On Sun, Oct 5, 2008 at 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote:
What you need to do is snag access to the filename in the configure method of the mapper.
You can also do it in the map method with: ((FileSplit) reporter.getInputSplit()).getPath()
Then instead of outputting just the word as the key, output a pair containing the word and the file name as the key. Everything downstream should remain the same. If you want to have each file handled by a single reduce, I'd suggest:
class FileWordPair implements Writable {
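For readers hitting the same trace: the type mismatch arises because map() now emits Text values while the job is presumably still configured for IntWritable map output, as in the stock WordCount driver. A guess at the missing driver-side fix, since the thread never shows the job setup:

conf.setOutputValueClass(Text.class); // was presumably IntWritable.class in the original WordCount driver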
Re: Separating mapper intermediate files
Aayush You can use the following. Just play around with the pattern:

<property>
  <name>keep.task.files.pattern</name>
  <value>.*_m_123456_0</value>
  <description>Keep all files from tasks whose task names match the given regular expression. Defaults to none.</description>
</property>

Raj

From: aayush aayushgupta...@gmail.com To: common-user@hadoop.apache.org Sent: Tuesday, March 27, 2012 5:18 AM Subject: Re: Separating mapper intermediate files
Thanks Harsh. I set mapred.local.dir as you suggested. It creates 4 folders in it, for jobtracker, tasktracker, tt_private, etc., but I could not see an attempt directory. Can you let me know exactly where to look in this directory structure? Furthermore, it seems that all the intermediate spill and map output files are cleaned up when the mapper finishes. I want to see those intermediate files and don't want them cleaned up. How can I achieve that? Thanks a lot

On Mar 27, 2012, at 1:16 AM, Harsh J-2 [via Hadoop Common] ml-node+s472056n3860389...@n3.nabble.com wrote:
Hello Aayush, Three things that'd help clear your confusion: 1. dfs.data.dir controls where HDFS blocks are to be stored. Set this to a partition1 path. 2. mapred.local.dir controls where intermediate task data goes. Set this to a partition2 path.
Furthermore, can someone also tell me how to save intermediate mapper files (spill outputs) and where they are saved?
Intermediate outputs are handled by the framework itself (there is no user/manual work involved), and are saved inside attempt directories under mapred.local.dir.

On Tue, Mar 27, 2012 at 4:46 AM, aayush [hidden email] wrote:
I am a newbie to Hadoop and map reduce. I am running a single-node hadoop setup. I have created 2 partitions on my HDD. I want the mapper intermediate files (i.e. the spill files and the mapper output) to be sent to a file system on partition1, whereas everything else, including HDFS, should be run on partition2. I am struggling to find the appropriate parameters in the conf files. I understand that there is hadoop.tmp.dir and mapred.local.dir but am not sure how to use what. I would really appreciate if someone could tell me exactly which parameters to modify to achieve the goal.
-- Harsh J
-- View this message in context: http://hadoop-common.472056.n3.nabble.com/Separating-mapper-intermediate-files-tp3859787p3861159.html Sent from the Users mailing list archive at Nabble.com.
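The same knob can also be set from a job driver if editing the XML is inconvenient. A sketch against the old JobConf API; the driver class name is made up, and the pattern is the one from the property above:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(SpillInspector.class); // hypothetical driver class
conf.setKeepTaskFilesPattern(".*_m_123456_0");    // retain intermediate files of matching task attempts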
Re: Hadoop pain points?
Lol! Raj

From: Mike Spreitzer mspre...@us.ibm.com To: common-user@hadoop.apache.org Sent: Friday, March 2, 2012 8:31 AM Subject: Re: Hadoop pain points?
Interesting question. Do you want to be asking those who use Hadoop --- or those who find it too painful to use? Regards, Mike

From: Kunaal kunalbha...@alumni.cmu.edu To: common-user@hadoop.apache.org Date: 03/02/2012 11:23 AM Subject: Hadoop pain points? Sent by: kunaalbha...@gmail.com
I am doing a general poll: what are the most prevalent pain points that people run into with Hadoop? These could be performance related (memory usage, IO latencies), usage related, or anything really. The goal is to look for the areas where this platform could benefit the most in the near future. Any feedback is much appreciated. Thanks, Kunal.
Re: Adding nodes
The master and slave files, if I remember correctly, are used to start the correct daemons on the correct nodes from the master node. Raj

From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thursday, March 1, 2012 4:57 PM Subject: Re: Adding nodes
Not quite. Datanodes get the namenode host from fs.default.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. Sent from my iPhone

On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:
You only have to refresh nodes if you're making use of an allow file.
Thanks. Does it mean that when the tasktracker/datanode starts up, it communicates with the namenode using the master file?

On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:
Is this the right procedure to add nodes? I took some from the hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slaves 2. On the slave nodes, start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshNodes?
Re: Adding nodes
What Joey said is correct for both the apache and cloudera distros. The DN/TT daemons will connect to the NN/JT using the config files. The master and slave files are used for starting the correct daemons. From: anil gupta anilg...@buffalo.edu To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com Sent: Thursday, March 1, 2012 5:42 PM Subject: Re: Adding nodes Whatever Joey said is correct for Cloudera's distribution. I am not confident about the same for other distributions, as I haven't tried them. Thanks, Anil -- Thanks & Regards, Anil Gupta
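Concretely, the two settings Joey points to look like this; the hostnames and ports are examples only:

<!-- core-site.xml: how datanodes find the namenode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- mapred-site.xml: how tasktrackers find the jobtracker -->
<property>
  <name>mapred.job.tracker</name>
  <value>namenode.example.com:8021</value>
</property>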
Re: Error in Formatting NameNode
Manish, if you read the error message, it says connection refused. Big clue :-) You probably have a firewall configured. Raj Sent from my iPad Please excuse the typos. On Feb 12, 2012, at 1:41 AM, Manish Maheshwari mylogi...@gmail.com wrote: Thanks, I tried with hadoop-1.0.0 and JRE6 and things are looking good. I was able to format the namenode and bring up the NameNode 'calvin-PC:47110' and Hadoop Map/Reduce Administration webpages. Further, I tried the TestDFSIO example but get the connection-refused error below. -bash-4.1$ cd share/hadoop -bash-4.1$ ../../bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -write -nrFiles 1 -filesize 10 Warning: $HADOOP_HOME is deprecated. TestDFSIO.0.0.4 12/02/12 15:05:08 INFO fs.TestDFSIO: nrFiles = 1 12/02/12 15:05:08 INFO fs.TestDFSIO: fileSize (MB) = 1 12/02/12 15:05:08 INFO fs.TestDFSIO: bufferSize = 100 12/02/12 15:05:08 INFO fs.TestDFSIO: creating control file: 1 mega bytes, 1 files 12/02/12 15:05:08 INFO fs.TestDFSIO: created control files for: 1 files 12/02/12 15:05:11 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 0 time(s). 12/02/12 15:05:13 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 1 time(s). 12/02/12 15:05:15 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 2 time(s). 12/02/12 15:05:17 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 3 time(s). 12/02/12 15:05:19 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 4 time(s). 12/02/12 15:05:21 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 5 time(s). 12/02/12 15:05:23 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 6 time(s). 12/02/12 15:05:25 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 7 time(s). 12/02/12 15:05:27 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 8 time(s). 12/02/12 15:05:29 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 9 time(s).
java.net.ConnectException: Call to calvin-PC/127.0.0.1:8021 failed on connection exception: java.net.ConnectException: Connection refused: no further information at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095) at org.apache.hadoop.ipc.Client.call(Client.java:1071) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480) at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:474) at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1260) at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:257) at org.apache.hadoop.fs.TestDFSIO.readTest(TestDFSIO.java:295) at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:459) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:317) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.net.ConnectException: Connection refused: no further information at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434) at
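The retries against calvin-PC/127.0.0.1:8021 show the client cannot reach the JobTracker at all. A quick sanity check, assuming a Unix-like shell (the port is taken from the log above):

jps                        # is a JobTracker process running at all?
netstat -tln | grep 8021   # is anything listening on the JT port?
telnet calvin-PC 8021      # "connection refused" means daemon down or a firewall in the way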
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Here is what I understand: the RecordReader for the MultithreadedMapper takes the input split and cycles the records among the available threads. It also ensures that the map outputs are synchronized. So what Bejoy says is what will happen for the wordcount program. Raj From: bejoy.had...@gmail.com bejoy.had...@gmail.com To: common-user@hadoop.apache.org Sent: Friday, February 10, 2012 11:15 AM Subject: Re: Combining MultithreadedMapper threadpool size map.tasks.maximum Hi Rob, I'll try to answer this. From my understanding, if you are using MultithreadedMapper on the word count example with TextInputFormat, and imagine you have 2 threads and 2 lines in your input split: the RecordReader would read line 1 and give it to map thread 1, and line 2 to map thread 2. So an essentially identical process would happen on these two lines in parallel. This would be the default behavior. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Rob Stewart robstewar...@gmail.com Date: Fri, 10 Feb 2012 18:39:44 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Combining MultithreadedMapper threadpool size map.tasks.maximum Thanks, this is a lot clearer. One final question... On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote: Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The (k,v) to the map is (null, "foo bar lambda beta"). The canonical Hadoop program would tokenize this line of text and output ("foo",1) and so on. How would the MultithreadedMapper know how to further divide this line of text into, say, [(null, "foo bar"), (null, "lambda beta")] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread? In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes, you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when that's what slots at the TT are for? I'm still unsure how the multi-threaded mapper knows how to split the input value into chunks, one chunk for each thread. There is only one example in the Hadoop 0.23 trunk: hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java And in that source code, there is no custom logic for local splits per map task at all. Again, going back to the word count example: given a line of text as input to a map which comprises 6 words, I specify .setNumberOfThreads(2), so ideally I'd want 3 words analysed by one thread, and the other 3 by the other. Is that what would happen? i.e. I'm unsure whether the MultithreadedMapper class does the splitting of a map task's input among threads... Regards,
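To make the threading behaviour concrete, here is a minimal sketch of wiring up MultithreadedMapper in the new (mapreduce) API. The WordMapper class below is illustrative, not from the thread; note that MultithreadedMapper creates one mapper instance per thread and hands each whole record to a single thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MTWordCount {
  // Plain word-count map logic; each thread gets its own instance.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      for (String w : value.toString().split("\\s+")) {
        if (!w.isEmpty()) { word.set(w); ctx.write(word, ONE); }
      }
    }
  }

  public static Job configure(Configuration conf) throws Exception {
    Job job = new Job(conf, "mt-wordcount");
    job.setMapperClass(MultithreadedMapper.class);             // the threaded shell
    MultithreadedMapper.setMapperClass(job, WordMapper.class); // the real map logic
    MultithreadedMapper.setNumberOfThreads(job, 2);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job;
  }
}

So for Rob's 6-word line: the whole line goes to one thread, and the two threads only run in parallel when the split contains at least two records.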
Re: reference document which properties are set in which configuration file
Harsh, all: This was one of the first questions that I asked. It is sometimes not clear whether a parameter is site related or job related, or whether it belongs to the NN, JT, DN or TT. If I get some time during the weekend, I will try to put this into a document and see if it helps. Raj From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org Sent: Friday, February 10, 2012 8:31 PM Subject: Re: reference document which properties are set in which configuration file As a thumb rule, all properties starting with mapred.* or mapreduce.* go to mapred-site.xml, all properties starting with dfs.* go to hdfs-site.xml, and the rest may be put in core-site.xml to be safe. In case you notice MR or HDFS specific properties being outside of this naming convention, please do report a JIRA so we can deprecate the old name and rename it with a more appropriate prefix. On Sat, Feb 11, 2012 at 9:27 AM, Praveen Sripati praveensrip...@gmail.com wrote: The mapred.task.tracker.http.address will go in the mapred-site.xml file. In the Hadoop installation directory check the core-default.xml, hdfs-default.xml and mapred-default.xml files to know about the different properties. Some properties that exist only in the code may not be mentioned in the xml files and will take their default values. Praveen On Tue, Feb 7, 2012 at 3:30 PM, Kleegrewe, Christian christian.kleegr...@siemens.com wrote: Dear all, while configuring our hadoop cluster I wonder whether there exists a reference document that contains information about which configuration property has to be specified in which properties file. Especially I do not know where mapred.task.tracker.http.address has to be set. Is it in mapred-site.xml or in hdfs-site.xml? Any hint will be appreciated. Thanks Christian Siemens AG Corporate Technology Corporate Research and Technologies CT T DE IT3 Otto-Hahn-Ring 6 81739 München, Deutschland Tel.: +49 89 636-42722 Fax: +49 89 636-41423 mailto:christian.kleegr...@siemens.com -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
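Applying Harsh's thumb rule to the property Christian asked about: it starts with mapred.*, so it goes in mapred-site.xml. The value shown is the usual default, given here only as an example:

<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:50060</value>
</property>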
Re: Can I write to a compressed file which is located in hdfs?
Hi, here is a piece of code that does the reverse of what you want; it takes a bunch of compressed files (gzip, in this case) and converts them to text. You can tweak the code to do the reverse: http://pastebin.com/mBHVHtrm Raj From: Xiaobin She xiaobin...@gmail.com To: bejoy.had...@gmail.com Cc: common-user@hadoop.apache.org; David Sinclair dsincl...@chariotsolutions.com Sent: Tuesday, February 7, 2012 1:11 AM Subject: Re: Can I write to a compressed file which is located in hdfs? thank you Bejoy, I will look at that book. Thanks again! 2012/2/7 bejoy.had...@gmail.com Hi, AFAIK it is not possible to append into a compressed file. If you have files in hdfs in a dir and you need to compress them (like files for an hour), you can use MapReduce to do that by setting mapred.output.compress = true and mapred.output.compression.codec='theCodecYouPrefer'. You'd get the blocks compressed in the output dir. You can use the API to read from standard input and write compressed: get the hadoop conf, register the required compression codec, and write to a CompressionOutputStream. You should get a well detailed explanation on the same from the book 'Hadoop - The Definitive Guide' by Tom White. Regards Bejoy K S From handheld, Please excuse typos. From: Xiaobin She xiaobin...@gmail.com Date: Tue, 7 Feb 2012 14:24:01 +0800 To: common-user@hadoop.apache.org; bejoy.had...@gmail.com; David Sinclair dsincl...@chariotsolutions.com Subject: Re: Can I write to a compressed file which is located in hdfs? hi Bejoy and David, thank you for your help. So I can't directly write or append logs into a compressed file in hdfs, right? Can I compress a file which is already in hdfs and has not been compressed? If I can, how can I do that? Thanks! 2012/2/6 bejoy.had...@gmail.com Hi, I agree with David on the point that you can achieve step 1 of my previous response with flume, i.e. load the real-time inflow of data in compressed format into hdfs. You can specify a time interval or data size in the flume collector that determines when to flush data onto hdfs. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: David Sinclair dsincl...@chariotsolutions.com Date: Mon, 6 Feb 2012 09:06:00 To: common-user@hadoop.apache.org Cc: bejoy.had...@gmail.com Subject: Re: Can I write to a compressed file which is located in hdfs? Hi, You may want to have a look at the Flume project from Cloudera. I use it for writing data into HDFS. https://ccp.cloudera.com/display/SUPPORT/Downloads dave 2012/2/6 Xiaobin She xiaobin...@gmail.com hi Bejoy, thank you for your reply. Actually I have set up a test cluster which has one namenode/jobtracker and two datanode/tasktracker nodes, and I have run a test on this cluster. I fetch the log file of one of our modules from the log collector machines by rsync, and then I use the hive command line tool to load this log file into the hive warehouse, which simply copies the file from the local filesystem to hdfs. And I have run some analysis on these data with hive; all this ran well. But now I want to avoid the fetch step which uses rsync, and write the logs into hdfs files directly from the servers which generate these logs. And it seems easy to do this if the file located in hdfs is not compressed. But how do I write or append logs to a file that is compressed and located in hdfs? Is this possible? Or is this a bad practice? Thanks! 2012/2/6 bejoy.had...@gmail.com Hi, if you have log files enough to fill at least one block size in an hour,
you can go ahead as follows: run a scheduled job every hour that compresses the log files for that hour and stores them onto hdfs (you can use LZO or even Snappy to compress); if your hive does more frequent analysis on this data, store it PARTITIONED BY (Date, Hour), and while loading into hdfs also follow a directory/sub-directory structure, then once the data is in hdfs issue an ALTER TABLE ADD PARTITION statement on the corresponding hive table; and in the Hive DDL use the appropriate InputFormat (Hive has an Apache log InputFormat already). Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Xiaobin She xiaobin...@gmail.com Date: Mon, 6 Feb 2012 16:41:50 To: common-user@hadoop.apache.org; 佘晓彬 xiaobin...@gmail.com Reply-To: common-user@hadoop.apache.org Subject: Re: Can I write to a compressed file which is located in hdfs? Sorry, this sentence is wrong: I can't compress these logs every hour and then put them into hdfs. It should be: I can compress these logs every hour and then put them into hdfs. 2012/2/6 Xiaobin She xiaobin...@gmail.com hi all, I'm testing hadoop and hive,
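Bejoy's "write to a CompressionOutputStream" route might look like the sketch below. It creates a new compressed file each time; appending into an existing gzip file in hdfs is still not possible, and the output path is just an example:

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipHdfsWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    Path out = new Path("/logs/2012-02-07/app.log.gz"); // example path
    // Wrap the raw hdfs output stream in the codec's compressing stream.
    OutputStream os = codec.createOutputStream(fs.create(out));
    os.write("one log line\n".getBytes("UTF-8"));
    os.close();
  }
}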
Re: Problem in reduce phase (critical)
It says in the logs: org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused. You have a network problem. Do you have a firewall enabled? Raj. And please do not spam the list with the same message. From: hadoop hive hadooph...@gmail.com To: u...@hive.apache.org; common-user@hadoop.apache.org Sent: Friday, February 3, 2012 6:02 AM Subject: Problem in reduce phase (critical) On Fri, Feb 3, 2012 at 4:56 PM, hadoop hive hadooph...@gmail.com wrote: Hey folks, I'm getting this error while running MapReduce, and it comes up in the reduce phase: 2012-02-03 16:41:19,780 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0 copy failed: attempt_201201271626_5282_m_07_2 from hadoopdata3 2012-02-03 16:41:19,954 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1345) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1339) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:993) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:519) at sun.net.NetworkClient.doConnect(NetworkClient.java:158) at sun.net.www.http.HttpClient.openServer(HttpClient.java:394) at sun.net.www.http.HttpClient.openServer(HttpClient.java:529) at sun.net.www.http.HttpClient.<init>(HttpClient.java:233) at sun.net.www.http.HttpClient.New(HttpClient.java:306) at sun.net.www.http.HttpClient.New(HttpClient.java:323) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:837) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:778) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:703) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1026) ... 4 more 2012-02-03 16:41:19,995 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201201271626_5282_r_00_0: Failed fetch #4 from attempt_201201271626_5282_m_07_2 2012-02-03 16:41:19,995 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201201271626_5282_m_07_2 even after MAX_FETCH_RETRIES_PER_MAP retries...
reporting to the JobTracker 2012-02-03 16:41:19,995 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0 adding host hadoopdata3 to penalty box, next contact in 150 seconds 2012-02-03 16:41:19,995 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0: Got 1 map-outputs from previous failures 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0 Need another 1 map output(s) where 0 is already in progress 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0 Scheduled 0 outputs (1 slow hosts and 0 dup hosts) 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 2012-02-03 16:42:10,015 INFO org.apache.hadoop.mapred.ReduceTask: hadoopdata3 Will be considered after: 99 seconds. 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0 Need another 1 map output(s) where 0 is already in progress 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201271626_5282_r_00_0 Scheduled 0 outputs (1 slow hosts and 0 dup hosts) 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 2012-02-03 16:43:10,050 INFO org.apache.hadoop.mapred.ReduceTask: hadoopdata3 Will be considered after: 39 seconds.
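Raj's firewall hunch is easy to test directly: the reducer pulls map output over the TaskTracker's HTTP port, which defaults to 50060 (hadoopdata3 is the host named in the log):

telnet hadoopdata3 50060   # run from the node hosting the failing reducer
iptables -L -n             # as root on hadoopdata3, look for rules blocking that port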
Re: SimpleKMeansClustering - Failed to set permissions of path to 0700
Can you run any map/reduce jobs, such as word count? Raj Sent from my iPad Please excuse the typos. On Oct 17, 2011, at 5:18 PM, robpd robpodol...@yahoo.co.uk wrote: Hi, I am new to Mahout and Hadoop. I'm currently trying to get the SimpleKMeansClustering example from the Mahout in Action book to work. I am running the whole thing under cygwin from a sh script (in which I explicitly add the necessary jars to the classpath). Unfortunately I get... Exception in thread "main" java.io.IOException: Failed to set permissions of path: file:/tmp/hadoop-Rob/mapred/staging/Rob1823346078/.staging to 0700 at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:525) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:499) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:318) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:797) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791) at org.apache.hadoop.mapreduce.Job.submit(Job.java:465) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494) at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:362) at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:310) at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:237) at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152) at kmeans.SimpleKMeansClustering.main(Unknown Source) This is presumably because, although I am running under cygwin, Windows will not allow a change of privilege like this? I have done the following to address the problem, but without any success... a) Ensured I started Hadoop prior to running the program (start-all.sh) b) Edited the Hadoop conf file hdfs-site.xml to switch off the file permissions in HDFS:

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

c) Issued a hadoop fs -chmod +rwx -R /tmp to ensure that everyone was allowed to write to anything under tmp. I'd be very grateful for some help here if you have some ideas. Sorry if I am being green about things. It does seem like there's lots to learn. -- View this message in context: http://lucene.472066.n3.nabble.com/SimpleKMeansCLustering-Failed-to-set-permissions-of-path-to-0700-tp3429867p3429867.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
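Following Raj's suggestion, running the bundled word count is a quick way to tell whether the 0700 failure is specific to the Mahout driver or hits every job under cygwin. The jar name and paths are examples for a Hadoop 1.0-era tarball:

hadoop fs -mkdir input
hadoop fs -put somefile.txt input/
hadoop jar $HADOOP_HOME/hadoop-examples-1.0.0.jar wordcount input output
hadoop fs -cat output/part-r-00000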
Re: performance normal?
Really horrible performance. Sent from my iPad Please excuse the typos. On Oct 8, 2011, at 12:12 AM, tom uno ltom...@gmail.com wrote: Release 0.21.0, production system, 20 nodes; read 100 megabits per second, write 10 megabits per second. Performance normal?
Re: Maintaining map reduce job logs - The best practices
Bejoy, you can find the job-specific logs in two places. The first is in the job's hdfs output directory. The second is under $HADOOP_HOME/logs/history ($HADOOP_HOME/logs/history/done). Both these places have the config file and the job logs for each submitted job. Sent from my iPad Please excuse the typos. On Sep 23, 2011, at 12:52 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi All, I have a query here on maintaining Hadoop map-reduce logs. By default the logs appear on the respective task tracker nodes, which you can easily drill down to from the job tracker web UI at times of any failure (which is what I was following till now). Now I need to get to the next level and manage the logs corresponding to individual jobs. In my log I'm dumping some key parameters with respect to my business which could be used for business-level debugging/analysis at times in the future if required. For this purpose, I need a central log file corresponding to a job (not many files, i.e. one per task tracker, because as the cluster grows the number of log files corresponding to a job also increases). A single point of reference makes things handy for analysis by any business folks. I think it would be a generic requirement of any enterprise application to manage and archive the logs of each job execution. Hence there would definitely be best practices and standards identified and maintained by most of the core Hadoop enterprise users. Could you please help me out by sharing some of the better options for log management of Hadoop map-reduce logs? It could greatly help me choose the practice that best suits my environment and application needs. Thank You Regards Bejoy.K.S
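As a sketch of Raj's two locations (the job output path is an example, and the exact layout varies across versions):

hadoop fs -ls output/_logs/history   # per-job history and job config in the job's output dir
ls $HADOOP_HOME/logs/history/done    # completed-job history on the JobTracker node
hadoop job -history output           # summarize a finished job from those history files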