Re: Seeking Someone to Review Hadoop Article
I'm interested in it.

On Fri, Oct 24, 2008 at 6:31 AM, Tom Wheeler [EMAIL PROTECTED] wrote: Each month the developers at my company write a short article about a Java technology we find exciting. I've just finished one about Hadoop for November and am seeking a volunteer knowledgeable about Hadoop to look it over to help ensure it's both clear and technically accurate. If you're interested in helping me, please contact me off-list and I will send you the draft. Meanwhile, you can get a feel for the length and general style of the articles from our archives: http://www.ociweb.com/articles/publications/jnb.html Thanks in advance, Tom Wheeler

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
Re: Why did I only get 2 live datanodes?
Hi, Wei:

Do all of your nodes have the same directory structure, the same configuration files, and the same Hadoop data directory, and does the Hadoop user have access to those directories? You may also have connection problems: make sure your slave nodes can ssh to the master without a password, and turn off the firewall on your master node.

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
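A minimal sketch of the passwordless-ssh setup mentioned above, assuming OpenSSH and a hadoop user on both machines (the user and host names are placeholders, not from this thread):

  # on the node that will initiate the connection, generate a key pair with no passphrase
  ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  # copy the public key into the target machine's authorized_keys
  ssh-copy-id hadoop@master
  # verify that no password prompt appears
  ssh hadoop@master hostname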
Re: Need help in hdfs configuration fully distributed way in Mac OSX...
Hi:

You need to configure your nodes so that node 1 can connect to node 2 without a password.

On Tue, Sep 16, 2008 at 2:04 PM, souravm [EMAIL PROTECTED] wrote: Hi All, I'm facing a problem configuring HDFS in a fully distributed way on Mac OS X. Here is the topology:

1. The namenode is on machine 1.
2. There is one datanode, on machine 2.

Now when I execute start-dfs.sh from machine 1, it connects to machine 2 (after asking for the password for machine 2) and starts the datanode on machine 2 (as the console message says). However:

1. When I go to http://machine1:50070 it does not show the datanode at all; it says 0 datanodes configured.
2. In the log file on machine 2 what I see is:

STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = rc0902b-dhcp169.apple.com/17.229.22.169
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.17.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 684969; compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008

2008-09-15 18:54:44,626 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 1 time(s).
2008-09-15 18:54:45,627 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 2 time(s).
2008-09-15 18:54:46,628 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 3 time(s).
2008-09-15 18:54:47,629 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 4 time(s).
2008-09-15 18:54:48,630 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 5 time(s).
2008-09-15 18:54:49,631 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 6 time(s).
2008-09-15 18:54:50,632 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 7 time(s).
2008-09-15 18:54:51,633 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 8 time(s).
2008-09-15 18:54:52,635 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 9 time(s).
2008-09-15 18:54:53,640 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 10 time(s).
2008-09-15 18:54:54,641 INFO org.apache.hadoop.ipc.RPC: Server at /17.229.23.77:9000 not available yet, Z...

... and this retrying keeps repeating.

The hadoop-site.xml files are like this:

1. On machine 1:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/souravm/hdpn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

2. On machine 2:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://machine1 ip:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/nirdosh/hdfsd1</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The slaves file on machine 1 has a single entry: user name@ip of machine2

The exact steps I did:
1. Reformat the namenode on machine 1.
2. Execute start-dfs.sh on machine 1.
3. Check whether the datanode appears at http://machine 1 ip:50070.

Any pointer to resolve this issue would be appreciated.
Regards, Sourav

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
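The retry messages above show the datanode on machine 2 unable to reach the namenode RPC address 17.229.23.77:9000. One additional thing worth checking, offered only as a suggestion beyond the reply above, is that fs.default.name on machine 1 names an address the datanode can actually reach, rather than localhost, and that machine 2 uses the identical value. A minimal sketch for machine 1's hadoop-site.xml (the host name machine1.example.com is a placeholder, not from this thread):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://machine1.example.com:9000</value>
  </property>

Restart HDFS after changing it, and use the same host string in machine 2's configuration.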
Re: Small Filesizes
Hi,

I'm working on exactly the situation you describe, with millions of small files around 10 KB each. My idea is to compact these files into larger ones and create indexes for them. It is a file system on top of a file system, and it supports append updates and lazy deletes. Hope this helps.

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
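The poster's system is not shown in the thread, but the compact-plus-index idea can be illustrated with a very small sketch (an illustration only, not the poster's implementation): concatenate the small files into one large file and keep a side index of name, byte offset, and length, so a record can later be read back with a single seek.

  # pack small files into packed.dat and record name/offset/length in packed.idx
  out=packed.dat; idx=packed.idx; offset=0
  : > "$out"; : > "$idx"
  for f in smallfiles/*; do
    size=$(stat -c %s "$f")            # GNU stat; on Mac OS X use: stat -f %z "$f"
    printf '%s\t%s\t%s\n' "$f" "$offset" "$size" >> "$idx"
    cat "$f" >> "$out"
    offset=$((offset + size))
  done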
Re: specifying number of nodes for job
On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote: Hi, This may be a silly question, but I'm strangely having trouble finding an answer for it (perhaps I'm looking in the wrong places?). Suppose I have a cluster with n nodes, each with m processors. I wish to test the performance of, say, the wordcount program on k processors, where k is varied from k = 1 ... nm.

You can specify the number of tasks for each node in your hadoop-site.xml file. So you can vary k in steps of n, i.e. k = n, 2n, ..., m*n, instead of k = 1 ... nm.

How would I do this? I'm having trouble finding the proper command line option in the commands manual (http://hadoop.apache.org/core/docs/current/commands_manual.html) Thank you very much for your time. -SM

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
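A hedged example of the per-node task settings the reply refers to, using property names from Hadoop 0.17-era hadoop-default.xml (verify them against the hadoop-default.xml shipped with your release). Putting this in hadoop-site.xml on each tasktracker caps how many map and reduce tasks run there concurrently:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>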
Re: Output directory already exists
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I'm trying to write the output of two different map-reduce jobs into the same output directory. I'm using MultipleOutputFormats to set the filename dynamically, so there is no filename collision between the two jobs. However, I'm getting the error output directory already exists. Does the framework support this functionality? It seems silly to have to create a temp directory to store the output files from the second job and then have to copy them to the first job's output directory after the second job completes.

MapReduce creates the output directory each time a job runs and fails if the directory already exists. There seems to be no way to do what you describe other than modifying the source code.

Thanks, Shirley

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
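For what it's worth, the temp-directory workaround Shirley mentions can be done with the DFS shell after the second job finishes; the paths below are placeholders, not from this thread:

  # move the second job's output files into the first job's output directory
  bin/hadoop dfs -mv /user/shirley/job2-tmp/part-* /user/shirley/job1-output/
  # remove the now-empty temp directory
  bin/hadoop dfs -rmr /user/shirley/job2-tmp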
Re: basic questions about Hadoop!
On Sat, Aug 30, 2008 at 10:12 AM, Gerardo Velez [EMAIL PROTECTED] wrote: Hi Victor! I had problems with remote writing as well, so I tried to go further on this and I would like to share what I did; maybe you have more luck than me.

1) As I'm working as user gvelez on the remote host, I had to give write access to all, like this:

  bin/hadoop dfs -chmod -R a+w input

2) After that, there is no more connection refused error, but instead I got the following exception:

$ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
08/08/29 19:06:51 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/input/README.txt could only be replicated to 0 nodes, instead of 1
  at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
  at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
  at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:585)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

How many datanodes do you have? Only one, I guess. Modify your $HADOOP_HOME/conf/hadoop-site.xml, look up the property

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

and set the value to 0.

On Fri, Aug 29, 2008 at 9:53 AM, Victor Samoylov [EMAIL PROTECTED] wrote: Jeff, Thanks for the detailed instructions, but on a machine that is not a hadoop server I got this error:

~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
08/08/29 19:33:07 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block blk_-7622891475776838399

The thing is that the file was created, but with zero size. Do you have ideas why this happened? Thanks, Victor

On Fri, Aug 29, 2008 at 4:10 AM, Jeff Payne [EMAIL PROTECTED] wrote: You can use the hadoop command line on machines that aren't hadoop servers. If you copy the hadoop configuration from one of your master servers or data nodes to the client machine and run the command line dfs tools, it will copy the files directly to the data nodes. Or, you could use one of the client libraries. The Java client, for example, allows you to open an output stream and start dumping bytes onto it.

On Thu, Aug 28, 2008 at 5:05 PM, Gerardo Velez [EMAIL PROTECTED] wrote: Hi Jeff, thank you for answering! What about remote writing on HDFS? Let's suppose I have an application server on a Linux server A and a Hadoop cluster on servers B (master), C (slave), D (slave). What I would like is to send some files from server A to be processed by Hadoop. In order to do so, do I need to send those files to the master server first and then copy them into HDFS, or can I pass those files to any slave server? Basically I'm looking at remote writing because the files to be processed are not generated on any Hadoop server. Thanks again! -- Gerardo

On Thu, Aug 28, 2008 at 4:04 PM, Jeff Payne [EMAIL PROTECTED] wrote: Gerardo: I can't really speak to all of your questions, but the master/slave issue is a common concern with hadoop. A cluster has a single namenode and therefore a single point of failure. There is also a secondary name node process which runs on the same machine as the name node in most default configurations.
You can make it a different machine by adjusting the masters file. One of the more experienced lurkers should feel free to correct me, but my understanding is that the secondary name node keeps track of all the same index information used by the primary name node. So, if the namenode fails, there is no automatic recovery, but you can always tweak your cluster configuration to make the secondary namenode the primary and safely restart the cluster.

As for the storage of files, the name node is really just the traffic cop for HDFS. No HDFS files are actually stored on that machine. It's basically used as a directory and lock manager, etc. The files are stored on multiple datanodes, and I'm pretty sure all the actual file I/O happens directly between the client and the respective datanodes. Perhaps one of the more hardcore hadoop people on here will point it out if I'm giving bad advice.

On Thu, Aug 28, 2008 at 2:28 PM, Gerardo Velez [EMAIL PROTECTED] wrote:
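To illustrate Jeff's suggestion about running the DFS tools from a machine that is not part of the cluster, here is a hedged sketch; the host name and paths are placeholders, and it assumes the same Hadoop version is unpacked on the client machine:

  # copy the cluster configuration from the master to the client machine
  scp hadoop@master:/path/to/hadoop/conf/hadoop-site.xml ./conf/
  # the client now talks directly to the namenode and datanodes
  ./bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
  ./bin/hadoop dfs -ls /user/hadoop/input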
Re: Help: how to check the active datanodes?
Hi, Zhang:

Once you start Hadoop with the start-all.sh script, a Hadoop status page can be accessed at http://namenode-ip:port/dfshealth. The port is specified by the dfs.http.address property in your hadoop-default.xml. If the datanodes' status is not as expected, you need to check the log files; they show the details of the failure.

On Fri, Jul 4, 2008 at 4:17 AM, Richard Zhang [EMAIL PROTECTED] wrote: Hi guys: I am running hadoop on an 8-node cluster. I use start-all.sh to boot hadoop and it shows that all 8 data nodes are started. However, when I use bin/hadoop dfsadmin -report to check the status of the data nodes, it shows that only one data node (the one on the same host as the name node) is active. How can we know precisely whether all the data nodes are active? Has anyone dealt with this before? Thanks. Richard

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
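A quick illustration of the two checks suggested above (the log file name pattern is the usual Hadoop 0.1x default and the grep pattern is only an example):

  # cluster-wide view of live datanodes, run against the namenode
  bin/hadoop dfsadmin -report
  # on a datanode missing from the report, look for recent errors in its log
  grep -i 'error\|exception' $HADOOP_HOME/logs/hadoop-*-datanode-*.log | tail -n 20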
Re: reduce task hanging or just slow?
Hi:

I have met a similar problem. I eventually found that it was caused by hostname resolution, because Hadoop uses hostnames to reach other nodes. To check this, open your jobtracker log file (it usually resides at $HADOOP_HOME/logs/hadoop--jobtracker-.log) and see if there is an error like:

FATAL org.apache.hadoop.mapred.JobTracker: java.net.UnknownHostException: Invalid hostname for server: local

If there is, adding ip-hostname pairs to the /etc/hosts files on all of your nodes may fix the problem.

Good luck and best regards.

Mafish

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
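A small example of the /etc/hosts entries being suggested, with made-up addresses and host names; every node should carry the same mapping for all cluster members:

  192.168.0.10   master
  192.168.0.11   slave1
  192.168.0.12   slave2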
Re: Reduce Hangs
All ports are listed in conf/hadoop-default.xml and conf/hadoop-site.xml. Also, if you are using HBase, you need to pay attention to hbase-default.xml and hbase-site.xml, located in the hbase directory.

2008/3/29 Natarajan, Senthil [EMAIL PROTECTED]: Hi, Thanks for your suggestions. It looks like the problem is with the firewall. I created a firewall rule to allow ports 5 to 50100 (I found hadoop listening in this port range). It looks like I am missing some ports and those get blocked by the firewall. Could anyone please let me know how to configure hadoop to use only certain specified ports, so that those ports can be allowed in the firewall? Thanks, Senthil

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
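To give a concrete, hedged picture of what pinning ports in hadoop-site.xml looks like: the property names below are taken from 0.16/0.17-era hadoop-default.xml, so verify them against the hadoop-default.xml that ships with your release, as the reply above suggests; every address-valued property found there can be overridden the same way.

  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
  </property>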
Re: Reduce Hangs
On Fri, Mar 28, 2008 at 12:31 AM, 朱盛凯 [EMAIL PROTECTED] wrote: Hi, I met this problem in my cluster before, so I think I can share some of my experience with you, though it may not apply in your case.

Jobs in my cluster always hung at 16% of reduce. This occurred because the reduce tasks could not fetch the map output from other nodes. In my case, two factors could cause this failure of communication between two task trackers.

One is that the firewall blocks the trackers from communicating. I solved this by disabling the firewall. The other factor is that trackers refer to other nodes by host name only, not by IP address. I solved this by editing the /etc/hosts file with mappings from hostname to IP address for all nodes in the cluster.

I met this problem for the same reason too. Try adding the host names of all nodes to the /etc/hosts files on every machine.

I hope my experience will be helpful for you.

On 3/27/08, Natarajan, Senthil [EMAIL PROTECTED] wrote: Hi, I have a small Hadoop cluster, one master and three slaves. When I try the example wordcount on one of our log files (size ~350 MB), map runs fine but reduce always hangs (sometimes around 19%, 60% ...) and only finishes after a very long time. I am seeing this error:

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

In the log I am seeing this:

INFO org.apache.hadoop.mapred.TaskTracker: task_200803261535_0001_r_00_0 0.1834% reduce copy (11 of 20 at 0.02 MB/s)

Do you know what the problem might be? Thanks, Senthil

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
Re: dynamically adding slaves to hadoop cluster
You should do the following steps:

1. Have hadoop deployed on the new node with the same directory structure and configuration.
2. Just run $HADOOP_HOME/bin/hadoop datanode and tasktracker. The datanode and tasktracker will automatically contact the namenode and jobtracker specified in the hadoop configuration files, which finishes adding the new node to the hadoop cluster.

On Mon, Mar 10, 2008 at 4:56 AM, Aaron Kimball [EMAIL PROTECTED] wrote: Yes. You should have the same hadoop-site across all your slaves. They will need to know the DNS name for the namenode and jobtracker. - Aaron

tjohn wrote: Mahadev Konar wrote: I believe (as far as I remember) you should be able to add the node by bringing up the datanode or tasktracker on the remote machine. The Namenode or the jobtracker (I think) does not check for the nodes in the slaves file. The slaves file is just used to start up all the daemons by sshing to all the nodes in the slaves file during startup. So you should just be able to start up the datanode pointing to the correct namenode and it should work. Regards Mahadev

Sorry for my ignorance... To make a datanode/tasktracker point to the namenode, what should I do? Do I have to edit hadoop-site.xml? Thanks John

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
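For concreteness, one common way to start those daemons on the new node (a sketch only; hadoop-daemon.sh ships in bin/ of the standard distribution, and the paths assume a normal layout):

  # on the new slave, after unpacking hadoop and copying conf/hadoop-site.xml from the cluster
  $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Alternatively, running bin/hadoop datanode and bin/hadoop tasktracker in the foreground, as the reply says, does the same thing without daemonizing.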
Re: dynamically adding slaves to hadoop cluster
On Mon, Mar 10, 2008 at 9:47 AM, Mafish Liu [EMAIL PROTECTED] wrote: You should do the following steps: 1. Have hadoop deployed on the new node with the same directory structure and configuration. 2. Just run $HADOOP_HOME/bin/hadoop datanode and tasktracker.

Addition: do not run bin/hadoop namenode -format before you run the datanode, or you will get an error like Incompatible namespaceIDs ...

The datanode and tasktracker will automatically contact the namenode and jobtracker specified in the hadoop configuration files, which finishes adding the new node to the hadoop cluster.

On Mon, Mar 10, 2008 at 4:56 AM, Aaron Kimball [EMAIL PROTECTED] wrote: Yes. You should have the same hadoop-site across all your slaves. They will need to know the DNS name for the namenode and jobtracker. - Aaron

tjohn wrote: Mahadev Konar wrote: I believe (as far as I remember) you should be able to add the node by bringing up the datanode or tasktracker on the remote machine. The Namenode or the jobtracker (I think) does not check for the nodes in the slaves file. The slaves file is just used to start up all the daemons by sshing to all the nodes in the slaves file during startup. So you should just be able to start up the datanode pointing to the correct namenode and it should work. Regards Mahadev

Sorry for my ignorance... To make a datanode/tasktracker point to the namenode, what should I do? Do I have to edit hadoop-site.xml? Thanks John

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.

-- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
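The Incompatible namespaceIDs error mentioned above typically appears when a datanode's data directory was created against a differently formatted namenode. A hedged sketch of one common remedy, assuming the new node holds no data worth keeping and that dfs.data.dir points at /tmp/hadoop/dfs/data (a placeholder path; check your own configuration first):

  # on the affected datanode, with the datanode stopped
  rm -rf /tmp/hadoop/dfs/data
  # the directory is recreated with the current namespaceID on the next start
  $HADOOP_HOME/bin/hadoop-daemon.sh start datanode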