Re: Map-Reduce Slow Down
Assuming you are on a Linux box, on both machines verify that the servers are listening on the ports you expect via netstat -a -n -t -p:

    -a  show sockets accepting connections
    -n  do not translate IP addresses to host names
    -t  only list TCP sockets
    -p  list the pid/process name

On the machine 192.168.0.18 you should have sockets bound to 0.0.0.0:54310 with a process of java, and the pid should be the pid of your namenode process. On the remote machine you should be able to run *telnet 192.168.0.18 54310* and have it connect:

    Connected to 192.168.0.18.
    Escape character is '^]'.

If netstat shows the socket accepting but the telnet does not connect, then something is blocking the TCP packets between the machines: one or both machines has a firewall, an intervening router has a firewall, or there is some routing problem. The command /sbin/iptables -L will normally list the firewall rules, if any, on a Linux machine. You should be able to use telnet to verify that you can connect from the remote machine.

On Thu, Apr 16, 2009 at 9:18 PM, Mithila Nagendra mnage...@asu.edu wrote:
Thanks! I'll see what I can find out.

On Fri, Apr 17, 2009 at 4:55 AM, jason hadoop jason.had...@gmail.com wrote:
The firewall was run at system startup; I think there was an /etc/sysconfig/iptables file present which triggered the firewall. I don't currently have access to any CentOS 5 machines, so I can't easily check.

On Thu, Apr 16, 2009 at 6:54 PM, jason hadoop jason.had...@gmail.com wrote:
The kickstart script was something that the operations staff was using to initialize new machines. I never actually saw the script, just figured out that there was a firewall in place.

On Thu, Apr 16, 2009 at 1:28 PM, Mithila Nagendra mnage...@asu.edu wrote:
Jason: the kickstart script - was it something you wrote, or is it run when the system turns on? Mithila

On Thu, Apr 16, 2009 at 1:06 AM, Mithila Nagendra mnage...@asu.edu wrote:
Thanks Jason! Will check that out. Mithila

On Thu, Apr 16, 2009 at 5:23 AM, jason hadoop jason.had...@gmail.com wrote:
Double check that there is no firewall in place. At one point a bunch of new machines were kickstarted and placed in a cluster, and they all failed with something similar. It turned out the kickstart script had enabled the firewall with a rule that blocked ports in the 50k range. It took us a while to even think to check; that was not a part of our normal machine configuration.

On Wed, Apr 15, 2009 at 11:04 AM, Mithila Nagendra mnage...@asu.edu wrote:
Hi Aaron, I will look into that, thanks! I spoke to the admin who looks after the cluster. He said that the gateway comes into the picture only when one of the nodes communicates with a node outside of the cluster. But in my case the communication is carried out between nodes which all belong to the same cluster. Mithila

On Wed, Apr 15, 2009 at 8:59 PM, Aaron Kimball aa...@cloudera.com wrote:
Hi, I wrote a blog post a while back about connecting nodes via a gateway. See http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/ This assumes that the client is outside the gateway and all datanodes/namenode are inside, but the same principles apply. You'll just need to set up ssh tunnels from every datanode to the namenode. - Aaron

On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari rphul...@yahoo-inc.com wrote:
Looks like your NameNode is down. Verify whether the Hadoop processes are running (jps should show you all running Java processes). If your Hadoop processes are running, try restarting them. I guess this problem is due to your fsimage not being correct; you might have to format your namenode. Hope this helps. Thanks, -- Ravi

On 4/15/09 10:15 AM, Mithila Nagendra mnage...@asu.edu wrote:
The log file runs into thousands of lines, with the same message being displayed every time.

On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra mnage...@asu.edu wrote:
The log file hadoop-mithila-datanode-node19.log.2009-04-14 has the following in it:

    2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting DataNode
    STARTUP_MSG:   host = node19/127.0.0.1
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 0.18.3
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on
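Jason's telnet test above is easy to script if you need to check many machines at once; here is a minimal sketch in Java (the namenode address 192.168.0.18:54310 is the one from this thread - substitute your own):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            // Same check as "telnet 192.168.0.18 54310": attempt a plain TCP
            // connect with a 5-second timeout. Success means nothing is
            // blocking packets between the two machines on that port.
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress("192.168.0.18", 54310), 5000);
                System.out.println("Connected - namenode port is reachable");
            } finally {
                socket.close();
            }
        }
    }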
If I chain two MapReduce jobs, can I avoid saving the intermediate output?
As the title says. Thank you! imcaptor
Problem with using different username
Hi,

I am running a Hadoop cluster on Windows. I have 4 datanodes. 3 datanodes have the same username, so they always start, but one datanode has a different username. When I run the command bin/start-all.sh, the master tries to find bin/hadoop-daemon.sh using the master's username instead of the username the file actually lives under. Where should I make a change so that the master finds the file under the different username?

Thanks
Regards
Aseem Puri
Project Trainee
Honeywell Technology Solutions Lab
Bangalore
Re: Question about the classpath setting for bin/hadoop jar
I noticed that the bin/hadoop jar command doesn't add the jar being executed to the classpath. Is this deliberate, and what is the reasoning? The result is that resources in the jar are not accessible from the system class loader; rather, they are only available from the thread context class loader and the class loader of the main class.

In map and reduce tasks' JVMs, job libraries are added to the system classloader. However, for everything else, only framework code is present in the system classloader. If you are seeing this as a problem in your client-side code, you can use Configuration#getClassByName(String name) instead of Class.forName() for loading your job-related classes.
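A minimal sketch of that suggestion, assuming a hypothetical job class com.example.MyMapper packaged inside the job jar:

    import org.apache.hadoop.conf.Configuration;

    public class ClassLoadingExample {
        public static void main(String[] args) throws ClassNotFoundException {
            Configuration conf = new Configuration();

            // Class.forName("com.example.MyMapper") can fail in client-side
            // code because the job jar is not on the system classpath.
            // getClassByName() resolves through the classloader the
            // Configuration was created with (the thread context
            // classloader), so classes packaged in the job jar are found.
            Class<?> mapperClass = conf.getClassByName("com.example.MyMapper");
            System.out.println("Loaded " + mapperClass.getName());
        }
    }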
Re: Map-Reduce Slow Down
Thanks Jason! This helps a lot. I'm planning to talk to my network admin tomorrow; I'm hoping he'll be able to fix this problem. Mithila

On Fri, Apr 17, 2009 at 9:00 AM, jason hadoop jason.had...@gmail.com wrote:
[...]
Re: Sometimes no map tasks are run - X are complete and N-X are pending, none running
The last map task is forever in the pending queue - is this an issue with my setup/config, or do others have the problem?

Do you mean the leftover maps are not scheduled at all? What do you see in the jobtracker logs?
Re: Sometimes no map tasks are run - X are complete and N-X are pending, none running
On 4/17/09 12:26 PM, Sharad Agarwal shara...@yahoo-inc.com wrote:
The last map task is forever in the pending queue - is this an issue with my setup/config, or do others have the problem? Do you mean the leftover maps are not scheduled at all? What do you see in the jobtracker logs?

Also, in the JT UI, please check how many maps are marked as running while this map is still pending.
Re: If I chain two MapReduce jobs, can I avoid saving the intermediate output?
Could you give a more detailed description?

On Fri, Apr 17, 2009 at 2:21 PM, 王红宝 imcap...@gmail.com wrote:
As the title says. Thank you! imcaptor

--
朱盛凯 Jash Zhu
Software School, Fudan University
Re: Datanode Setup
ok, I have my hosts file set up the way you told me, and I changed my replication factor to 1. The thing that I don't get is this line from the datanodes...

    STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost

If I have my hadoop-site.xml set up correctly, with the correct address, it should work, right? It seems like the datanodes aren't getting an IP address to use, and I'm not sure why.

jpe30 wrote:
That helps a lot actually. I will try setting up my hosts file tomorrow and make the other changes you suggested. Thanks!

Mithila Nagendra wrote:
Hi, the replication factor has to be set to 1. Also, for your dfs and job tracker configuration you should insert the name of the node rather than the IP address. For instance:

    <value>192.168.1.10:54310</value>

can be:

    <value>master:54310</value>

The nodes can be renamed by editing the hosts file in the /etc folder. It should look like the following:

    # Do not remove the following line, or various programs
    # that require network functionality will fail.
    127.0.0.1    localhost.localdomain localhost node01
    192.168.0.1  node01
    192.168.0.2  node02
    192.168.0.3  node03

Hope this helps. Mithila

On Wed, Apr 15, 2009 at 9:40 PM, jpe30 jpotte...@gmail.com wrote:
I'm setting up a Hadoop cluster and I have the name node and job tracker up and running. However, I cannot get any of my datanodes or tasktrackers to start. Here is my hadoop-site.xml file...

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/h_temp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/hadoop/data</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>192.168.1.10:54310</value>
        <description>The name of the default file system. A URI whose scheme
        and authority determine the FileSystem implementation. The uri's
        scheme determines the config property (fs.SCHEME.impl) naming the
        FileSystem implementation class. The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
        <final>true</final>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>192.168.1.10:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If "local", then jobs are run in-process as a single map and
        reduce task.</description>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>0</value>
        <description>Default block replication. The actual number of
        replications can be specified when the file is created. The default
        is used if replication is not specified at create time.</description>
      </property>
    </configuration>

and here is the error I'm getting...

    2009-04-15 14:00:48,208 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting DataNode
    STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 0.18.3
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:0$
    ************************************************************/
    2009-04-15 14:00:48,355 ERROR org.apache.hadoop.dfs.DataNode: java.net.UnknownHostException: myhost: myhost
        at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
        at org.apache.hadoop.net.DNS.getDefaultHost(DNS.java:185)
        at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:249)
        at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:223)
        at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3071)
        at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:3026)
        at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:3034)
        at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3156)
    2009-04-15 14:00:48,356 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down DataNode at java.net.UnknownHostException: myhost: myhost
    ************************************************************/
Re: Problem with using different username
I don't think you can tell start-all.sh to log in as a different user on certain nodes. Why not just create the same user on the fourth node?

An alternative would be to start the fourth node manually via the hadoop-daemon.sh script. Here's an example:

    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker

Those commands should be run on the fourth node. This allows you to bypass the SSH step that start-all.sh does. Alex

On Thu, Apr 16, 2009 at 11:24 PM, Puri, Aseem aseem.p...@honeywell.com wrote:
[...]
Re: Sometimes no map tasks are run - X are complete and N-X are pending, none running
I forgot to examine the logs; I will next time it happens. Thank you. BTW, no maps are running - only a few are pending and the rest are complete. Saptarshi Guha

On Fri, Apr 17, 2009 at 3:06 AM, Jothi Padmanabhan joth...@yahoo-inc.com wrote:
[...]
EC2 instability
Hi, it's been several days since we have been trying to stabilize hadoop/hbase on an EC2 cluster, but we have failed to do so. We still come across frequent region server failures, scanner timeout exceptions, OS-level deadlocks, etc., and today while doing a list of tables on hbase I got the following:

    hbase(main):001:0> list
    09/04/17 13:57:18 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
    09/04/17 13:57:19 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
    09/04/17 13:57:20 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
    09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
    09/04/17 13:57:20 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
    09/04/17 13:57:21 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
    09/04/17 13:57:22 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
    09/04/17 13:57:23 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
    09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
    09/04/17 13:57:23 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
    09/04/17 13:57:26 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
    09/04/17 13:57:27 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
    09/04/17 13:57:28 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
    09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
    09/04/17 13:57:28 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
    09/04/17 13:57:29 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
    09/04/17 13:57:30 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
    09/04/17 13:57:31 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
    09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...
    09/04/17 13:57:31 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 could not be reached after 1 tries, giving up.
    09/04/17 13:57:34 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 0 time(s).
    09/04/17 13:57:35 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 1 time(s).
    09/04/17 13:57:36 INFO ipc.HBaseClass: Retrying connect to server: /10.254.234.32:60020. Already tried 2 time(s).
    09/04/17 13:57:36 INFO ipc.HbaseRPC: Server at /10.254.234.32:60020 not available yet, Z...

But if I check the UI, the hbase master is still on (I tried refreshing it several times). And I have been getting a lot of exceptions from time to time, including region servers going down (which happens very frequently, causing heavy data loss... that too on production data), scanner timeout exceptions, cannot-allocate-memory exceptions, etc.

I am working on an Amazon EC2 Large cluster with 6 nodes, each node having the following hardware configuration:

    Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores
    with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform

I am using hadoop-0.19.0 and hbase 0.19.0 (resynced to all the nodes, and made sure that there is a symbolic link to hadoop-site from hbase/conf). Following is my configuration in hadoop-site.xml:

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/mnt/hadoop</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://domU-12-31-39-00-E5-D2.compute-1.internal:50001</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>domU-12-31-39-00-E5-D2.compute-1.internal:50002</value>
      </property>
      <property>
        <name>tasktracker.http.threads</name>
        <value>80</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>3</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>3</value>
      </property>
      <property>
        <name>mapred.output.compress</name>
        <value>true</value>
      </property>
      <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
      </property>
      <property>
        <name>dfs.client.block.write.retries</name>
        <value>3</value>
      </property>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx4096m</value>
      </property>

**I gave the last one a high value since the RAM on each node is 7 GB... not sure of this setting though. I got a Cannot Allocate Memory exception after making this setting (got it for the first time). After going through the archives, someone suggested enabling memory overcommit - not sure of it though.**
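For what it's worth, a rough back-of-the-envelope check on that last setting (a sketch, assuming every task slot on a node fills at once):

    3 map slots + 3 reduce slots              = up to 6 child JVMs per node
    6 JVMs x 4096 MB (mapred.child.java.opts) = up to 24 GB of heap demand
    24 GB of demand vs. 7.5 GB of physical RAM per Large instance

That oversubscription alone could explain the error=12 (ENOMEM) "Cannot allocate memory" failures shown in the follow-up below: once the child JVMs have claimed most of the address space, even forking a tiny process like bash can fail unless memory overcommit is enabled.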
Re: EC2 instability
Hi, this is the exception I have been getting at the MapReduce layer:

    java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)
    Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        ... 10 more

On Fri, Apr 17, 2009 at 10:09 PM, Rakhi Khatwani rakhi.khatw...@gmail.com wrote:
[...]
Re: Datanode Setup
You have to make sure that you can ssh between the nodes. Also check the hosts file in the /etc folder; both the master and the slave must have each other's machines defined in it. Refer to my previous mail. Mithila

On Fri, Apr 17, 2009 at 7:18 PM, jpe30 jpotte...@gmail.com wrote:
[...]
RE: EC2 instability
Rakhi, I'd suggest going to 0.19.1 for both hbase and hadoop. We had so many problems with 0.19.0 on EC2 that we couldn't use it. We're having problems with name resolution and the generic startup scripts with the 0.19.1 release, but nothing that's a show stopper. Ted

-----Original Message-----
From: Rakhi Khatwani [mailto:rakhi.khatw...@gmail.com]
Sent: Friday, April 17, 2009 12:45 PM
To: hbase-u...@hadoop.apache.org; core-user@hadoop.apache.org
Subject: Re: EC2 instability

[...]
Re: Datanode Setup
Mithila Nagendra wrote:
You have to make sure that you can ssh between the nodes. Also check the hosts file in the /etc folder; both the master and the slave must have each other's machines defined in it. Refer to my previous mail. Mithila

I have SSH set up correctly, and here is the /etc/hosts file on node6 of the datanodes:

    #ip-address   hostname.domain.org   hostname
    127.0.0.1     localhost.localdomain localhost node6
    192.168.1.10  master
    192.168.1.1   node1
    192.168.1.2   node2
    192.168.1.3   node3
    192.168.1.4   node4
    192.168.1.5   node5
    192.168.1.6   node6

I have the slaves file on each machine set to node1 through node6, and each masters file set to master, except on the master itself. Still, I keep getting that same error on the datanodes...
Re: Map-Reduce Slow Down
Hey Jason, the problem's fixed! :) My network admin had messed something up! Now it works! Thanks for your help! Mithila

On Thu, Apr 16, 2009 at 11:58 PM, Mithila Nagendra mnage...@asu.edu wrote:
Thanks Jason! This helps a lot. I'm planning to talk to my network admin tomorrow; I'm hoping he'll be able to fix this problem. Mithila

On Fri, Apr 17, 2009 at 9:00 AM, jason hadoop jason.had...@gmail.com wrote:
[...]
Re: Datanode Setup
You should have the conf/slaves file on the master node list master, node01, node02, and so on, and the masters file on the master set to master. Also, in the /etc/hosts file, get rid of 'node6' in the line

    127.0.0.1 localhost.localdomain localhost node6

on all your nodes. Ensure that the /etc/hosts file contains the same information on all nodes. Also, the hadoop-site.xml files on all nodes should use master:portno for HDFS and for the job tracker. Once you do this, restart Hadoop.

On Fri, Apr 17, 2009 at 10:04 AM, jpe30 jpotte...@gmail.com wrote:
[...]
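Concretely, the layout described above would look something like this (a sketch; the host names and ports are the ones used earlier in this thread, and running a datanode on the master follows Mithila's slaves list):

    conf/masters (on master):
        master

    conf/slaves (on master):
        master
        node1
        node2
        node3
        node4
        node5
        node6

    /etc/hosts (identical on every node, with no node name after "localhost"):
        127.0.0.1     localhost.localdomain localhost
        192.168.1.10  master
        192.168.1.1   node1
        192.168.1.2   node2
        192.168.1.3   node3
        192.168.1.4   node4
        192.168.1.5   node5
        192.168.1.6   node6

    hadoop-site.xml (all nodes):
        <value>master:54310</value>   <!-- fs.default.name -->
        <value>master:54311</value>   <!-- mapred.job.tracker -->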
Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
There's definitely a false dichotomy to this paper, and I think it's a tad disingenuous. It's titled "A Comparison of Approaches to Large-Scale Data Analysis" when it should be titled "A Comparison of Parallel RDBMSs to MapReduce for RDBMS-Specific Problems". There's little surprise there: the people who wrote the paper have been gunning for Hadoop for quite a while -- they've written papers before which describe MR as a "big step backwards". Not to mention the primary authors are the CTO of Vertica, a parallel DB company, and a lead tech from Microsoft.

We all know MapReduce is not meant for non-parallelizable, index-dependent tasks like O(1) access to data, table joins, grepping indexed data, etc. MapReduce excels at highly parallelizable tasks like keyword and document indexing, web crawling, gene sequencing, etc. What would have been *great*, and what I'm working on a whitepaper for, is a study on which classes of problems are ideal for parallel RDBMSs, which are ideal for MapReduce, and then performance timings of those solutions. The study is about as useful as if I had written "A Comparison of Approaches to Operating System File Allocation Table Management" and then compared SQL and ext3. Yes, I'm in one of *those* moods today :) Cheers, Bradford

On Wed, Apr 15, 2009 at 8:22 AM, Jonathan Gray jl...@streamy.com wrote:
I agree with you, Andy. This seems to be a great look into what Hadoop MapReduce is not good at. Over in the HBase world, we constantly deal with comparisons like this to RDBMSs, trying to determine if one is better than the other. It's a false choice and completely depends on the use case. Hadoop is not suited for random access, joins, or dealing with subsets of your data; i.e., it is not a relational database! It's designed to distribute a full scan of a large dataset, placing tasks on the same nodes as the data they process. The emphasis is on task scheduling, fault tolerance, and very large datasets; low latency has not been a priority. There are no indexes to speak of; they are completely orthogonal to what it does, so of course there is an enormous disparity in cases where indexes make sense. Yes, B-Tree indexes are a wonderful breakthrough in data technology :)

In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of applications, including batch log processing, web crawling, and a number of machine learning and natural language processing jobs... These may not be tasks that DBMS-X or Vertica would be good at, if they are even capable of them, but they are all things that I would include under "large-scale data analysis". It would have been really interesting to see how things like Pig, Hive, and Cascading stack up against DBMS-X/Vertica for very complex, multi-join/sort/etc. queries, across a broad spectrum of use cases and dataset/result sizes. There are a wide variety of solutions to the problems out there. It's important to know the strengths and weaknesses of each, so it's a bit unfortunate that this paper set the stage as it did. JG

On Wed, April 15, 2009 6:44 am, Andy Liu wrote:
Not sure if comparing Hadoop to databases is an apples-to-apples comparison. Hadoop is a complete job execution framework, which collocates the data with the computation. I suppose DBMS-X and Vertica do that to a certain extent, by way of SQL, but you're restricted to that. If you want to, say, build a distributed web crawler, or a complex data processing pipeline, Hadoop will schedule those processes across a cluster for you, while Vertica and DBMS-X only deal with the storage of the data.

The choice of experiments seemed skewed towards DBMS-X and Vertica. I think everybody is aware that MapReduce is inefficient at handling SQL-like queries and joins. It's also worth noting that I think 4 out of the 7 authors either currently or at one time worked with Vertica (or C-Store, the precursor to Vertica). Andy

On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio germog...@gmail.com wrote:
(Hadoop is used in the benchmarks) http://database.cs.brown.edu/sigmod09/

"There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR
Re: Seattle / PNW Hadoop + Lucene User Group?
OK, we've got 3 people... that's enough for a party? :) Surely there must be dozens more of you guys out there... c'mon, accelerate your knowledge! Join us in Seattle!

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens bradfordsteph...@gmail.com wrote:
Greetings, would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford