Re: FileStatus.getLen(): bug in documentation or bug in implementation?
Documentation is wrong. Implementation wins. Could you please file a bug? Thanks, --Konstantin

Dima Rzhevskiy wrote: Hi all. I am trying to get the length of a file in Hadoop (RawLocalFileSystem or HDFS). The javadoc for org.apache.hadoop.fs.FileStatus.getLen() says that this method returns the length of this file, in blocks. But the method returns the size in bytes. Is this a bug in the documentation or in the implementation? I use hadoop-0.18.3. Dmitry Rzhevskiy.
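As the reply confirms, getLen() returns bytes, not blocks. If a block count is what you actually wanted, it can be derived from the byte length and the block size. A small illustration of the arithmetic (plain Python, not Hadoop API code; the 64 MB default block size is the usual value in 0.18.x):

```python
import math

def blocks_for_length(length_bytes: int, block_size: int = 64 * 1024 * 1024) -> int:
    """Number of HDFS blocks needed to hold a file of the given byte length.

    block_size defaults to 64 MB, the usual default in Hadoop 0.18.x.
    """
    if length_bytes == 0:
        return 0
    return math.ceil(length_bytes / block_size)

# A 200 MB file needs 4 blocks of 64 MB (3 full blocks + 1 partial).
print(blocks_for_length(200 * 1024 * 1024))  # 4
```

So a length of 209715200 (what getLen() actually returns for a 200 MB file) corresponds to 4 blocks, which is what the javadoc incorrectly claimed the method returns.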
Re: Add new Datanodes: Is redistribution of previous data required?
These links should help you to rebalance the nodes:
http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing
http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Rebalancer
http://hadoop.apache.org/core/docs/current/commands_manual.html#balancer
http://issues.apache.org/jira/secure/attachment/12368261/RebalanceDesign6.pdf
--Konstantin

asif md wrote: @Alex Thanks. http://wiki.apache.org/hadoop/FAQ#6 Has anyone any experience with this? Please suggest.

On Wed, Jun 24, 2009 at 5:44 PM, Alex Loddengaard a...@cloudera.com wrote: Hi, Running the rebalancer script (by the way, you only need to run it once) redistributes all of your data for you. That is, after you've run the rebalancer, your data should be stored evenly among your 10 nodes. Alex

On Wed, Jun 24, 2009 at 2:50 PM, asif md asif.d...@gmail.com wrote: Hello everyone, I have added 7 nodes to my 3-node cluster. I followed these steps: 1. added each node's IP to conf/slaves at the master; 2. ran bin/start-balancer.sh at each node. I loaded the data when the cluster had three nodes; it is now TEN. Can I do anything to redistribute the data among all the nodes? Any ideas appreciated. Thanks and Regards, Asif.
Re: Max. Possible No. of Files
There are some name-node memory estimates in this jira: http://issues.apache.org/jira/browse/HADOOP-1687 With 16 GB you can normally have 60 million objects (files + blocks) on the name-node. The number of files depends on the file-to-block ratio. --Konstantin

Brian Bockelman wrote: On Jun 5, 2009, at 11:51 AM, Wasim Bari wrote: Hi, Does someone have some data regarding the maximum possible number of files on HDFS? Hey Wasim, I don't think that there is a maximum limit. Remember: 1) Less is better. HDFS is optimized for big files. 2) The amount of memory the HDFS namenode needs is a function of the number of files. If you have a huge number of files, you get a huge memory requirement. 1-2 million files is fairly safe if you have a normal-looking namenode server (8-16 GB RAM). I know some of our UCSD colleagues just ran a test where they were able to put more than 0.5M files in a single directory and still have a usable file system. Brian

My second question is: I created small files with a small block size, up to one lakh (100,000), and read the files from HDFS; reading performance remained almost unaffected as the number of files increased. The possible reasons I could think of are: 1. One lakh isn't a big enough number to disturb HDFS performance (I used 1 namenode and 4 datanodes). 2. As reading is done directly from the datanodes, with only the initial interaction going to the namenode, reading from different nodes doesn't affect the performance. If someone could add to or negate this information it would be highly appreciated. Cheers, Wasim
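The 16 GB / 60-million-objects figure above implies roughly 280 bytes of heap per object (file or block). A back-of-the-envelope helper built on that ratio (plain Python; the bytes-per-object constant is an assumption derived from the numbers quoted in this thread, not a measured value):

```python
def namenode_objects(num_files: int, avg_blocks_per_file: float) -> int:
    """Objects tracked by the name-node: one per file plus one per block."""
    return int(num_files * (1 + avg_blocks_per_file))

def heap_gb_estimate(num_objects: int, bytes_per_object: int = 280) -> float:
    """Very rough heap estimate. ~280 bytes/object is an assumption
    back-derived from the 16 GB ~= 60M objects figure in this thread."""
    return num_objects * bytes_per_object / 2**30

# 20M files averaging 2 blocks each -> 60M objects, i.e. the quoted 16 GB regime.
objs = namenode_objects(20_000_000, 2.0)
print(objs)                              # 60000000
print(round(heap_gb_estimate(objs), 1))  # ~15.6 GB
```

This also shows why the file count "depends on the file-to-block ratio": 60M one-block files cost the same heap as 20M three-block files.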
Re: Setting up another machine as secondary node
I don't think you will find any step-by-step instructions. Somebody has already mentioned in the replies below that the secondary node is NOT a fail-over node. You can read about it here:
http://wiki.apache.org/hadoop/FAQ#7
http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Secondary+NameNode
In fact, the Secondary NameNode is a checkpointer only: it cannot process heartbeats from data-nodes or ls commands from hdfs clients. What you probably meant to do after the NameNode failed on node1 is: stop the Secondary node on node2 and then start the real NameNode on node2. You will also have to restart the data-nodes to redirect them to the new name-node. Another way to model a fail-over is to play with the Backup node, which is only available in trunk (not in 0.19, which you seem to be using) and is supposed to replace the secondary node in 0.21. The Backup node is a real name-node, and it can start processing heartbeats and client commands if you redirect them to it. I guess nobody has tried it yet, so please share your experience. Regards, --Konstantin

Rakhi Khatwani wrote: Hi, thanks for the suggestions, but my scenario is a little different. I am doing a POC on namenode failover. I have a 5-node cluster setup in which one node acts as the master, 3 act as slaves, and the last one is the secondary node. I start my Hadoop DFS, write something into it, and later kill my namenode (trying to reproduce a real-world scenario where the namenode fails due to some hardware error). My aim is to start the secondary node as the primary machine, so that the DFS is intact (by copying the checkpoint info) and all the slave PCs become slaves of the secondary namenode. 1. Can this be achieved without shutting down the cluster? I have read this somewhere but couldn't achieve it. 2. What are the step-by-step instructions to achieve it? I have googled it, got a lot of different opinions, and am totally confused now.
Thanks, Raakhi

On Tue, May 26, 2009 at 11:27 PM, Konstantin Shvachko s...@yahoo-inc.com wrote: Hi Rakhi, This is because your name-node is trying to -importCheckpoint from a directory which is locked by the secondary name-node. The secondary node is also running in your case, right? You should use -importCheckpoint as the last resort, when the name-node's directories are damaged. In the regular case you start the name-node with ./hadoop-daemon.sh start namenode. Thanks, --Konstantin

Rakhi Khatwani wrote: Hi, I followed the instructions suggested by you all, but I still come across this exception when I use the following command: ./hadoop-daemon.sh start namenode -importCheckpoint. The exception is as follows:

2009-05-26 14:43:48,004 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = germapp/192.168.0.1
STARTUP_MSG: args = [-importCheckpoint]
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
2009-05-26 14:43:48,147 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=4
2009-05-26 14:43:48,154 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: germapp/192.168.0.1:4
2009-05-26 14:43:48,160 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-05-26 14:43:48,166 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-05-26 14:43:48,316 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=ithurs,ithurs
2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2009-05-26 14:43:48,343 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-05-26 14:43:48,347 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /tmp/hadoop-ithurs/dfs/name is not formatted.
2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ...
2009-05-26 14:43:48,457 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked.
2009-05-26 14:43:48,460 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked
Re: Setting up another machine as secondary node
Hi Rakhi, This is because your name-node is trying to -importCheckpoint from a directory which is locked by the secondary name-node. The secondary node is also running in your case, right? You should use -importCheckpoint as the last resort, when the name-node's directories are damaged. In the regular case you start the name-node with ./hadoop-daemon.sh start namenode. Thanks, --Konstantin

Rakhi Khatwani wrote: Hi, I followed the instructions suggested by you all, but I still come across this exception when I use the following command: ./hadoop-daemon.sh start namenode -importCheckpoint. The exception is as follows:

2009-05-26 14:43:48,004 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = germapp/192.168.0.1
STARTUP_MSG: args = [-importCheckpoint]
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
2009-05-26 14:43:48,147 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=4
2009-05-26 14:43:48,154 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: germapp/192.168.0.1:4
2009-05-26 14:43:48,160 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-05-26 14:43:48,166 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-05-26 14:43:48,316 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=ithurs,ithurs
2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2009-05-26 14:43:48,343 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-05-26 14:43:48,347 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /tmp/hadoop-ithurs/dfs/name is not formatted.
2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ...
2009-05-26 14:43:48,457 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked.
2009-05-26 14:43:48,460 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked.
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:273)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.doImportCheckpoint(FSImage.java:504)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:344)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:290)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
2009-05-26 14:43:48,464 INFO org.apache.hadoop.ipc.Server: Stopping server on 4
2009-05-26 14:43:48,466 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked.
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:273)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.doImportCheckpoint(FSImage.java:504)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:344)
Re: HDFS read/write speeds, and read optimization
I just wanted to add one other published benchmark: http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html In this example, on a very busy cluster of 4000 nodes, both read and write throughputs were close to the local disk bandwidth. This benchmark (called TestDFSIO) uses large sequential writes and reads. You can run it yourself on your hardware to compare.

Is it more efficient to unify the disks into one volume (RAID or LVM) and then present them as a single space? Or is it better to specify each disk separately? There was a discussion recently on this list about RAID0 vs. separate disks; please search the archives. Separate disks turn out to perform better.

Reliability-wise, the latter sounds more correct, as a single disk (or several, up to 3) going down won't take the whole node with it. But perhaps there is a performance penalty? You always have block replicas on other nodes, so one node going down should not be a problem. Thanks, --Konstantin
Re: Not a host:port pair when running balancer
Clarifying: the port # is missing in your configuration; it should be

<property>
  <name>fs.default.name</name>
  <value>hdfs://hvcwydev0601:8020</value>
</property>

where 8020 is your port number. --Konstantin

Hairong Kuang wrote: Please try using the port number 8020. Hairong

On 3/11/09 9:42 AM, Stuart White stuart.whi...@gmail.com wrote: I've been running hadoop-0.19.0 for several weeks successfully. Today, for the first time, I tried to run the balancer, and I'm receiving: java.lang.RuntimeException: Not a host:port pair: hvcwydev0601 In my hadoop-site.xml, I have this:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hvcwydev0601/</value>
</property>

What do I need to change to get the balancer to work? It seems I need to add a port to fs.default.name. If so, what port? Can I just pick any port? If I specify a port, do I need to specify any other parms accordingly? I searched the forum and found a few posts on this topic, but it seems that the configuration parms have changed over time, so I'm not sure what the current correct configuration is. Also, if fs.default.name is supposed to have a port, I'll point out that the docs don't say so: http://hadoop.apache.org/core/docs/r0.19.1/cluster_setup.html The example given for fs.default.name is hdfs://hostname/. Thanks!
Re: Not a host:port pair when running balancer
This is not about the default port. The port was not specified at all in the original configuration. --Konstantin

Doug Cutting wrote: Konstantin Shvachko wrote: Clarifying: the port # is missing in your configuration; it should be <property><name>fs.default.name</name><value>hdfs://hvcwydev0601:8020</value></property> where 8020 is your port number. That's the work-around, but it's a bug. One should not need to specify the default port number (8020). Please file an issue in Jira. Doug
Re: HDFS issues in 0.17.2.1 and 0.19.0 versions
Are you sure you were using 0.19 and not 0.20? For 0.17, please check that the configuration file hadoop-site.xml exists in your configuration directory, is not empty, and points to hdfs rather than the local file system, which it does by default. In 0.17 all config variables were in one common file; 0.19 was the same. 0.20 changed this, so now we have hdfs-site.xml, core-site.xml, and mapred-site.xml. See https://issues.apache.org/jira/browse/HADOOP-4631 Hope this helps. --Konstantin

Shyam Sarkar wrote: Hello, I am trying to understand the clustering inside 0.17.2.1 as opposed to 0.19.0. I am trying to create a directory inside the 0.17.2.1 HDFS, but it gets created in the Linux FS. However, I can do that in 0.19.0 without any problem. Can someone suggest what I should do for 0.17.2.1 so that I can create a directory in HDFS? Thanks, shyam.s.sar...@gmail.com
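The "points to hdfs rather than the local file system" check above comes down to fs.default.name being an hdfs:// URI; when it is unset it defaults to the local file system, which matches the behavior Shyam describes. A minimal hadoop-site.xml sketch for 0.17 (the hostname and port below are placeholders, not values from this thread):

```xml
<property>
  <name>fs.default.name</name>
  <!-- defaults to the local file system when unset; an hdfs:// URI
       directs bin/hadoop fs commands at HDFS instead -->
  <value>hdfs://namenode-host:8020</value>
</property>
```

In 0.20 the same property would go into core-site.xml rather than hadoop-site.xml, per the split described above.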
Re: HDFS losing blocks or connection error
Yes guys, we observed such problems. They will be common to 0.18.2 and 0.19.0, exactly as you described, when data-nodes become unstable. There were several issues; please take a look:
HADOOP-4997 workaround for tmp file handling on DataNodes
HADOOP-4663 (links to other related issues)
HADOOP-4810 Data lost at cluster startup
HADOOP-4702 Failed block replication leaves an incomplete block
We run 0.18.3 now and it does not have these problems. 0.19.1 should be the same. Thanks, --Konstantin

Zak, Richard [USA] wrote: It happens right after the MR job (though once or twice it has happened during one). I am not using EBS, just HDFS between the machines. As for tasks, there are 4 mappers and 0 reducers. Richard J. Zak

-----Original Message----- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans Sent: Friday, January 23, 2009 13:24 To: core-user@hadoop.apache.org Subject: Re: HDFS losing blocks or connection error

xlarge is good. Is it normally happening during an MR job? If so, how many tasks do you have running at the same moment overall? Also, is your data stored on EBS? Thx, J-D

On Fri, Jan 23, 2009 at 12:55 PM, Zak, Richard [USA] zak_rich...@bah.com wrote: 4 slaves, 1 master, all of the m1.xlarge instance type. Richard J. Zak

-----Original Message----- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans Sent: Friday, January 23, 2009 12:34 To: core-user@hadoop.apache.org Subject: Re: HDFS losing blocks or connection error

Richard, this happens when the datanodes are too slow and eventually all replicas for a single block are tagged as bad. What kind of instances are you using? How many of them? J-D

On Fri, Jan 23, 2009 at 12:13 PM, Zak, Richard [USA] zak_rich...@bah.com wrote: Might there be a reason why this seems to routinely happen to me when using Hadoop 0.19.0 on Amazon EC2?
09/01/23 11:45:52 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block
09/01/23 11:45:55 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block
09/01/23 11:45:58 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block
09/01/23 11:46:01 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1757733438820764312_6736 file=/stats.txt

It seems HDFS isn't as robust or reliable as the website says, and/or I have a configuration issue. Richard J. Zak
Re: Hadoop 0.17.1 => EOFException reading FSEdits file: what causes this? how to prevent?
Joe, it looks like your edits file is corrupted or truncated. Most probably the last modification was not written to it when the name-node was turned off. This may happen if the node crashes, depending on the underlying local file system, I guess. Here are some options for you to consider:
- Try an alternative replica of the image directory, if you had one.
- Try to edit the edits file, if you know the internal format.
- Try to modify a local copy of your name-node code so that it catches EOFException and ignores it.
- Use a checkpointed image if you can afford to lose the latest modifications to the fs.
- Formatting, of course, is the last resort, since you lose everything.
Thanks, --Konstantin

Joe Montanez wrote: Hi: I'm using Hadoop 0.17.1 and I'm encountering an EOFException reading the FSEdits file. I don't have a clear understanding of what is causing this and how to prevent it. Has anyone seen this who can advise? Thanks in advance, Joe

2009-01-12 22:51:45,573 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:599)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
2009-01-12 22:51:45,574 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
Re: Locks in hadoop
Did you look at ZooKeeper? Thanks, --Konstantin

Sagar Naik wrote: I would like to implement a locking mechanism across the HDFS cluster. I assume there is no inherent support for it. I was going to do it with files: according to my knowledge, file creation is an atomic operation, so a file-based lock should work. I need to think it through under all conditions, but if someone has a better idea/solution, please share. Thanks, -Sagar
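The idea in the question, using atomic file creation as a mutex, is the classic lock-file pattern. A local-filesystem sketch of it (plain Python with O_CREAT|O_EXCL; this is an illustration of the pattern, not HDFS code, and it shows the pattern's main weakness: a crashed holder leaves a stale lock behind, which is exactly the liveness problem ZooKeeper's ephemeral nodes solve):

```python
import errno
import os
import tempfile

class FileLock:
    """Lock based on atomic, exclusive file creation (local-FS sketch)."""

    def __init__(self, path: str):
        self.path = path
        self.fd = None

    def acquire(self) -> bool:
        try:
            # O_CREAT | O_EXCL fails if the file already exists -- the
            # create-if-absent check and the creation happen atomically.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except OSError as e:
            if e.errno == errno.EEXIST:
                return False  # someone else holds the lock
            raise

    def release(self) -> None:
        if self.fd is not None:
            os.close(self.fd)
            os.unlink(self.path)
            self.fd = None

lock_path = os.path.join(tempfile.gettempdir(), "hdfs-demo.lock")
if os.path.exists(lock_path):
    os.unlink(lock_path)  # clear a stale lock left by a previous run

lock = FileLock(lock_path)
print(lock.acquire())                 # True: first holder wins
print(FileLock(lock_path).acquire())  # False: already held
lock.release()
```

Note that nothing cleans up the lock if the holding process dies, which is why ZooKeeper is the more robust answer here.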
Re: TestDFSIO delivers bad values of throughput and average IO rate
In TestDFSIO we want each task to create only one file. It is a one-to-one mapping from files to map tasks, and the splits are defined so that each map gets only one file name, which it creates or reads. --Konstantin

tienduc_dinh wrote: I don't understand why the parameter -nrFiles of TestDFSIO should override mapred.map.tasks. nrFiles is the number of files which will be created, and mapred.map.tasks is the number of splits that will be made from the input file. Thanks

Konstantin Shvachko wrote: Hi tienduc_dinh, just a bit of background, which should help to answer your questions. TestDFSIO mappers perform one operation (read or write) each, measure the time taken by the operation, and output the following three values (I am intentionally omitting some other output stuff):
- size(i)
- time(i)
- rate(i) = size(i) / time(i)
i is the index of the map task, 0 <= i < N, and N is the -nrFiles value, which equals the number of maps. Then the reduce sums those values and writes them into part-0. That is, you get three fields in it:
size = size(0) + ... + size(N-1)
time = time(0) + ... + time(N-1)
rate = rate(0) + ... + rate(N-1)
Then we calculate
throughput = size / time
averageIORate = rate / N
So, answering your questions:
- There should be only one reduce task; otherwise you would have to manually sum the corresponding values in part-0 and part-1.
- The value of the rate field after the reduce equals the sum of the individual rates of each operation. So if you want an average, you should divide it by the number of tasks rather than multiply.
Now, in your case you create only one file (-nrFiles 1), which means you run only one map task. Setting mapred.map.tasks to 10 in hadoop-site.xml defines the default number of tasks per job. See here: http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.map.tasks In the case of TestDFSIO it will be overridden by -nrFiles. Hope this answers your questions. Thanks, --Konstantin

tienduc_dinh wrote: Hello, I'm now using hadoop-0.18.0 and testing it on a cluster with 1 master and 4 slaves. In hadoop-site.xml the value of mapred.map.tasks is 10. Because the values of throughput and average IO rate are similar, I just post the throughput values of the same command over 3 runs:
- hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048 -nrFiles 1
+ with dfs.replication = 1: 33,60 / 31,48 / 30,95
+ with dfs.replication = 2: 26,40 / 20,99 / 21,70
I find something strange while reading the source code.
- The value of mapred.reduce.tasks is always set to 1 (job.setNumReduceTasks(1) in the function runIOTest()), and reduceFile = new Path(WRITE_DIR, part-0) in analyzeResult(). So I think, if we instead had mapred.reduce.tasks = 2, we would have 2 paths on the file system, part-0 and part-1, e.g. /benchmarks/TestDFSIO/io_write/part-0
- And I don't understand the line with double med = rate / 1000 / tasks. Should it not be double med = rate * tasks / 1000?
Re: NameNode fatal crash - 0.18.1
Hi, Jonathan. The problem is that the local drive(s) you use for dfs.name.dir became inaccessible. So the name-node is no longer able to persist name-space modifications, and therefore self-terminated. The rest are the consequences. This is the core message:
2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
Could you please check the drives. --Konstantin

Jonathan Gray wrote: I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase RegionServer. At the time of this crash, the NameNode and SecondaryNameNode were both hosted on the same master. We do a nightly backup, and about 95% of the way through, HDFS crashed with...

NameNode shows:
2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:

Every single DataNode shows:
2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call failed on local exception
    at org.apache.hadoop.ipc.Client.call(Client.java:718)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
    at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

This is virtually all of the information I have. At the same time as the backup, we have normal HBase traffic and our hourly batch MR jobs, so the slave nodes were pretty heavily loaded, but I don't see anything in the DN logs besides this Call failed. There are no space issues or anything else. Ganglia shows high CPU load around this time, which has been typical every night, but I don't see any issues in the DNs or NN about expired leases/no heartbeats/etc. Is there a way to prevent this failure from happening in the first place? I guess just reduce total load across the cluster?

The second question is about how to recover once the NameNode does fail... When trying to bring HDFS back up, we get hundreds of:
2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths
And then:
2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed.
Is there a way to recover from this? As of the time of this crash, we had the SecondaryNameNode on the same node. We are moving it to another node with sufficient memory now, but would that even prevent this kind of FS botching?

Also, my SecondaryNameNode is telling me it cannot connect when trying to do a checkpoint:
2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
I changed my masters file to contain just the hostname of the secondarynamenode; this seems to have properly started the NameNode on the node where I launched ./bin/start-dfs.sh, and started the SecondaryNameNode on the correct node as well. But it seems to be unable to connect back to the primary. I have hadoop-site.xml pointing to the fs.default.name of the primary, but otherwise there are no links back. Where would I specify to the secondary where the primary is located? We're also upgrading to Hadoop 0.19.0 at this time. Thank you for any help. Jonathan Gray
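One standard mitigation for the "All storage directories are inaccessible" failure above is to give dfs.name.dir several directories on independent drives; the name-node writes the image and edits to all of them, so a single failed drive is no longer fatal. A hedged hadoop-site.xml sketch (the paths below are placeholders, not values from this thread; an NFS-mounted directory is a common choice for one of them):

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- comma-separated list; the fsimage and edits log are replicated to
       every listed directory, so losing one drive does not kill the NN -->
  <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
</property>
```

This does not remove the need for a Secondary NameNode checkpoint, but it makes Konstantin's "try an alternative replica of the image directory" recovery option available when a drive does fail.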
Re: 0.18.1 datanode pseudo-deadlock problem
Hi Jason, 2 million blocks per data-node is not going to work. There were discussions about this previously; please check the mail archives. This means you have a lot of very small files, which HDFS is not designed to support. A general recommendation is to group small files into large ones, introducing some kind of record structure delimiting those small files, and to control it at the application level. Thanks, --Konstantin

Jason Venner wrote: The problem we are having is that datanodes periodically stall for 10-15 minutes, drop off the active list, and then come back. What is going on is that a long operation set is holding the lock on FSDataset.volumes, and all of the other block service requests stall behind this lock.

DataNode: [/data/dfs-video-18/dfs/data] daemon prio=10 tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
java.lang.Thread.State: RUNNABLE
    at java.lang.String.lastIndexOf(String.java:1628)
    at java.io.File.getName(File.java:399)
    at org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
    at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)
    - locked 0x551e8d48 (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
    at java.lang.Thread.run(Thread.java:619)

This is basically taking a stat on every HDFS block on the datanode, which in our case is ~2 million, and it can take 10+ minutes (we may be experiencing problems with our RAID controller but have no visibility into it); at the OS level the file system seems fine and operations eventually finish.

It appears that a couple of different data structures are being locked with the single FSDataset$FSVolumeSet object. Then this happens:

org.apache.hadoop.dfs.datanode$dataxcei...@1bcee17 daemon prio=10 tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0]
java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
    - waiting to lock 0x551e8d48 (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
    - locked 0x54e550e0 (a org.apache.hadoop.dfs.FSDataset)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:2322)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
    at java.lang.Thread.run(Thread.java:619)

which locks the FSDataset while waiting on the volume object, and now all of the Datanode operations stall waiting on the FSDataset object.

Our particular installation doesn't use multiple directories for HDFS, so a first simple hack for a local fix would be to modify getNextVolume to just return the single volume and not be synchronized. A richer alternative would be to make the locking on FSDataset$FSVolumeSet more fine-grained. Of course we are also trying to fix the file system performance and dfs block loading that cause the block report to take so long. Any suggestions or warnings? Thanks.
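Konstantin's recommendation above, packing many small files into a few large ones with a record structure, can be sketched as a length-prefixed container. (Plain Python for illustration; the format below is made up for this sketch — in practice Hadoop's SequenceFile, with the small-file name as the key and its contents as the value, is the usual way to do this.)

```python
import io
import struct

def pack(records):
    """Concatenate (name, payload) pairs into one length-prefixed blob.

    Record layout: 4-byte name length, 4-byte payload length (big-endian),
    then the name bytes, then the payload bytes.
    """
    out = io.BytesIO()
    for name, data in records:
        name_bytes = name.encode("utf-8")
        out.write(struct.pack(">II", len(name_bytes), len(data)))
        out.write(name_bytes)
        out.write(data)
    return out.getvalue()

def unpack(blob):
    """Inverse of pack(): yield the (name, payload) pairs back."""
    buf, pos = memoryview(blob), 0
    while pos < len(buf):
        name_len, data_len = struct.unpack_from(">II", buf, pos)
        pos += 8
        name = bytes(buf[pos:pos + name_len]).decode("utf-8")
        pos += name_len
        data = bytes(buf[pos:pos + data_len])
        pos += data_len
        yield name, data

files = [("a.txt", b"tiny"), ("b.txt", b"also tiny")]
assert list(unpack(pack(files))) == files
print("round-trip ok")
```

Ten thousand such records stored as one HDFS file cost the name-node a handful of objects instead of tens of thousands, which is exactly the pressure the 2-million-block data-node in this thread is suffering from.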
Re: TestDFSIO delivers bad values of throughput and average IO rate
tienduc_dinh wrote: Hi Konstantin, thanks so much for your help. I was a little bit confused about why, with my setting of mapred.map.tasks = 10 in hadoop-site.xml, hadoop didn't map anything. So your answer with "In case of TestDFSIO it will be overridden by -nrFiles" is the key. I now need your confirmation that I've understood it right. That is correct. + If I want to write 2 GB with 1 map task, I should use the following command: hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048 -nrFiles 1 The values of throughput are, e.g. 33,60 / 31,48 / 30,95. + If I want to write 2 GB with 4 map tasks, I should use the following command: hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 5012 -nrFiles 4 You are writing 20GB, not 2GB. It should be 512 instead of 5012. The values of throughput are, e.g. 31,50 / 32,09 / 30,56. Can you please explain why the values in case 2 are not much better? I have 1 master and 4 slaves, and if I calculate it right, they should be 4 times higher, right? throughput is MB/sec per client. It is great that you get the same numbers for 1 write and 4 parallel writes. This means that Hadoop on your cluster scales well! :-) Sorry for my poor English skills, and thanks very much for your help. Tien Duc Dinh Konstantin Shvachko wrote: Hi tienduc_dinh, Just a bit of background, which should help to answer your questions. TestDFSIO mappers perform one operation (read or write) each, measure the time taken by the operation, and output the following three values (I am intentionally omitting some other output stuff):
- size(i)
- time(i)
- rate(i) = size(i) / time(i)
i is the index of the map task, 0 <= i < N, where N is the -nrFiles value, which equals the number of maps. Then the reduce sums those values and writes them into part-0. That is, you get three fields in it:
size = size(0) + ... + size(N-1)
time = time(0) + ... + time(N-1)
rate = rate(0) + ... + rate(N-1)
Then we calculate:
throughput = size / time
averageIORate = rate / N
So answering your questions:
- There should be only one reduce task, otherwise you will have to manually sum corresponding values in part-0 and part-1.
- The value of the rate field after the reduce equals the sum of the individual rates of each operation. So if you want an average, you should divide it by the number of tasks rather than multiply.
Now, in your case you create only one file (-nrFiles 1), which means you run only one map task. Setting mapred.map.tasks to 10 in hadoop-site.xml defines the default number of tasks per job. See here: http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.map.tasks In the case of TestDFSIO it will be overridden by -nrFiles. Hope this answers your questions. Thanks, --Konstantin tienduc_dinh wrote: Hello, I'm now using hadoop-0.18.0 and testing it on a cluster with 1 master and 4 slaves. In hadoop-site.xml the value of mapred.map.tasks is 10. Because the values of throughput and average IO rate are similar, I just post the values of throughput of the same command over 3 runs:
- hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048 -nrFiles 1
+ with dfs.replication = 1 : 33,60 / 31,48 / 30,95
+ with dfs.replication = 2 : 26,40 / 20,99 / 21,70
I find something strange while reading the source code.
- The value of mapred.reduce.tasks is always set to 1: job.setNumReduceTasks(1) in the function runIOTest(), and reduceFile = new Path(WRITE_DIR, part-0) in analyzeResult(). So I think, if we properly have mapred.reduce.tasks = 2, we will have 2 paths on the file system, part-0 and part-1, e.g. /benchmarks/TestDFSIO/io_write/part-0
- And I don't understand the line with double med = rate / 1000 / tasks. Is it not double med = rate * tasks / 1000
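The per-map aggregation described in this thread can be sketched in plain Java (the method names here are illustrative, not the actual TestDFSIO source). With the corrected -fileSize 512 and 4 maps that each take the same time, the per-client throughput comes out the same as in the single-map case, which is exactly the behavior Konstantin describes:

```java
// Hypothetical sketch of the TestDFSIO reduce-side arithmetic
// (illustrative names, not the real Hadoop source).
public class Main {
    // sizes in MB, times in seconds; rate(i) = size(i) / time(i) in MB/sec
    static double throughput(double[] size, double[] time) {
        double totalSize = 0, totalTime = 0;
        for (int i = 0; i < size.length; i++) {
            totalSize += size[i];
            totalTime += time[i];
        }
        return totalSize / totalTime;      // throughput = size / time
    }

    static double averageIORate(double[] size, double[] time) {
        double rateSum = 0;
        for (int i = 0; i < size.length; i++) {
            rateSum += size[i] / time[i];  // rate = rate(0) + ... + rate(N-1)
        }
        return rateSum / size.length;      // averageIORate = rate / N
    }

    public static void main(String[] args) {
        // Four maps writing 512 MB each, all at roughly the same speed:
        double[] size = {512, 512, 512, 512};
        double[] time = {16, 16, 16, 16};
        System.out.println("throughput    = " + throughput(size, time) + " MB/sec per client");
        System.out.println("averageIORate = " + averageIORate(size, time) + " MB/sec");
    }
}
```

With these inputs both values come out to 32.0 MB/sec: total size grows 4x but so does total time, so per-client throughput stays flat when the cluster scales well.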
Re: TestDFSIO delivers bad values of throughput and average IO rate
Hi tienduc_dinh, Just a bit of background, which should help to answer your questions. TestDFSIO mappers perform one operation (read or write) each, measure the time taken by the operation, and output the following three values (I am intentionally omitting some other output stuff):
- size(i)
- time(i)
- rate(i) = size(i) / time(i)
i is the index of the map task, 0 <= i < N, where N is the -nrFiles value, which equals the number of maps. Then the reduce sums those values and writes them into part-0. That is, you get three fields in it:
size = size(0) + ... + size(N-1)
time = time(0) + ... + time(N-1)
rate = rate(0) + ... + rate(N-1)
Then we calculate:
throughput = size / time
averageIORate = rate / N
So answering your questions:
- There should be only one reduce task, otherwise you will have to manually sum corresponding values in part-0 and part-1.
- The value of the rate field after the reduce equals the sum of the individual rates of each operation. So if you want an average, you should divide it by the number of tasks rather than multiply.
Now, in your case you create only one file (-nrFiles 1), which means you run only one map task. Setting mapred.map.tasks to 10 in hadoop-site.xml defines the default number of tasks per job. See here: http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.map.tasks In the case of TestDFSIO it will be overridden by -nrFiles. Hope this answers your questions. Thanks, --Konstantin tienduc_dinh wrote: Hello, I'm now using hadoop-0.18.0 and testing it on a cluster with 1 master and 4 slaves. In hadoop-site.xml the value of mapred.map.tasks is 10. Because the values of throughput and average IO rate are similar, I just post the values of throughput of the same command over 3 runs:
- hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048 -nrFiles 1
+ with dfs.replication = 1 : 33,60 / 31,48 / 30,95
+ with dfs.replication = 2 : 26,40 / 20,99 / 21,70
I find something strange while reading the source code.
- The value of mapred.reduce.tasks is always set to 1: job.setNumReduceTasks(1) in the function runIOTest(), and reduceFile = new Path(WRITE_DIR, part-0) in analyzeResult(). So I think, if we properly have mapred.reduce.tasks = 2, we will have 2 paths on the file system, part-0 and part-1, e.g. /benchmarks/TestDFSIO/io_write/part-0
- And I don't understand the line with double med = rate / 1000 / tasks. Is it not double med = rate * tasks / 1000
Re: DFS replication and Error Recovery on failure
1) If I set the value of dfs.replication to 3 only in hadoop-site.xml of the namenode (master) and then restart the cluster, will this take effect, or do I have to change hadoop-site.xml on all slaves? dfs.replication is a name-node parameter, so you need to restart only the name-node in order to reset the value. I should mention that setting a new value will not immediately change the replication of existing blocks, because replication is per file, and you need to use setReplication to change it. For new files, though, the replication will be set to the new value automatically. 2) What can be the possible cause of the following error at a datanode? ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /mnt/hadoop28/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-hadoop/dfs/data: namenode namespaceID = 1396640905; datanode namespaceID = 820259954 namespaceID provides cluster integrity; name- and data-nodes share the same value. This either means you ran the data-nodes with another name-node, or you reformatted the name-node recently. It is better to have a dedicated directory for data-node storage rather than use tmp. If my data node goes down due to the above error, what should I do in the following scenarios: 1) I have some data on the corrupted data node that I need to recover; how can I recover that data? You should first make sure which cluster it belongs to. 2) If I don't care about the data, but I want the node back on the cluster, can I just delete /mnt/hadoop28/HADOOP/hadoop-0.16.3/tmp and include the node back in the cluster? Yes, you can remove the directory if you don't need the data. Thanks, --Konstantin
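As a sketch of the first answer: the default replication lives in hadoop-site.xml on the name-node, and it only affects files created after the change:

```xml
<!-- hadoop-site.xml on the name-node: default replication for NEW files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

For existing files, replication can be changed per path from the shell, e.g. bin/hadoop dfs -setrep -R 3 /some/path (exact syntax may differ slightly across versions; /some/path is illustrative).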
Re: Detect Dead DataNode
Sandeep Dhawan wrote: Hi, I have a setup of a 2-node Hadoop cluster running on Windows using cygwin. When I open up the web GUI to view the number of Live Nodes, it shows 2. But when I kill the slave node and refresh the GUI, it still shows the number of Live Nodes as 2. It's only after some 20-30 mins (It should be 10 minutes by default.) that the master node is able to detect the failure, which is then reflected in the GUI. It then shows up: Live Node : 1 Dead Node : 1 Also, after killing the slave datanode, if I try to copy a file from the local file system, it fails. 1. Is there a way by which we can configure the time interval after which the master node can declare a datanode as dead? You can modify heartbeat.recheck.interval, which by default is set to 5 min. The expiration time is twice this, that is, 10 min. So if you set heartbeat.recheck.interval = 1 min, then your nodes will expire in 2 minutes. 2. Why does the file transfer fail when one of the slave nodes is dead and the masternode is alive? There could be different reasons; you need to read the message returned. Thanks, --Konstantin
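The expiry arithmetic above can be written down as a tiny sketch (note this simplification ignores any extra heartbeat-interval padding that some Hadoop versions add to the expiry time):

```java
// Sketch of the data-node expiry rule described in the reply above:
// a node is declared dead after roughly twice heartbeat.recheck.interval.
public class Main {
    static long expiryMinutes(long recheckIntervalMinutes) {
        return 2 * recheckIntervalMinutes;
    }

    public static void main(String[] args) {
        System.out.println("default (5 min recheck): dead after ~" + expiryMinutes(5) + " min");
        System.out.println("1 min recheck:           dead after ~" + expiryMinutes(1) + " min");
    }
}
```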
Re: Datanode handling of single disk failure
Brian Bockelman wrote: Hello all, I'd like to take the datanode's capability to handle multiple directories to a somewhat extreme level, and get feedback on how well this might work. We have a few large RAID servers (12 to 48 disks) which we'd like to transition to Hadoop. I'd like to mount each of the disks individually (i.e., /mnt/disk1, /mnt/disk2, ...) and take advantage of Hadoop's replication, instead of paying the overhead to set up a RAID and still having to pay the overhead of replication. In my experience this is the right way to go. However, we're a bit concerned about how well Hadoop might handle one of the directories disappearing from underneath it. If a single volume, say /mnt/disk1, starts returning I/O errors, is Hadoop smart enough to figure out that this whole volume is broken? Or will we have to restart the datanode after any disk failure for it to search the directory and realize everything is broken? What happens if you start up the datanode with a data directory that it can't write into? In the current implementation, if at any point the Datanode detects an unwritable or unreadable drive, it shuts itself down, logging a message about what went wrong and reporting the problem to the name-node. So yes, if such a thing happens you will have to restart the data-node. But since the cluster takes care of data-node failures by re-replicating lost blocks, that should not be a problem. Is anyone running in this fashion (i.e., multiple data directories corresponding to different disk volumes ... even better if you're doing it with more than a few disks)? We have extensive experience running 4 drives per data-node (no RAID). So this is not something new or untested. Thanks, --Konstantin
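For reference, a hedged sketch of how per-disk directories are listed, one per physical disk, comma-separated (the mount paths here are illustrative):

```xml
<!-- hadoop-site.xml on each data-node: one storage directory per disk -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
</property>
```

The data-node spreads new blocks across the listed directories (roughly round-robin in the versions discussed in this thread, via the getNextVolume call seen in the stack traces earlier in this digest).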
Re: question: NameNode hanging on startup as it intends to leave safe mode
This is probably related to HADOOP-4795. http://issues.apache.org/jira/browse/HADOOP-4795 We are testing it on 0.18 now. It should be committed soon. Please let us know if it is something else. Thanks, --Konstantin Karl Kleinpaste wrote: We have a cluster comprised of 21 nodes holding a total capacity of about 55T, where we have had a problem twice in the last couple of weeks on startup of the NameNode. We are running 0.18.1. DFS space is currently just below the halfway point of actual occupation, about 25T. The symptom is that there is normal startup logging on the NameNode's part, where it self-analyzes its expected DFS content, reports the number of files known, and begins to accept reports from slaves' DataNodes about the blocks they hold. During this time, the NameNode is in safe mode pending adequate block discovery from slaves. As the fraction of reported blocks rises, eventually it hits the required 0.9990 threshold and announces that it will leave safe mode in 30 seconds. The problem occurs when, at the point of logging 0 seconds to leave safe mode, the NameNode hangs: It uses no more CPU; it logs nothing further; it stops responding on its port 50070 web interface; hadoop fs commands report no contact with the NameNode; netstat -atp shows a number of open connections on 9000 and 50070, indicating the connections are being accepted, but the NameNode never processes them. This has happened twice in the last 2 weeks and it has us fairly concerned. Both times, it has been adequate simply to start over again, and the NameNode successfully comes to life the 2nd time around. Is anyone else familiar with this sort of hang, and do you know of any solutions?
Re: When is decomissioning done?
Just for reference, these links: http://wiki.apache.org/hadoop/FAQ#17 http://hadoop.apache.org/core/docs/r0.19.0/hdfs_user_guide.html#DFSAdmin+Command Decommissioning does not happen at once. -refreshNodes just starts the process, but does not complete it. There could be a lot of blocks on the nodes you want to decommission, and replication takes time. The progress can be monitored on the name-node web UI. Right after -refreshNodes, on the web UI you will see that the nodes you chose for decommissioning have the state "Decommission In Progress". You should wait until it changes to "Decommissioned" and then turn the node off. --Konstantin David Hall wrote: I'm starting to think I'm doing things wrong. I have an absolute path to dfs.hosts.exclude that includes what I want decommissioned, and a dfs.hosts which includes those I want to remain commissioned (this points to the slaves file). Nothing seems to do anything... What am I missing? -- David On Thu, Dec 4, 2008 at 12:48 AM, David Hall [EMAIL PROTECTED] wrote: Hi, I'm trying to decommission some nodes. The process I tried to follow is: 1) add them to conf/excluding (hadoop-site points there) 2) invoke hadoop dfsadmin -refreshNodes This returns immediately, so I thought it was done, so I killed off the cluster and rebooted without the new nodes, but then fsck was very unhappy... Is there some way to watch the progress of decommissioning? Thanks, -- David
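A minimal sketch of the exclude setup discussed in this thread (the file path is illustrative; the property goes in hadoop-site.xml on the name-node):

```xml
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/hadoop/conf/exclude</value>
</property>
```

List one hostname per line in the exclude file, run bin/hadoop dfsadmin -refreshNodes, and then watch each node's state on the name-node web UI until it reads "Decommissioned" before shutting the node down.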
Re: Hadoop Development Status
This is very nice. A suggestion, if it is related to the development status: do you think you guys could analyze which questions are discussed most often in the mailing lists, so that we could update our FAQs based on that? Thanks, --Konstantin Alex Loddengaard wrote: Some engineers here at Cloudera have been working on a website to report on Hadoop development status, and we're happy to announce that the website is now available! We've written a blog post describing its usefulness, goals, and future, so take a look if you're interested: http://www.cloudera.com/blog/2008/11/18/introducing-hadoop-development-status/ The tool is hosted here: http://community.cloudera.com Please give us any feedback or suggestions off-list, to avoid polluting the list. Enjoy! Alex, Jeff, and Tom
Re: The Case of a Long Running Hadoop System
Bagri, According to the numbers you posted, your cluster has 6,000,000 block replicas and only 12 data-nodes. The blocks are small: on average about 78KB, according to fsck. So each node contains about 40GB worth of block data, but the number of blocks is really huge: 500,000 per node. Is my math correct? I haven't seen data-nodes that big yet. The problem here is that a data-node keeps a map of all its blocks in memory. The map is a HashMap. With 500,000 entries you can get long lookup times, I guess. And block reports can also take a long time. So I believe restarting the name-node will not help you. You should somehow pack your small files into larger ones. Alternatively, you can increase your cluster size, probably 5 to 10 times larger. I don't remember whether we had any optimization patches related to the data-node's block map since 0.15. Please advise if anybody remembers. Thanks, --Konstantin Abhijit Bagri wrote: We do not have a secondary namenode because 0.15.3 has a serious bug which truncates the namenode image if there is a failure while the namenode fetches the image from the secondary namenode. See HADOOP-3069. I have a patched version of 0.15.3 for this issue. From the patch of HADOOP-3069, the changes are on the namenode _and_ the secondary namenode, which means I just can't fire up a secondary namenode. - Bagri On Nov 15, 2008, at 11:36 PM, Billy Pearson wrote: If I understand correctly, the secondary namenode merges the edits log into the fsimage and reduces the edit log size, which is likely the root of your problems. 8.5G seems large and is likely putting a strain on your master server's memory and IO bandwidth. Why do you not have a secondary namenode? If you do not have the memory on the master, I would look into stopping a datanode/tasktracker on a server and loading the secondary namenode on it. Let it run for a while and watch the log for the secondary namenode; you should see your edit log get smaller. I am not an expert, but that would be my first action.
Billy Abhijit Bagri [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Hi, This is a long mail, as I have tried to put in as many details as might help any of the Hadoop devs/users to help us out. The gist is this: We have a long-running Hadoop system (masters not restarted for about 3 months). We have recently started seeing the DFS respond very slowly, which has resulted in failures in a system which depends on Hadoop. Further, the DFS seems to be in an unstable state (i.e. if fsck is a good representation, which I believe it is). The edits file has also grown very large. These are the details (skip/return here later and jump to the questions at the end of the mail for a quicker read): Hadoop version: 0.15.3 on 32-bit systems. Number of slaves: 12 Slaves heap size: 1G Namenode heap: 2G Jobtracker heap: 2G The namenode and jobtracker have not been restarted for about 3 months. We did restart the slaves (all of them within a few hours) a few times for some maintenance in between, though. We do not have a secondary namenode in place. There is another system X which talks to this hadoop cluster. X writes to the Hadoop DFS and submits jobs to the Jobtracker. The number of jobs submitted to Hadoop so far is over 650,000 (I am using the job id for this); each job may read/write to multiple files and has several dependent libraries which it loads from the Distributed Cache. Recently, we started seeing several timeouts happening while X tries to read/write to the DFS. This in turn results in the DFS becoming very slow in response. The writes are especially slow.
The trace we get in the logs is:

java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at java.net.SocketInputStream.read(SocketInputStream.java:182)
	at java.io.DataInputStream.readShort(DataInputStream.java:284)
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
	...

Also, datanode logs show a lot of traces like these:

2008-11-14 21:21:49,429 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-1310124865741110666 is valid, and cannot be written to.
	at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
	at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:1257)
	at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
	at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
	at java.lang.Thread.run(Thread.java:595)

and these:

2008-11-14 21:21:50,695 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Error in deleting
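The arithmetic in Konstantin's reply at the top of this thread is easy to check (the inputs come straight from the thread; this is plain arithmetic, not Hadoop code):

```java
// Back-of-the-envelope check of the data-node numbers in the reply above.
public class Main {
    static long blocksPerNode(long totalReplicas, long nodes) {
        return totalReplicas / nodes;
    }

    static long gbPerNode(long blocksPerNode, long avgBlockKB) {
        return blocksPerNode * avgBlockKB / 1_000_000; // KB -> GB (decimal)
    }

    public static void main(String[] args) {
        long perNode = blocksPerNode(6_000_000L, 12); // 6M replicas over 12 nodes
        long gb = gbPerNode(perNode, 78);             // avg block size ~78KB per fsck
        System.out.println(perNode + " blocks/node, ~" + gb + " GB of block data/node");
    }
}
```

This gives 500,000 blocks per node and ~39 GB of block data per node, matching the "about 40GB" and "500,000 per node" figures in the reply: lots of HashMap entries, but not much actual data.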
Re: SecondaryNameNode on separate machine
You can either do what you just described with dfs.name.dir = dirX, or you can start the name-node with the -importCheckpoint option. This is an automation for copying image files from the secondary to the primary. See here: http://hadoop.apache.org/core/docs/current/commands_manual.html#namenode http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Secondary+NameNode http://issues.apache.org/jira/browse/HADOOP-2585#action_12584755 --Konstantin Tomislav Poljak wrote: Hi, Thank you all for your time and your answers! Now SecondaryNameNode connects to the NameNode (after I configured dfs.http.address to the NN's http server - NN hostname on port 50070) and creates (transfers) edits and fsimage from the NameNode. Can you explain a little bit more how NameNode failover should work now? For example, SecondaryNameNode now stores fsimage and edits to (SNN's) dirX, and let's say the NameNode goes down (disk becomes unreadable). Now I create/dedicate a new machine for the NameNode (also change DNS to point to this new NameNode machine as the nameNode host) and take the data in dirX from the SNN and copy it to the new NameNode. How do I configure the new NameNode to use the data from dirX (do I configure dfs.name.dir to point to dirX and start the new NameNode)? Thanks, Tomislav On Fri, 2008-10-31 at 11:38 -0700, Konstantin Shvachko wrote: True, dfs.http.address is the NN Web UI address. This is where the NN http server runs. Besides the Web UI, there is also a servlet running on that server which is used to transfer image and edits from the NN to the secondary using http get. So the SNN uses both addresses, fs.default.name and dfs.http.address. When the SNN finishes the checkpoint, the primary needs to transfer the resulting image back. This is done via the http server running on the SNN.
Answering Tomislav's question: The difference between fs.default.name and dfs.http.address is that fs.default.name is the name-node's RPC address, where clients and data-nodes connect to, while dfs.http.address is the NN's http server address, where our browsers connect to, but it is also used for transferring image and edits files. --Konstantin Otis Gospodnetic wrote: Konstantin & Co., please correct me if I'm wrong, but looking at hadoop-default.xml makes me think that dfs.http.address is only the URL for the NN *Web UI*. In other words, this is where we people go look at the NN. The secondary NN must then be using only the primary NN URL specified in fs.default.name. This URL looks like hdfs://name-node-hostname-here/. Something in Hadoop then knows the exact port for the primary NN based on the URI schema (e.g. hdfs://) in this URL. Is this correct? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Tomislav Poljak [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Thursday, October 30, 2008 1:52:18 PM Subject: Re: SecondaryNameNode on separate machine Hi, can you, please, explain the difference between fs.default.name and dfs.http.address (like how and when the SecondaryNameNode uses fs.default.name and how/when dfs.http.address)? I have set them both to the same (namenode's) hostname:port. Is this correct (or does dfs.http.address need some other port)? Thanks, Tomislav On Wed, 2008-10-29 at 16:10 -0700, Konstantin Shvachko wrote: SecondaryNameNode uses the http protocol to transfer the image and the edits from the primary name-node and vice versa. So the secondary does not access local files on the primary directly. The primary NN should know the secondary's http address. And the secondary NN needs to know both fs.default.name and dfs.http.address of the primary. In general we usually create one configuration file hadoop-site.xml and copy it to all other machines.
So you don't need to set up different values for all servers. Regards, --Konstantin Tomislav Poljak wrote: Hi, I'm not clear on how SecondaryNameNode communicates with NameNode (if deployed on a separate machine). Does SecondaryNameNode use a direct connection (over some port and protocol), or is it enough for SecondaryNameNode to have access to the data which NameNode writes locally to disk? Tomislav On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote: I think a lot of the confusion comes from this thread: http://www.nabble.com/NameNode-failover-procedure-td11711842.html Particularly because the wiki was updated with wrong information, not maliciously I'm sure. This information is now gone for good. Otis, your solution is pretty much like the one given by Dhruba Borthakur and augmented by Konstantin Shvachko later in the thread, but I never did it myself. One thing should be clear though: the NN is and will remain a SPOF (just like HBase's Master) as long as a distributed manager service (like Zookeeper) is not plugged into Hadoop to help with failover. J-D On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, So what is the recipe for avoiding NN SPOF using only what comes with Hadoop
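Pulling together the settings mentioned in this thread, a hedged sketch of the secondary's hadoop-site.xml (host names, ports, and paths are illustrative; fs.checkpoint.period is in seconds):

```xml
<!-- hadoop-site.xml on the SecondaryNameNode machine -->
<!-- primary's RPC address, where clients and data-nodes connect -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000</value>
</property>
<!-- primary's http server, used to fetch image and edits -->
<property>
  <name>dfs.http.address</name>
  <value>namenode-host:50070</value>
</property>
<!-- where the secondary stores its checkpoint -->
<property>
  <name>fs.checkpoint.dir</name>
  <value>/path/to/checkpoint/dir</value>
</property>
<!-- checkpoint interval in seconds -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>
```

As noted above, the usual practice is to put these in one hadoop-site.xml and copy it to all machines, since none of these values conflict across nodes.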
Re: SecondaryNameNode on separate machine
SecondaryNameNode uses the http protocol to transfer the image and the edits from the primary name-node and vice versa. So the secondary does not access local files on the primary directly. The primary NN should know the secondary's http address. And the secondary NN needs to know both fs.default.name and dfs.http.address of the primary. In general we usually create one configuration file hadoop-site.xml and copy it to all other machines. So you don't need to set up different values for all servers. Regards, --Konstantin Tomislav Poljak wrote: Hi, I'm not clear on how SecondaryNameNode communicates with NameNode (if deployed on a separate machine). Does SecondaryNameNode use a direct connection (over some port and protocol), or is it enough for SecondaryNameNode to have access to the data which NameNode writes locally to disk? Tomislav On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote: I think a lot of the confusion comes from this thread: http://www.nabble.com/NameNode-failover-procedure-td11711842.html Particularly because the wiki was updated with wrong information, not maliciously I'm sure. This information is now gone for good. Otis, your solution is pretty much like the one given by Dhruba Borthakur and augmented by Konstantin Shvachko later in the thread, but I never did it myself. One thing should be clear though: the NN is and will remain a SPOF (just like HBase's Master) as long as a distributed manager service (like Zookeeper) is not plugged into Hadoop to help with failover. J-D On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, So what is the recipe for avoiding NN SPOF using only what comes with Hadoop? From what I can tell, I think one has to do the following two things: 1) configure the primary NN to save namespace and xa logs to multiple dirs, one of which is actually on a remotely mounted disk, so that the data actually lives on a separate disk on a separate box.
This saves the namespace and xa logs on multiple boxes in case of primary NN hardware failure. 2) configure the secondary NN to periodically merge fsimage+edits and create the fsimage checkpoint. This really is a second NN process running on another box. It sounds like this secondary NN has to somehow have access to the fsimage and edits files from the primary NN server. http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode does not describe the best practice around that - the recommended way to give the secondary NN access to the primary NN's fsimage and edits files. Should one mount a disk from the primary NN box to the secondary NN box to get access to those files? Or is there a simpler way? In any case, this checkpoint is just a merge of fsimage+edits files, and again is there in case the box with the primary NN dies. That's what's described on http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode more or less. Is this sufficient, or are there other things one has to do to eliminate the NN SPOF? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Jean-Daniel Cryans [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, October 28, 2008 8:14:44 PM Subject: Re: SecondaryNameNode on separate machine Tomislav. Contrary to popular belief, the secondary namenode does not provide failover; it's only used to do what is described here: http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode So the term "secondary" does not mean "a second one" but is more like "a second part of". J-D On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote: Hi, I'm trying to implement NameNode failover (or at least NameNode local data backup), but it is hard since there is no official documentation.
Pages on this subject have been created, but are still empty: http://wiki.apache.org/hadoop/NameNodeFailover http://wiki.apache.org/hadoop/SecondaryNameNode I have been browsing the web and the hadoop mailing list to see how this should be implemented, but I got even more confused. People are asking whether we even need a SecondaryNameNode etc. (since the NameNode can write local data to multiple locations, so one of those locations can be a mounted disk from another machine). I think I understand the motivation for the SecondaryNameNode (to create a snapshot of NameNode data every n seconds/hours), but setting up (deploying and running) the SecondaryNameNode on a different machine than the NameNode is not as trivial as I expected. First I found that if I need to run the SecondaryNameNode on another machine than the NameNode, I should change the masters file on the NameNode (change localhost to the SecondaryNameNode host) and set some properties in hadoop-site.xml on the SecondaryNameNode (fs.default.name, fs.checkpoint.dir, fs.checkpoint.period etc.) This was enough to start the SecondaryNameNode when starting the NameNode with bin/start-dfs.sh, but it didn't create an image on the SecondaryNameNode. Then I found that I need to set
Re: adding more datanode
You just start the new data-node while the cluster is running, using bin/hadoop datanode. The configuration on the new data-node should be the same as on the other nodes. The data-node should join the cluster automatically. Formatting will destroy your file system. --Konstantin David Wei wrote: Well, in my cluster, I do this: 1. Add the new machines into conf/slaves on the master machine 2. On the new nodes, run the format command 3. Back on the master, run start-all.sh 4. Run start-balancer.sh, still on the master Then I got the new nodes inside my cluster and no need to reboot the whole system. Hopefully this will help. ;=) Ski Gh3 wrote: I'm not sure I get this. 1. If you format the filesystem (which I thought is usually executed on the master node, but anyway) don't you erase all your data? 2. I guess I need to add the new machine to the conf/slaves file, but then do I run start-all.sh again from the master node while my cluster is already running? Thanks! On Mon, Oct 20, 2008 at 5:59 PM, David Wei [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: this is quite easy. You can just configure your new datanodes like the others and format the filesystem before you start them. Remember to make them ssh-able for your master and run ./bin/start-all.sh on the master machine if you want to start all the daemons. This will start and add the new datanodes to the up-and-running cluster. Hopefully my info will be of help. Ski Gh3 wrote: hi, I am wondering how to add more datanodes to an up-and-running hadoop instance? Couldn't find instructions on this on the wiki page. Thanks!
Re: dfs i/o stats
We use TestDFSIO for measuring IO performance on our clusters. It is called a test, but in fact it's a benchmark. It runs a map-reduce job which either writes to or reads from files and collects statistics. Another thing is that Hadoop automatically collects metrics, like the number of creates, deletes, ls's, etc. Here are some links: http://wiki.apache.org/hadoop/GangliaMetrics http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNodeMetrics.html http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/FSNamesystemMetrics.html Hope this is helpful. --Konstantin Shirley Cohen wrote: Hi, I would like to measure the disk I/O performance of our hadoop cluster. However, running iostat on 16 nodes is rather cumbersome. Does dfs keep track of any stats, like the number of blocks or bytes read and written? From scanning the API, I found a class called org.apache.hadoop.fs.FileSystem.Statistics that could be relevant. Does anyone know if this is what I'm looking for? Thanks, Shirley
Re: Small Filesizes
Peter, You are likely to hit memory limitations on the name-node. With 100 million small files it will need to support 200 million objects, which will require roughly 30 GB of RAM on the name-node. You may also consider hadoop archives, or present your files as a collection of records and use Pig, Hive, etc. --Konstantin

Brian Vargas wrote: Peter, In my testing with files of that size (well, larger, but still well below the block size) it was impossible to achieve any real throughput on the data because of the overhead of looking up the locations of all those files on the NameNode. Your application spends so much time looking up file names that most of the CPUs sit idle. A simple solution is to just load all of the small files into a sequence file, and process the sequence file instead. Brian

Peter McTaggart wrote: Hi All, I am considering using HDFS for an application that potentially has many small files, i.e. 10-100 million files with an estimated average file size of 50-100 KB (perhaps smaller), and is an online interactive application. All of the documentation I have seen suggests that a block size of 64-128 MB works best for Hadoop/HDFS and that it is best used for batch-oriented applications. Does anyone have any experience using it for files of this size in an online application environment? Is it worth pursuing HDFS for this type of application? Thanks, Peter
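A back-of-the-envelope sketch of that memory estimate. The ~150 bytes per namespace object constant is an assumption inferred from the figures above (200 million objects needing roughly 30 GB), not an official number:

```java
public class NameNodeMemEstimate {
    // Assumed average heap cost per namespace object (file or block),
    // derived from "200 million objects ~ 30 GB" above.
    static final long BYTES_PER_OBJECT = 150;

    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // files plus their blocks
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 100 million single-block files -> 200 million objects
        long bytes = estimateBytes(100000000L, 1);
        System.out.println(bytes / 1000000000L + " GB"); // prints: 30 GB
    }
}
```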
Re: is SecondaryNameNode in support for the NameNode?
NameNodeFailover http://wiki.apache.org/hadoop/NameNodeFailover, with a SecondaryNameNode http://wiki.apache.org/hadoop/SecondaryNameNode hosted... I think it is wrong, please correct it.

You are probably looking at cached results. Neither page exists anymore. The first one was a cause of confusion and was removed. Regards, --Konstantin

2008/9/6, Jean-Daniel Cryans [EMAIL PROTECTED]: Hi, See http://wiki.apache.org/hadoop/FAQ#7 and http://hadoop.apache.org/core/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode Regards, J-D

On Sat, Sep 6, 2008 at 5:26 AM, ??? [EMAIL PROTECTED] wrote: Hi all! The NameNode is a Single Point of Failure for the HDFS Cluster. There is support for NameNodeFailover, with a SecondaryNameNode hosted on a separate machine being able to stand in for the original NameNode if it goes down. Is that right? Does the SecondaryNameNode act as a backup for the NameNode? Sorry for my English!!
Re: Hadoop over Lustre?
Great! If you decide to run TestDFSIO on your cluster, please let me know. I'll run the same at the same scale with hdfs and we can compare the numbers. --Konstantin

Joel Welling wrote: That seems to have done the trick! I am now running Hadoop 0.18 straight out of Lustre, without an intervening HDFS. The unusual things about my hadoop-site.xml are:

<property>
  <name>fs.default.name</name>
  <value>file:///bessemer/welling</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>${fs.default.name}/hadoop_tmp/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>

where /bessemer/welling is a directory on a mounted Lustre filesystem. I then do 'bin/start-mapred.sh' (without starting dfs), and I can run Hadoop programs normally. I do have to specify full input and output file paths; they don't seem to be relative to fs.default.name. That's not too troublesome, though. Thanks very much! -Joel [EMAIL PROTECTED]

On Fri, 2008-08-29 at 10:52 -0700, Owen O'Malley wrote: Check the setting for mapred.system.dir. This needs to be a path that is on a distributed file system. In old versions of Hadoop, it had to be on the default file system, but that is no longer true. In recent versions, the system dir only needs to be configured on the JobTracker and it is passed to the TaskTrackers and clients.
Re: restarting datanode corrupts the hdfs
I can see 3 reasons for that:
1. dfs.data.dir is pointing to a wrong data-node storage directory, or
2. somebody manually moved the directory hadoop into /home/hadoop/dfs/tmp/, which is supposed to contain only block files named blk_<number>, or
3. there is some collision of configuration variables so that the same directory /home/hadoop/dfs/ is used by different servers (e.g. the data-node and the task tracker) on your single-node cluster.

To save the hdfs data you can manually remove hadoop from /home/hadoop/dfs/tmp/ and then restart the data-node. Or you can manually remove tmp from /home/hadoop/dfs/. In the latter case you risk losing some of the latest blocks, but not the whole file system. --Konstantin

Barry Haddow wrote: Hi, Since upgrading to 0.18.0 I've noticed that restarting the datanode corrupts the hdfs, so that the only option is to delete it and start again. I'm running hadoop in distributed mode, on a single host. It runs as the user hadoop and the hdfs is contained in a directory /home/hadoop/dfs. When I restart hadoop using start-all.sh the datanode fails with the following message:

STARTUP_MSG: args = []
STARTUP_MSG: version = 0.18.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 686010; compiled by 'hadoopqa' on Thu Aug 14 19:48:33 UTC 2008 /
2008-09-01 12:06:55,871 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Found /home/hadoop/dfs/tmp/hadoop in /home/hadoop/dfs/tmp but it is not a file.
    at org.apache.hadoop.dfs.FSDataset$FSVolume.recoverDetachedBlocks(FSDataset.java:437)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.init(FSDataset.java:310)
    at org.apache.hadoop.dfs.FSDataset.init(FSDataset.java:671)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:277)
    at org.apache.hadoop.dfs.DataNode.init(DataNode.java:190)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2987)
    at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2942)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2950)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3072)
2008-09-01 12:06:55,872 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:

Running an fsck on the hdfs shows that it is corrupt, and the only way to fix it seems to be to delete it and reformat. Any suggestions? Regards, Barry
Re: Hadoop over Lustre?
mapred.job.tracker is the address and port of the JobTracker - the main server that controls map-reduce jobs. Every task tracker needs to know the address in order to connect. Do you follow the docs, e.g. this one: http://wiki.apache.org/hadoop/GettingStartedWithHadoop Can you start a one-node cluster? Are there standard tests of hadoop performance? There is the sort benchmark. We also run the DFSIO benchmark for read and write throughputs. --Konstantin

Joel Welling wrote: So far no success, Konstantin- the hadoop job seems to start up, but fails immediately leaving no logs. What is the appropriate setting for mapred.job.tracker? The generic value references hdfs, but it also has a port number- I'm not sure what that means. My cluster is small, but if I get this working I'd be very happy to run some benchmarks. Are there standard tests of hadoop performance? -Joel [EMAIL PROTECTED]

On Fri, 2008-08-22 at 15:59 -0700, Konstantin Shvachko wrote: I think the solution should be easier than Arun and Steve advise. Lustre is already mounted as a local directory on each cluster machine, right? Say it is mounted on /mnt/lustre. Then you configure hadoop-site.xml and set

<property>
  <name>fs.default.name</name>
  <value>file:///mnt/lustre</value>
</property>

And then you start map-reduce only, without hdfs, using start-mapred.sh. By this you basically redirect all FileSystem requests to Lustre and you don't need data-nodes or the name-node. Please let me know if that works. Also it would be very interesting to have your experience shared on this list. Problems, performance - everything is quite interesting. Cheers, --Konstantin

Joel Welling wrote: 2. Could you set up symlinks from the local filesystem, so point every node at a local dir /tmp/hadoop with each node pointing to a different subdir in the big filesystem? Yes, I could do that! Do I need to do it for the log directories as well, or can they be shared? -Joel

On Fri, 2008-08-22 at 15:48 +0100, Steve Loughran wrote: Joel Welling wrote: Thanks, Steve and Arun. I'll definitely try to write something based on the KFS interface. I think that for our applications putting the mapper on the right rack is not going to be that useful. A lot of our calculations are going to be disordered stuff based on 3D spatial relationships like nearest-neighbor finding, so things will be in a random access pattern most of the time. Is there a way to set up the configuration for HDFS so that different datanodes keep their data in different directories? That would be a big help in the short term.

Yes, but you'd have to push out a different config to each datanode. 1. I have some stuff that could help there, but it's not ready for production use yet [1]. 2. Could you set up symlinks from the local filesystem, so point every node at a local dir /tmp/hadoop with each node pointing to a different subdir in the big filesystem?

[1] http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf
Re: Hadoop over Lustre?
I think the solution should be easier than Arun and Steve advise. Lustre is already mounted as a local directory on each cluster machine, right? Say it is mounted on /mnt/lustre. Then you configure hadoop-site.xml and set

<property>
  <name>fs.default.name</name>
  <value>file:///mnt/lustre</value>
</property>

And then you start map-reduce only, without hdfs, using start-mapred.sh. By this you basically redirect all FileSystem requests to Lustre and you don't need data-nodes or the name-node. Please let me know if that works. Also it would be very interesting to have your experience shared on this list. Problems, performance - everything is quite interesting. Cheers, --Konstantin

Joel Welling wrote: 2. Could you set up symlinks from the local filesystem, so point every node at a local dir /tmp/hadoop with each node pointing to a different subdir in the big filesystem? Yes, I could do that! Do I need to do it for the log directories as well, or can they be shared? -Joel

On Fri, 2008-08-22 at 15:48 +0100, Steve Loughran wrote: Joel Welling wrote: Thanks, Steve and Arun. I'll definitely try to write something based on the KFS interface. I think that for our applications putting the mapper on the right rack is not going to be that useful. A lot of our calculations are going to be disordered stuff based on 3D spatial relationships like nearest-neighbor finding, so things will be in a random access pattern most of the time. Is there a way to set up the configuration for HDFS so that different datanodes keep their data in different directories? That would be a big help in the short term.

Yes, but you'd have to push out a different config to each datanode. 1. I have some stuff that could help there, but it's not ready for production use yet [1]. 2. Could you set up symlinks from the local filesystem, so point every node at a local dir /tmp/hadoop with each node pointing to a different subdir in the big filesystem?

[1] http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf
Re: When will hadoop version 0.18 be released?
I don't think HADOOP-3781 will be fixed. Here is the complete list of what is going to be fixed in 0.18 https://issues.apache.org/jira/secure/IssueNavigator.jspa?fixfor=12312972 --Konstantin Thibaut_ wrote: Will this bug (https://issues.apache.org/jira/browse/HADOOP-3781) also be fixed, which makes it impossible to use the distributed jar file with any external application? (Works only with a local recompile) Thibaut Konstantin Shvachko wrote: But you won't get append in 0.18. It was committed for 0.19. --konstantin Arun C Murthy wrote: On Aug 12, 2008, at 11:51 PM, 11 Nov. wrote: Hi colleagues, As you know, the append writer will be available in version 0.18. We are here waiting for the feature and want to know the rough time of release. It's currently under vote, it should be released by the end of the week if it passes. Arun
Re: When will hadoop version 0.18 be released?
But you won't get append in 0.18. It was committed for 0.19. --konstantin Arun C Murthy wrote: On Aug 12, 2008, at 11:51 PM, 11 Nov. wrote: Hi colleagues, As you know, the append writer will be available in version 0.18. We are here waiting for the feature and want to know the rough time of release. It's currently under vote, it should be released by the end of the week if it passes. Arun
Confusing NameNodeFailover page in Hadoop Wiki
I was wandering around the Hadoop wiki and found this page dedicated to name-node failover: http://wiki.apache.org/hadoop/NameNodeFailover I think it is confusing, contradicts other documentation on the subject, and contains incorrect facts. See http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Secondary+Namenode http://wiki.apache.org/hadoop/FAQ#7 Besides, it contains some kind of discussion. It is not that I am against discussions; let's have them on this list. But I was trying to understand where all the confusion about secondary-node issues comes from lately... Imho we either need to correct it or remove it. Thanks, --Konstantin
Re: Hadoop 4 disks per server
On hdfs see http://wiki.apache.org/hadoop/FAQ#15 In addition to James's suggestion you can also specify dfs.name.dir for the name-node to store extra copies of the namespace.

James Moore wrote: On Tue, Jul 29, 2008 at 6:37 PM, Rafael Turk [EMAIL PROTECTED] wrote: Hi All, I'm setting up a cluster with 4 disks per server. Is there any way to make Hadoop aware of this setup and benefit from it? I believe all you need to do is give four directories (one on each drive) as the value for dfs.data.dir and mapred.local.dir. Something like:

<property>
  <name>dfs.data.dir</name>
  <value>/drive1/myDfsDir,/drive2/myDfsDir,/drive3/myDfsDir,/drive4/myDfsDir</value>
  <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>
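Konstantin's dfs.name.dir suggestion can be sketched the same way; the directory paths here are placeholders:

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/drive1/nameDir,/drive2/nameDir</value>
  <description>Comma-delimited list of directories where the name-node
  stores redundant copies of the namespace image and edits log.</description>
</property>
```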
Re: corrupted fsimage and edits
You should also run a secondary name-node, which does namespace checkpoints and shrinks the edits log file. And this is exactly the case when the checkpoint image comes in handy. http://wiki.apache.org/hadoop/FAQ#7 In the recent release you can start the primary node using the secondary image directly. In the old releases you need to move some files around. --Konstantin

Raghu Angadi wrote: Torsten Curdt wrote: On Jul 30, 2008, at 20:35, Raghu Angadi wrote: You should always have more than one location (preferably on different disks) for fsimage and editslog. On production we do frequent backups. Is there a mechanism inside hadoop now to do something like that? The "more than one location" bit sounds a little like that. You can specify multiple directories for dfs.name.dir, in which case fsimage and editslog are written to multiple places. If one of these goes bad, you can use the other one. See http://wiki.apache.org/hadoop/FAQ#15 Raghu. A few months back I had a proposal to keep checksums for each record in fsimage and editslog, so that the NameNode would recover transparently from such corruptions when there is more than one copy available. It didn't come up in priority since no such failures were observed. You should certainly report these cases; it will help the feature gain more traction. Will file a bug report tomorrow. cheers -- Torsten
Re: Inconsistency in namenode's and datanode's namespaceID
Yes, this is a known bug. http://issues.apache.org/jira/browse/HADOOP-1212 You should manually remove the current directory from every data-node after reformatting the name-node and start the cluster again. I do not believe there is any other way. Thanks, --Konstantin

Taeho Kang wrote: No, I don't think it's a bug. Your datanodes' data partition/directory was probably used in another HDFS setup and thus had another namespaceID. Or you could have used another partition/directory for your new HDFS setup by setting different values for dfs.data.dir on your datanode. But in this case, you can't access your old HDFS's data.

On Thu, Jul 3, 2008 at 4:21 AM, Xuan Dzung Doan [EMAIL PROTECTED] wrote: I was following the quickstart guide to run pseudo-distributed operations with Hadoop 0.16.4. I got it to work successfully the first time. But I failed to repeat the steps (I tried to re-do everything from re-formatting the HDFS). Then by looking at the log files of the daemons, I found out the datanode failed to start because its namespaceID didn't match the namenode's. After that I found that the namespaceID is stored in the text file VERSION under dfs/data/current and dfs/name/current for the datanode and the namenode, respectively. The reformatting step changes the namespaceID of the namenode, but not of the datanode, and that's the cause of the inconsistency. So after reformatting, if I manually update the namespaceID of the datanode, things will work totally fine again. I guess there are probably others who have had this same experience. Is it a bug in Hadoop 0.16.4? If so, has it been taken care of in later versions? Thanks, David.
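Konstantin's manual fix might look like this on each data-node. The storage path is a placeholder for your dfs.data.dir value, and note that this discards the blocks stored on that node:

```shell
# On EACH data-node, after the name-node has been reformatted.
# WARNING: this deletes the blocks stored on this data-node.
bin/hadoop-daemon.sh stop datanode
rm -rf /home/hadoop/dfs/data/current   # placeholder: use your dfs.data.dir
bin/hadoop-daemon.sh start datanode
```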
Re: HDFS blocks
lohit wrote: 1. Can we have multiple files in DFS use different block sizes? No, currently this might not be possible; we have fixed-size blocks.

Actually you can. HDFS provides an api to specify the block size when you create a file. Here is the link: http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long,%20org.apache.hadoop.util.Progressable) This should probably be in the FAQ.

2. If we use the default block size for these small chunks, is the DFS space wasted? DFS space is not wasted; all the blocks are stored on the individual datanodes' filesystems as is. But you would be wasting the NameNode's namespace. The NameNode holds the entire namespace in memory, so instead of using 1 file with a 128M block, if you create multiple files of size 6M you would have that many more entries.

If not, then does it mean that a single DFS block can hold data from more than one file? A DFS block cannot hold data from more than one file. If your file size is, say, 5M, which is less than your default block size of, say, 128M, then the block stored in DFS would be 5M alone. To overcome this, people usually run a map/reduce job with 1 reducer and an identity mapper, which basically merges all the small files into one file. In hadoop 0.18 we have archives, and once HADOOP-1700 is done, one could open the file to append to it. Thanks, Lohit

- Original Message From: Goel, Ankur [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, June 27, 2008 2:27:57 AM Subject: HDFS blocks

Hi Folks, I have a setup wherein I am streaming data into HDFS from a remote location and creating new files every X min. The file generated is of a very small size (512 KB - 6 MB). Since that is the size range, the streaming code sets the block size to 6 MB whereas the default that we have set for the cluster is 128 MB. The idea behind such a thing is to generate small temporal data chunks from multiple sources and merge them periodically into a big chunk with our default (128 MB) block size. The webUI for DFS reports the block size for these files to be 6 MB. My questions are: 1. Can we have multiple files in DFS use different block sizes? 2. If we use the default block size for these small chunks, is the DFS space wasted? If not, then does it mean that a single DFS block can hold data from more than one file? Thanks -Ankur
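For illustration, the create() overload linked above, which takes a per-file block size, could be used like this. This is a hedged sketch, not Ankur's actual code: the path, replication factor, and payload are made up, and it needs a running cluster plus the Hadoop jars on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create one file with a 6 MB block size while the cluster
// default stays at 128 MB.
public class PerFileBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = 3;                   // illustrative value
        long smallBlock = 6L * 1024 * 1024;      // 6 MB for the temporal chunks

        FSDataOutputStream out = fs.create(new Path("/tmp/small-chunk"),
                true, bufferSize, replication, smallBlock);
        out.writeBytes("streamed records go here\n");
        out.close();
    }
}
```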
Re: realtime hadoop
Also HDFS might be critical since to access your data you need to close the file

Not anymore. Since 0.16, files are readable while being written to.

it as fast as possible. I need to be able to maintain some guaranteed max. processing time, for example under 3 minutes.

It looks like you do not need very strict guarantees. I think you can use hdfs as your data storage. I don't know what kind of data processing you do, but I agree with Stefan that map-reduce is designed for batch tasks rather than for real-time processing.

Stefan Groschupf wrote: Hadoop might be the wrong technology for you. Map Reduce is a batch processing mechanism. Also HDFS might be critical, since to access your data you need to close the file; that means you might have many small files, a situation where hdfs is not very strong (the namespace is held in memory). Hbase might be an interesting tool for you, also zookeeper if you want to do something home-grown...

On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote: Hi! I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second and I would like to use a hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed max. processing time, for example under 3 minutes. Does anybody have experience with using Hadoop in such a manner? I will appreciate it if you can share your experience or give me pointers to some articles or pages on the subject. Vadim ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: hadoop file system error
Did you close those files? If not, they may be empty.

??? wrote: Dears, I use hadoop-0.16.4 to do some work and found an error for which I can't find the reason. The scenario is like this: in the reduce step, instead of using OutputCollector to write the result, I use FSDataOutputStream to write results to files on HDFS (because I want to split the result by some rules). After the job finished, I found that *some* files (but not all) are empty on HDFS. But I'm sure that in the reduce step the files are not empty, since I added some logs to read the generated files. It seems that some files' contents are lost after the reduce step. Has anyone happened to face such errors? Or is it a hadoop bug? Please help me find the reason if you know it. Thanks. Regards, Guangfeng
Re: dfs put fails
Looks like the client machine from which you call -put cannot connect to the data-nodes. It could be a firewall, or wrong configuration parameters that you use for the client.

Alexander Arimond wrote: hi, I'm new to hadoop and I'm just testing it at the moment. I set up a cluster with 2 nodes and it seems like they are running normally; the log files of the namenode and the datanodes don't show errors. The firewall should be set right. But when I try to upload a file to the dfs I get the following message:

[EMAIL PROTECTED]:~/hadoop$ bin/hadoop dfs -put file.txt file.txt
08/06/12 14:44:19 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:19 INFO dfs.DFSClient: Abandoning block blk_5837981856060447217
08/06/12 14:44:28 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:28 INFO dfs.DFSClient: Abandoning block blk_2573458924311304120
08/06/12 14:44:37 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:37 INFO dfs.DFSClient: Abandoning block blk_1207459436305221119
08/06/12 14:44:46 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:46 INFO dfs.DFSClient: Abandoning block blk_-8263828216969765661
08/06/12 14:44:52 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
08/06/12 14:44:52 WARN dfs.DFSClient: Error Recovery for block blk_-8263828216969765661 bad datanode[0]

Don't know what that means and didn't find anything about it. Hope somebody can help with that. Thank you!
Re: Best practices for handling many small files
Would the new archive feature HADOOP-3307 that is currently being developed help this problem? http://issues.apache.org/jira/browse/HADOOP-3307 --Konstantin

Subramaniam Krishnan wrote: We have actually written a custom Multi File Splitter that collapses all the small files into a single split till the DFS Block Size is hit. We also take care of handling big files by splitting them on Block Size and adding up all the remainders (if any) into a single split. It works great for us :-) We are working on optimizing it further to club all the small files on a single data node together so that the Map can have maximum local data. We plan to share this (provided it's found acceptable, of course) once this is done. Regards, Subru

Stuart Sierra wrote: Thanks for the advice, everyone. I'm going to go with #2, packing my million files into a small number of SequenceFiles. This is slow, but only has to be done once. My datacenter is Amazon Web Services :), so storing a few large, compressed files is the easiest way to go. My code, if anyone's interested, is here: http://stuartsierra.com/2008/04/24/a-million-little-files -Stuart altlaw.org

On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra [EMAIL PROTECTED] wrote: Hello all, Hadoop newbie here, asking: what's the preferred way to handle large (~1 million) collections of small files (10 to 100KB) in which each file is a single record? 1. Ignore it, let Hadoop create a million Map processes; 2. Pack all the files into a single SequenceFile; or 3. Something else? I started writing code to do #2, transforming a big tar.bz2 into a BLOCK-compressed SequenceFile, with the file names as keys. Will that work? Thanks, -Stuart, altlaw.org
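Option #2 above can be sketched roughly as follows. This is a hedged illustration, not Stuart's actual code: the input directory and output path are placeholders, and it needs the Hadoop jars plus a configured FileSystem to run:

```java
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a directory of small local files into one BLOCK-compressed
// SequenceFile: file name as key, file bytes as value.
public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                new Path("/packed/files.seq"), Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        for (File f : new File("/local/small-files").listFiles()) {
            byte[] buf = new byte[(int) f.length()];
            FileInputStream in = new FileInputStream(f);
            in.read(buf); // files are small, so a single read suffices here
            in.close();
            writer.append(new Text(f.getName()), new BytesWritable(buf));
        }
        writer.close();
    }
}
```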
Re: TestDU.testDU() throws assertionfailederror
Edward, testDU() writes a 32K file to the local fs and then verifies whether the value reported by du changes by exactly the amount written. Although this is true for most block-oriented file systems, it might not be true for some. I suspect that in your case the file is written to tmpfs, which is an in-memory fs and thus creates an inode, a directory entry, and maybe something else in memory (4K total) in addition to the actual data. That is why du returns a different value and the test fails. Although TestDU is not universal, we still want it to run in order to prevent bugs in DU on real file systems. Thanks, --Konstantin

Edward J. Yoon wrote: Hi, community. On my local computer (CentOS 5), TestDU.testDU() throws an AssertionFailedError. What is the extra 4096 bytes?

Testsuite: org.apache.hadoop.fs.TestDU
Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.143 sec
Testcase: testDU took 5.138 sec
FAILED expected:32768 but was:36864
junit.framework.AssertionFailedError: expected:32768 but was:36864
    at org.apache.hadoop.fs.TestDU.testDU(TestDU.java:77)

[EMAIL PROTECTED] hadoop]# df -T
Filesystem    Type     1K-blocks     Used  Available Use% Mounted on
/dev/sda1     ext3      52980472 11103380   39185808  23% /
none          tmpfs       516924        0     516924   0% /dev/shm

Thanks.
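The behavior Konstantin describes is easy to reproduce by hand: du reports occupied filesystem blocks, not logical bytes, so the figure depends on the underlying filesystem (the file path below is just a scratch location):

```shell
# Write exactly 32 KB of incompressible data, then ask du how much space
# it occupies. On most block-oriented filesystems this prints 32; on
# others (e.g. the tmpfs case above) the figure can differ.
dd if=/dev/urandom of=/tmp/du_demo bs=1024 count=32 2>/dev/null
du -k /tmp/du_demo
```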
Re: secondary namenode web interface
Yuri, The NullPointerException should be fixed as Dhruba proposed. We do not have any secondary nn web interface as of today. The http server is used for transferring data between the primary and the secondary. I don't see what we can display on the secondary web UI except for the current status, config values, and the last checkpoint date/time. If you have anything in mind that could be displayed on the UI, please let us know. You could also file a jira for the issue; it would be good if this discussion were reflected in it. Thanks, --Konstantin

dhruba Borthakur wrote: The secondary Namenode uses the HTTP interface to pull the fsimage from the primary. Similarly, the primary Namenode uses dfs.secondary.http.address to pull the checkpointed fsimage back from the secondary to the primary. So, the definition of dfs.secondary.http.address is needed. However, the servlet dfshealth.jsp should not be served from the secondary Namenode. This servlet should be set up in such a way that only the primary Namenode invokes it. Thanks, dhruba

-Original Message- From: Yuri Pradkin [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 10:11 AM To: core-user@hadoop.apache.org Subject: Re: secondary namenode web interface

I'd be happy to file a JIRA for the bug, I just want to make sure I understand what the bug is: is it the misleading null pointer message, or is it that someone is listening on this port and not doing anything useful? I mean, what is the configuration parameter dfs.secondary.http.address for? Unless there are plans to make this interface work, this config parameter should go away, and so should the listening thread, shouldn't they? Thanks, -Yuri

On Friday 04 April 2008 03:30:46 pm dhruba Borthakur wrote: Your configuration is good. The secondary Namenode does not publish a web interface. The null pointer message in the secondary Namenode log is a harmless bug but should be fixed. It would be nice if you could open a JIRA for it.
Thanks, Dhruba

-Original Message- From: Yuri Pradkin [mailto:[EMAIL PROTECTED] Sent: Friday, April 04, 2008 2:45 PM To: core-user@hadoop.apache.org Subject: Re: secondary namenode web interface

I'm re-posting this in the hope that someone can help. Thanks!

On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote: Hi, I'm running Hadoop (latest snapshot) on several machines, and in our setup the namenode and secondarynamenode are on different systems. I see from the logs that the secondary namenode regularly checkpoints the fs from the primary namenode. But when I go to the secondary namenode HTTP address (dfs.secondary.http.address) in my browser I see something like this:

HTTP ERROR: 500 init RequestURI=/dfshealth.jsp Powered by Jetty://

And in the secondary's log I find these lines:

2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp: java.lang.NullPointerException
    at org.apache.hadoop.dfs.dfshealth_jsp.init(dfshealth_jsp.java:21)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
    at java.lang.Class.newInstance0(Class.java:373)
    at java.lang.Class.newInstance(Class.java:326)
    at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
    at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Is something missing from my configuration? Has anybody else seen these? Thanks, -Yuri
Re: secondary namenode web interface
Unfortunately we do not have an api for the secondary nn that would allow browsing the checkpoint. I agree it would be nice to have one. Thanks for filing the issue. --Konstantin Yuri Pradkin wrote: On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote: If you have anything in mind that can be displayed on the UI please let us know. You can also find a jira for the issue, it would be good if this discussion is reflected in it. Well, I guess we could have interface to browse the checkpointed image (actually this is what I was expecting to see), but it's not that big of a deal. Filed https://issues.apache.org/jira/browse/HADOOP-3212 Thanks, -Yuri
Hadoop-Patch build is not progressing for 6 hours
Usually a build takes 2 hours or less. This one is stuck, and I don't see changes in the QUEUE OF PENDING PATCHES when I submit a patch. I guess something is wrong with Hudson. Could anybody please check? --Konstantin
Re: Namenode not a host port pair
You should use host:port rather than just port. See HADOOP-2404 and HADOOP-2185.

Ved Prakash wrote:
Hi friends, I have been trying to start hadoop on the master but it doesn't start the name node on it; checking the logs I found the following error. hadoop-site.xml listing:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ved-desktop:50001</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>ved-desktop:50002</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>50003</value>
  </property>

Should be host:port, not just port.

  <property>
    <name>dfs.http.address</name>
    <value>50004</value>
  </property>

Same here.

  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>50005</value>
  </property>

Same here.

  <property>
    <name>tasktracker.http.address</name>
    <value>50006</value>
  </property>

Same here.

</configuration>

Yesterday I could start the namenode, tasktracker, jobtracker, and secondarynamenode properly, but today it's giving me a problem. What could be the reason? Can anyone help me with this? Thanks
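For reference, a sketch of what the corrected HTTP-address properties might look like. The ports are the ones from Ved's posting; the 0.0.0.0 host (bind to all interfaces) is an assumption here, chosen because a concrete hostname would also work:

```xml
<!-- HTTP server addresses need host:port, not a bare port.
     0.0.0.0 binds to all interfaces; ports are from Ved's posting. -->
<property>
  <name>dfs.secondary.http.address</name>
  <value>0.0.0.0:50003</value>
</property>
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50004</value>
</property>
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50005</value>
</property>
<property>
  <name>tasktracker.http.address</name>
  <value>0.0.0.0:50006</value>
</property>
```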
Re: File size and number of files considerations
Naama Kraus wrote:
Hi, Thanks all for the input. Here are my further questions: I can consolidate data off-line to have big enough files (64M), or copy smaller files to dfs and then consolidate using MapReduce.

You can also let MapReduce read your original small local files and write them into large HDFS files, consolidating them, so to speak, on the fly. Something like distcp (see related issues in Jira) but with custom processing of your inputs.

1. If I choose the first option, would the copy of a 64M file into dfs from a local file system perform well?
2. If I choose the second option, how would one suggest to implement it? I am not sure how I control the size of the reduce output files.
3. I had the impression that dfs splits large files and distributes the splits around. Is that true?

Just wanted to mention that splits are logical, not physical. It is not like hdfs cuts files into pieces and moves them around. You can think of splits as file ranges.

If so, why should I mind if my files are extremely large? Say gigas or even teras? Doesn't dfs take care of it internally and thus scale up in terms of file size? I am quoting from the HDFS architecture document at http://hadoop.apache.org/core/docs/current/hdfs_design.html#Large+Data+Sets : Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.

4. Is there further recommended material to read about these issues? Thanks, Naama

On Mon, Mar 10, 2008 at 6:43 PM, Amar Kamat [EMAIL PROTECTED] wrote:
By chunks I meant the basic unit of processing, i.e. a dfs block. Sorry for the confusion, I should have mentioned it clearly. What I meant was that in the case of files smaller than the default block size, the file becomes the basic unit of computation. Now one can have a very huge file and rely on the dfs block size, but a simpler approach would be to create small files in the beginning itself (if possible). This avoids playing around with the block size and causes less confusion in terms of record boundaries etc. I don't have any specific values for the file sizes, but files with very small sizes will cause lots of maps, which will cause the reducers to be slower. So make sure to have files that form the logical unit of computation and are of a good enough size. Thanks Ted for pointing it out. Amar

On Mon, 10 Mar 2008, Ted Dunning wrote:
Amar's comments are a little strange. Replication occurs at the block level, not the file level. Storing data in a small number of large files or a large number of small files will have less than a factor of two effect on the number of replicated blocks if the small files are 64MB. Files smaller than that will hurt performance due to seek costs. To address Naama's question, you should consolidate your files so that you have files of at least 64 MB, and preferably a bit larger than that. This helps because it allows the reading of the files to proceed in a nice sequential manner, which can greatly increase throughput. If consolidating these files off-line is difficult, it is easy to do in a preliminary map-reduce step. This will incur a one-time cost, but if you are doing multiple passes over the data later, it will be worth it.

On 3/10/08 3:12 AM, Amar Kamat [EMAIL PROTECTED] wrote:
On Mon, 10 Mar 2008, Naama Kraus wrote:
Hi, In our system, we plan to upload data into Hadoop from external sources and use it later on for analysis tasks. The interface to the external repositories allows us to fetch pieces of data in chunks, e.g. get n records at a time. Records are relatively small, though the overall amount of data is assumed to be large. For each repository, we fetch pieces of data in a serial manner. The number of repositories is small (a few of them). My first step is to put the data in plain files in HDFS. My question is what file sizes are optimal to use. Many small files (to the extent of each record in a file)? I guess not. Few huge files, each holding all data of the same type? Or maybe put each chunk we get in a separate file, and close it right after the chunk was uploaded?

I think it should be based more on the size of the data you want to process in a map, which I think here is the chunk size, no? The larger the file, the fewer the replicas and hence the more network transfers in case of more maps. In the case of smaller file sizes the NN will be a bottleneck, but you will end up having more replicas for each map task and hence more locality. Amar

How would HDFS perform best, with few large files or more smaller files? As I wrote, we plan to run MapReduce jobs over the data in the files in order to organize the data and analyze it. Thanks for any help, Naama
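The off-line consolidation option Ted and Naama discuss above can be sketched as follows. This is an illustrative script, not part of Hadoop: it packs many small record files into a few large chunk files of roughly the HDFS block size (64 MB in the thread) before they are copied into HDFS. The function and file names are made up for the example; whole files are kept intact, so record boundaries within a file are never split across chunks.

```python
# Sketch of off-line consolidation: concatenate many small record files
# into large chunk files (~64 MB, one HDFS block) before uploading to HDFS.
# Names and layout here are illustrative, not from the thread.
import os

CHUNK_SIZE = 64 * 1024 * 1024  # target chunk size in bytes (one HDFS block)

def consolidate(small_files, out_dir, chunk_size=CHUNK_SIZE):
    """Concatenate small files into sequentially numbered chunk files,
    starting a new chunk once the current one reaches chunk_size.
    Returns the list of chunk file paths written."""
    os.makedirs(out_dir, exist_ok=True)
    chunks, out, written = [], None, 0
    for path in small_files:
        with open(path, "rb") as f:
            data = f.read()
        # Open a new chunk on the first file or when the current chunk is full.
        if out is None or written >= chunk_size:
            if out is not None:
                out.close()
            chunk_path = os.path.join(out_dir, "chunk-%05d" % len(chunks))
            chunks.append(chunk_path)
            out = open(chunk_path, "wb")
            written = 0
        out.write(data)  # a small file is never split across chunks
        written += len(data)
    if out is not None:
        out.close()
    return chunks
```

The resulting chunk files could then be copied into HDFS with `bin/hadoop fs -put`, giving the nice sequential reads Ted describes.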
Re: Namenode fails to re-start after cluster shutdown
André, you can try to roll back. You did use upgrade when you switched to the new trunk, right? --Konstantin

Raghu Angadi wrote:
André Martin wrote:
Hi Raghu, done: https://issues.apache.org/jira/browse/HADOOP-2873 Subsequent tries did not succeed, so it looks like I need to re-format the cluster :-(

Please back up the log files and name node image files if you can before re-formatting. Raghu.
Re: dfsadmin reporting wrong disk usage numbers
Yes, please file a bug. There are file systems with different block sizes out there, whether Linux or Solaris. Thanks, --Konstantin

Martin Traverso wrote:
I think I found the issue. The class org.apache.hadoop.fs.DU assumes 1024-byte blocks when reporting usage information:

this.used = Long.parseLong(tokens[0])*1024;

This works fine on Linux, but on Solaris and Mac OS the reported number of blocks is based on 512-byte blocks. The solution is simple: DU should use du -sk instead of du -s. Should I file a bug for this? Martin
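The portable fix Martin proposes can be sketched outside of Hadoop as follows. The point is that `du -s` reports usage in platform-dependent blocks (1024-byte on Linux, 512-byte on Solaris and Mac OS), while `du -sk` always reports kilobytes, so multiplying by 1024 is then safe everywhere. The helper function name is made up for the example:

```python
# Illustrative sketch of the `du -sk` fix: parse kilobytes, not raw blocks,
# so the result is correct regardless of the platform's du block size.
import subprocess

def disk_usage_bytes(path):
    """Return the disk usage of `path` in bytes, using `du -sk` so the
    first token is always kilobytes rather than platform-sized blocks."""
    out = subprocess.run(["du", "-sk", path],
                         capture_output=True, text=True, check=True).stdout
    kilobytes = int(out.split()[0])  # first token is the size in KB
    return kilobytes * 1024
```

This mirrors the one-line change Martin suggests for org.apache.hadoop.fs.DU: switch the command to `du -sk` and keep the `*1024` multiplier.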