Re: FSDataInputStream.read(byte[]) only reads to a block boundary?
This seems to be the case. I don't think there is any specific reason not to read across the block boundary... Even if HDFS did read across blocks, it would still not be a good idea to ignore the JavaDoc for read(). If you want all the bytes read, then you should use a while loop or one of the readFully() variants. For example, if you later change your code by wrapping a BufferedInputStream around 'in', you would still get partial reads even if HDFS read all the data. Raghu. forbbs forbbs wrote: The hadoop version is 0.19.0. My file is larger than 64MB, and the block size is 64MB. The output of the code below is '10'. May I read across the block boundary? Or should I use 'while (left..){}' style code?

public static void main(String[] args) throws IOException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  FSDataInputStream fin = fs.open(new Path(args[0]));
  fin.seek(64*1024*1024 - 10);
  byte[] buffer = new byte[32*1024];
  int len = fin.read(buffer);
  //int len = fin.read(buffer, 0, 128);
  System.out.println(len);
  fin.close();
}
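The while-loop pattern recommended above is plain java.io, not HDFS-specific. Here is a minimal sketch (class and method names are mine, not from the thread) that works for any InputStream, including FSDataInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FullRead {
    // Keep calling read() until the buffer is full or EOF is reached;
    // a single read() call is allowed to return fewer bytes than requested.
    public static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                break; // EOF before the buffer filled up
            }
            off += n;
        }
        return off; // number of bytes actually read
    }
}
```

With a loop like this, the 64MB-boundary example above would keep reading past the 10 bytes left in the block instead of stopping there. Hadoop's own readFully() variants implement the same discipline.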
Re: HDFS Random Access
Yes, FSDataInputStream allows random access. There are two ways to read x bytes at a position p: 1) in.seek(p); in.read(buf, 0, x); 2) in.read(p, buf, 0, x); These two have slightly different semantics. The second one is preferred and is easier for HDFS to optimize further. Random access should be pretty good with HDFS, and it is getting more users and thus more importance. HBase is one of the users. Just yesterday I attached a benchmark and comparisons to random access on the native filesystem to https://issues.apache.org/jira/browse/HDFS-236 . As of now, the overhead on average is about 2 ms over the 9-10 ms it takes for a native read. There are a few fairly simple fixes possible to reduce this gap. I think getFileStatus() is the way to find the length, though there might have been a call added to FSDataInputStream recently; I am not sure. Raghu. tsuraan wrote: All the documentation for HDFS says that it's for large streaming jobs, but I couldn't find an explicit answer to this, so I'll try asking here. How is HDFS's random seek performance within an FSDataInputStream? I use lucene with a lot of indices (potentially thousands), so I was thinking of putting them into HDFS and reimplementing my search as a Hadoop map-reduce. I've noticed that lucene tends to do a bit of random seeking when searching, though; I don't believe that it guarantees that all seeks be to increasing file positions either. Would HDFS be a bad fit for an access pattern that involves seeks to random positions within a stream? Also, is getFileStatus the typical way of getting the length of a file in HDFS, or is there some method on FSDataInputStream that I'm not seeing? Please cc: me on any reply; I'm not on the hadoop list. Thanks!
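The key semantic difference is that the positioned read does not move the stream's current offset, so several threads can share one open handle. You don't need HDFS to see the pattern; the sketch below (my example, using java.nio's FileChannel, whose positional read has the same semantics) illustrates it:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PositionedRead {
    // Read len bytes starting at pos without disturbing the channel's
    // current position (analogous to FSDataInputStream.read(pos, buf, 0, len)).
    public static byte[] readAt(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position()); // positional read
            if (n < 0) {
                break; // EOF
            }
        }
        return buf.array();
    }
}
```

Because the cursor is untouched, a second thread doing seek()+read() on the same handle is not disturbed by calls to readAt().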
Re: UnknownHostException
This is at the RPC client level, and there is no requirement for a fully qualified hostname. Maybe the '.' at the end of '10.2.24.21.' is causing the problem? Btw, in 0.21 even fs.default.name does not need to be a fully qualified name; anything that resolves to an IP address is fine (at least for common/FS and HDFS). Raghu. Matt Massie wrote: fs.default.name in your hadoop-site.xml needs to be set to a fully-qualified domain name (instead of an IP address) -Matt On Jun 23, 2009, at 6:42 AM, bharath vissapragada wrote: when I try to execute the command bin/start-dfs.sh, I get the following error. I have checked the hadoop-site.xml file on all the nodes, and they are fine... can someone help me out! 10.2.24.21: Exception in thread main java.net.UnknownHostException: unknown host: 10.2.24.21. 10.2.24.21: at org.apache.hadoop.ipc.Client$Connection.init(Client.java:195) 10.2.24.21: at org.apache.hadoop.ipc.Client.getConnection(Client.java:779) 10.2.24.21: at org.apache.hadoop.ipc.Client.call(Client.java:704) 10.2.24.21: at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) 10.2.24.21: at org.apache.hadoop.dfs.$Proxy4.getProtocolVersion(Unknown Source) 10.2.24.21: at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) 10.2.24.21: at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306) 10.2.24.21: at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343) 10.2.24.21: at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
Re: UnknownHostException
Raghu Angadi wrote: This is at RPC client level and there is requirement for fully qualified hostname. I meant to say there is NO requirement for a fully qualified hostname. Btw, in 0.21 even fs.default.name does not need to be fully qualified; that fix is probably in 0.20 too. Raghu. [...]
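For reference, a minimal fs.default.name setting of the kind discussed above might look like this in hadoop-site.xml (host and port are illustrative; note there is no trailing dot on the address):

```xml
<property>
  <name>fs.default.name</name>
  <!-- a hostname works; on 0.20/0.21 a plain IP such as hdfs://10.2.24.21:54310 is fine too -->
  <value>hdfs://namenode.example.com:54310</value>
</property>
```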
Re: Too many open files error, which gets resolved after some time
Stas Oskin wrote: Hi. Any idea if calling System.gc() periodically will help reduce the number of pipes / epolls? Since you have HADOOP-4346, you should not have excessive epoll/pipe fds open. First of all, do you still have the problem? If yes, how many hadoop streams do you have at a time? System.gc() won't help if you have HADOOP-4346. Raghu. Thanks for your opinion! 2009/6/22 Stas Oskin stas.os...@gmail.com Ok, seems this issue is already patched in the Hadoop distro I'm using (Cloudera). Any idea if I still should call GC manually/periodically to clean out all the stale pipes / epolls? 2009/6/22 Steve Loughran ste...@apache.org Stas Oskin wrote: Hi. So what would be the recommended approach to the pre-0.20.x series? To ensure each file is used only by one thread, and then it is safe to close the handle in that thread? Regards. Good question; I'm not sure. For anything you get with FileSystem.get(), it's now dangerous to close, so try just setting the reference to null and hoping that GC will do the finalize() when needed
Re: Too many open files error, which gets resolved after some time
To be more accurate, once you have HADOOP-4346, fds for epoll and pipes = 3 * threads blocked on Hadoop I/O. Unless you have hundreds of threads at a time, you should not see hundreds of these. These fds stay up to 10 sec even after the threads exit. I am a bit confused about your exact situation. Please check the number of threads if you are still facing the problem. Raghu. Raghu Angadi wrote: since you have HADOOP-4346, you should not have excessive epoll/pipe fds open. First of all do you still have the problem? If yes, how many hadoop streams do you have at a time? System.gc() won't help if you have HADOOP-4346. Raghu. [...]
Re: Too many open files error, which gets resolved after some time
How many threads do you have? The number of active threads is very important. Normally, #fds = (3 * #threads_blocked_on_io) + #streams. 12 per stream is certainly way off. Raghu. Stas Oskin wrote: Hi. In my case it was actually ~12 fds per stream, which included pipes and epolls. Could it be that HDFS opens 3 x 3 (input - output - epoll) fds per each thread, which would make it close to the number I mentioned? Or is it always 3 at maximum per thread / stream? Up to 10 sec looks like the correct number; it seems they do get freed around that time indeed. Regards. 2009/6/23 Raghu Angadi rang...@yahoo-inc.com To be more accurate, once you have HADOOP-4346, fds for epoll and pipes = 3 * threads blocked on Hadoop I/O. Unless you have hundreds of threads at a time, you should not see hundreds of these. [...]
Re: Too many open files error, which gets resolved after some time
Is this before 0.20.0? Assuming you have closed these streams, it is mostly https://issues.apache.org/jira/browse/HADOOP-4346 . It is the JDK internal implementation that depends on GC to free up its cache of selectors. HADOOP-4346 avoids this by using Hadoop's own cache. Raghu. Stas Oskin wrote: Hi. After tracing some more with the lsof utility, I managed to stop the growth on the DataNode process, but I still have issues with my DFS client. It seems that my DFS client opens hundreds of pipes and eventpolls. Here is a small part of the lsof output:

java 10508 root 387w FIFO 0,6 6142565 pipe
java 10508 root 388r FIFO 0,6 6142565 pipe
java 10508 root 389u 0,10 0 6142566 eventpoll
java 10508 root 390u FIFO 0,6 6135311 pipe
java 10508 root 391r FIFO 0,6 6135311 pipe
java 10508 root 392u 0,10 0 6135312 eventpoll
java 10508 root 393r FIFO 0,6 6148234 pipe
java 10508 root 394w FIFO 0,6 6142570 pipe
java 10508 root 395r FIFO 0,6 6135857 pipe
java 10508 root 396r FIFO 0,6 6142570 pipe
java 10508 root 397r 0,10 0 6142571 eventpoll
java 10508 root 398u FIFO 0,6 6135319 pipe
java 10508 root 399w FIFO 0,6 6135319 pipe

I'm using FSDataInputStream and FSDataOutputStream, so this might be related to pipes? So, my questions are: 1) What causes these pipes/epolls to appear? 2) More important, how can I prevent their accumulation and growth? Thanks in advance! 2009/6/21 Stas Oskin stas.os...@gmail.com Hi. I have an HDFS client and an HDFS datanode running on the same machine. When I try to access a dozen files at once from the client, several times in a row, I start to receive the following errors on the client and in the HDFS browse function. HDFS client: Could not get block locations. Aborting... HDFS browse: Too many open files. I can increase the maximum number of files that can be opened, as I have it set to the default 1024, but I would like to first solve the problem, as a larger value just means it would run out of files again later on. So my questions are: 1) Does the HDFS datanode keep any files open, even after the HDFS client has already closed them? 2) Is it possible to find out who keeps the files open - datanode or client (so I could pin-point the source of the problem)? Thanks in advance!
Re: Too many open files error, which gets resolved after some time
64k might help in the sense that you might hit GC before you hit the limit. Otherwise, your only options are to use the patch attached to HADOOP-4346 or run System.gc() occasionally. I think it should be committed to 0.18.4. Raghu. Stas Oskin wrote: Hi. Yes, it happens with 0.18.3. I'm now closing every FSData stream I receive from HDFS, so the number of open fds in the DataNode is reduced. The problem is that my own DFS client still has a high number of fds open, mostly pipes and epolls. They sometimes quickly drop to the level of ~400-500, and sometimes just get stuck at ~1000. I'm still trying to find out how well it behaves if I set the maximum fd number to 65K. Regards. 2009/6/22 Raghu Angadi rang...@yahoo-inc.com Is this before 0.20.0? Assuming you have closed these streams, it is mostly https://issues.apache.org/jira/browse/HADOOP-4346 It is the JDK internal implementation that depends on GC to free up its cache of selectors. HADOOP-4346 avoids this by using Hadoop's own cache. [...]
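Much of the fd pressure described in this thread comes down to closing every stream deterministically instead of relying on GC/finalize. A small sketch of that discipline (plain java.io stand-ins here, names mine; the same applies to FSDataInputStream/FSDataOutputStream):

```java
import java.io.IOException;
import java.io.OutputStream;

public class WriteAndClose {
    // try-with-resources guarantees close() runs even if write() throws,
    // so the fd is released immediately rather than at some future GC.
    public static void writeAll(OutputStream rawOut, byte[] data) throws IOException {
        try (OutputStream out = rawOut) {
            out.write(data);
            out.flush();
        }
    }
}
```

On JVMs without try-with-resources (as in the 0.18 era), the equivalent is a try/finally block whose finally clause calls close().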
Re: Disk Usage Overhead of Hadoop Upgrade
The initial overhead is fairly small (an extra hard link for each file). After that, the overhead grows as you delete files (and thus their blocks) that existed before the upgrade, since the physical files for blocks are deleted only after you finalize. So the overhead == (the blocks that got deleted after the upgrade). Raghu. Stu Hood wrote: Hey gang, We're preparing to upgrade our cluster from Hadoop 0.15.3 to 0.18.3. How much disk usage overhead can we expect from the block conversion before we finalize the upgrade? In the worst case, will the upgrade cause our disk usage to double? Thanks, Stu Hood Search Team Technical Lead Email Apps Division, Rackspace Hosting
Re: HDFS data transfer!
Thanks Brian for the good advice. Slightly off topic from the original post: there will be occasions where it is necessary or better to copy different portions of a file in parallel (distcp can benefit a lot). There is a proposal to let HDFS 'stitch' multiple files into one: something like NameNode.stitchFiles(Path to, Path[] files). This way a very large file can be copied more efficiently (with a map/reduce job, for example). Another use case is high-latency, high-bandwidth connections (like coast-to-coast). High latency can be somewhat worked around by using large buffers for TCP connections, but usually users don't have that control. It is just simpler to use multiple connections. This will obviously be an HDFS-only interface (i.e. not a FileSystem method), at least initially. Raghu. Brian Bockelman wrote: Hey Sugandha, Transfer rates depend on the quality/quantity of your hardware and the quality of the client disk that is generating the data. I usually say that you should expect near-hardware-bottleneck speeds for an otherwise idle cluster. There should be no make it fast required (though you should review the logs for errors if it's going slow). I would expect a 5GB file to take around 3-5 minutes to write on our cluster, but it's a well-tuned and operational cluster. As Todd (I think) mentioned before, we can't help any when you say I want to make it faster. You need to provide diagnostic information - logs, Ganglia plots, stack traces, something - that folks can look at. Brian On Jun 10, 2009, at 2:25 AM, Sugandha Naolekar wrote: But if I want to make it fast, then??? I want to place the data in HDFS and replicate it in a fraction of seconds. Can that be possible, and how? On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena kartik@gmail.com wrote: I would suppose about 2-3 hours. It took me some 2 days to load a 160 GB file. Secura On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar sugandha@gmail.com wrote: Hello! If I try to transfer a 5GB VDI file from a remote host (not a part of the hadoop cluster) into HDFS, and get it back, how much time is it supposed to take? No map-reduce involved. Simply writing files in and out of HDFS through a simple Java program (usage of APIs). -- Regards! Sugandha
Re: Multiple NIC Cards
I still need to go through the whole thread, but we feel your pain. First, please try setting fs.default.name to the namenode's internal IP on the datanodes. This should make the NN attach the internal IP to the datanodes (assuming your routing is correct). The NameNode webUI should then list internal IPs for the datanodes. You might have to temporarily change the NameNode code to listen on 0.0.0.0. That said, the issues you are facing are pretty unfortunate. As Steve mentioned, Hadoop is all confused about hostname/IP, and there is unnecessary reliance on hostnames and reverse DNS lookups in many, many places. At least fairly straightforward setups with multiple NICs should be handled well. dfs.datanode.dns.interface should work like you expected (but I am not very surprised it didn't). Another thing you could try is setting dfs.datanode.address to the internal IP address (this might already be discussed in the thread). This should at least make all the bulk data transfers happen over the internal NICs. One way to make sure is to hover over the datanode on the NameNode webUI; it shows the IP address. Good luck. It might be better to document your pains and findings in a Jira (with most of the details in one or more comments rather than in the description). Raghu. John Martyniak wrote: So I changed all of the 0.0.0.0 entries on one machine to point to the 192.168.1.102 address. And still it picks up the hostname and IP address of the external network. I am kind of at my wits' end with this, as I am not seeing a solution yet, except to take the machines off of the external network and leave them on the internal network, which isn't an option. Has anybody had this problem before? What was the solution? -John On Jun 9, 2009, at 10:17 AM, Steve Loughran wrote: One thing to consider is that some of the various services of Hadoop are bound to 0:0:0:0, which means every IPv4 address. You really want to bring up everything, including the jetty services, on the en0 network adapter, by binding them to 192.168.1.102; this will cause anyone trying to talk to them over the other network to fail, which at least finds the problem sooner rather than later
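Pulling the suggestions in this thread together, the datanode-side settings involved would look roughly like this (interface name and addresses are illustrative, and as the reply notes, dfs.datanode.dns.interface may not behave as expected on all setups):

```xml
<!-- hadoop-site.xml on each datanode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.10:54310</value>
  <!-- namenode's internal IP -->
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>192.168.1.102:50010</value>
  <!-- bind bulk data transfer to the internal NIC -->
</property>
<property>
  <name>dfs.datanode.dns.interface</name>
  <value>en0</value>
  <!-- interface whose address is reported to the NN -->
</property>
```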
Re: Command-line jobConf options in 0.18.3
Tom White wrote: Actually, the space is needed, to be interpreted as a Hadoop option by ToolRunner. Without the space it sets a Java system property, which Hadoop will not automatically pick up. I don't think the space is required. Something like -Dfs.default.name=host:port works. I don't see ToolRunner setting any Java properties. Ian, try putting the options after the classname and see if that helps. Otherwise, it would be useful to see a snippet of the program code. Right. The options that go to the class should appear after the class name. Note that it is not necessary to use ToolRunner (which I don't find very convenient in many cases). You can use GenericOptionsParser directly. An example: https://issues.apache.org/jira/browse/HADOOP-5961 Raghu. Thanks, Tom On Thu, Jun 4, 2009 at 8:23 PM, Vasyl Keretsman vasi...@gmail.com wrote: Perhaps there should not be a space between -D and your option? -Dprise.collopts= Vasyl 2009/6/4 Ian Soboroff ian.sobor...@nist.gov: bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer input output The 'prise.collopts' option doesn't appear in the JobConf. Ian Aaron Kimball aa...@cloudera.com writes: Can you give an example of the exact arguments you're sending on the command line? - Aaron On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote: If after I call getConf to get the conf object I manually add the key/value pair, it's there when I need it. So it feels like ToolRunner isn't parsing my args for some reason. Ian On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote: Yes, and I get the JobConf via 'JobConf job = new JobConf(conf, the.class)'. The conf is the Configuration object that comes from getConf. Pretty much copied from the WordCount example (which this program used to be a long while back...) thanks, Ian On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote: Are you running your program via ToolRunner.run()? How do you instantiate the JobConf object? - Aaron On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov wrote: I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and try to pass options with -D on the command line, the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed through with -D isn't there. This worked in 0.19.1... did something change with command-line options from 18 to 19? Thanks, Ian
Re: Cluster Setup Issues : Datanode not being initialized.
Did you try 'telnet 198.55.35.229 54310' from the datanode? The log shows that it is not able to connect to master:54310. Being able to ssh from the datanode does not matter. Raghu. asif md wrote: I can SSH both ways, i.e. from master to slave and from slave to master. The datanode is getting initialized at the master, but the log at the slave looks like this: / 2009-06-04 15:20:06,066 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = STARTUP_MSG: args = [] STARTUP_MSG: version = 0.18.3 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009 / 2009-06-04 15:20:08,826 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 0 time(s). 2009-06-04 15:20:09,829 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 1 time(s). 2009-06-04 15:20:10,831 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 2 time(s). 2009-06-04 15:20:11,832 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 3 time(s). 2009-06-04 15:20:12,834 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 4 time(s). 2009-06-04 15:20:13,837 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 5 time(s). 2009-06-04 15:20:14,840 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 6 time(s). 2009-06-04 15:20:15,841 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 7 time(s). 2009-06-04 15:20:16,844 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 8 time(s).
2009-06-04 15:20:17,847 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/198.55.35.229:54310. Already tried 9 time(s). 2009-06-04 15:20:17,873 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Call to master/198.55.35.229:54310 failed on local exception: java.net.NoRouteToHostException: No route to host at org.apache.hadoop.ipc.Client.wrapException(Client.java:751) at org.apache.hadoop.ipc.Client.call(Client.java:719) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.dfs.$Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:335) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:372) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:309) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:286) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:277) at org.apache.hadoop.dfs.DataNode.init(DataNode.java:223) at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3071) at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:3026) at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:3034) at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3156) Caused by: java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:301) at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:178) at org.apache.hadoop.ipc.Client.getConnection(Client.java:820) at org.apache.hadoop.ipc.Client.call(Client.java:705) ... 13 more 2009-06-04 15:20:17,874 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down DataNode at *** **88 Please suggest. Asif. 
On Thu, Jun 4, 2009 at 4:15 PM, asif md asif.d...@gmail.com wrote: @Ravi Thanks Ravi. I'm now using a defined tmp dir, so the second issue is resolved. But I have ssh keys that have passwords. I am able to ssh to the slave and master from the master; should I be able to do that from the slave as well? @ALL Any suggestions? Thanks, Asif. On Thu, Jun 4, 2009 at 3:17 PM, Ravi Phulari rphul...@yahoo-inc.com wrote: From the logs it looks like your Hadoop cluster is facing two different issues. At the slave: 1. exception: java.net.NoRouteToHostException: No route to host in your logs. Diagnosis - One of your nodes cannot be reached correctly. Make sure you can ssh to your master and
Re: Renaming all nodes in Hadoop cluster
Renaming datanodes should not affect HDFS. HDFS does not depend on hostname or ip for consistency of data. You can try renaming a few of the nodes. Of course, if you rename NameNode, you need to update the config file to reflect that. Stuart White wrote: Is it possible to rename all nodes in a Hadoop cluster and not lose the data stored on hdfs? Of course I'll need to update the master and slaves files, but I'm not familiar with how hdfs tracks where it has written all the splits of the files. Is it possible to retain the data written to hdfs when renaming all nodes in the cluster, and if so, what additional configuration changes, if any, are required?
Re: Question on HDFS write performance
Can you post the patch for these measurements? I can guess where these are measured, but it is better to see the actual changes. For example, the third datanode does only two things: receiving data and writing it to disk. So the average block writing time for it should be around the sum of these two (~6-7k), but it is much larger (CRC verification should not affect it much). Not sure why that is the case. How many simultaneous maps are you running? Looking at the first datanode's stats, the fact that it spends a lot more time writing to disk compared to receiving suggests you are mostly hard-disk bound. Raghu. Martin Mituzas wrote: I need to identify the bottleneck of my current cluster when running io-bound benchmarks. I run a test with 4 nodes: 1 node as job tracker and namenode, 3 nodes as task trackers and data nodes. I run RandomWriter to generate 30G of data with 15 mappers, and then run Sort on the generated data with 15 reducers. Replication is 3. I added log code to the HDFS code and then analyzed the generated logs for the RandomWriter period and the Sort period. The result is as follows. I measured the following values: 1) average block preparation time: the time DFSClient spent to generate all packets for a block. 2) average block writing time: from the time DFSClient gets an allocated block from the namenode in nextBlockOutputStream() to when all acks are received. 3) average network receiving time and average disk writing time for a block for the first, second, and third datanode in the pipeline.
RandomWriter: total 528 blocks, total size 34063336366, full blocks (64M): 506
Average block preparation time by client: 11456.21
Average writing time for one block (64M): 11931.49
Average time on No.0 target datanode: network receiving 112.44, disk writing 3035.04
Average time on No.1 target datanode: network receiving 3337.68, disk writing 2950.74
Average time on No.2 target datanode: network receiving 3171.18, disk writing 2646.38

Sort: total 494 blocks, total size 32318504139, full blocks (64M): 479
Average block preparation time by client: 16237.59
Average writing time for one block (64M): 16642.67
Average time on No.0 target datanode: network receiving 164.28, disk writing 3331.50
Average time on No.1 target datanode: network receiving 2125.62, disk writing 3436.32
Average time on No.2 target datanode: network receiving 2856.56, disk writing 3426.04

And my question is why the network receiving time on the third node is larger than on the other two nodes. Another question: how to identify the bottlenecks? Or you can tell me what other kinds of values should be collected. Thanks in advance.
Re: InputStream.open() efficiency
Stas Oskin wrote: Hi. Thanks for the answer. Would keeping a handle open for up to 5 minutes cause any issues? 5 min should not cause any issues. And same about writing? Writing is not affected by the couple of issues I mentioned. Writing over a long time should work as well as writing over a shorter time. Raghu. Regards. 2009/5/26 Raghu Angadi rang...@yahoo-inc.com 'in.seek(); in.read()' is certainly better than 'in = fs.open(); in.seek(); in.read()'. The difference is exactly one open() call, so you would save an RPC to the NameNode. There are a couple of issues that affect apps that keep the handles open for a very long time (many hours to days), but those will be fixed soon. Raghu. Stas Oskin wrote: Hi. I'm looking to find out how InputStream.open() + skip() compares to keeping a handle to the InputStream and just seeking to the position. Has anyone compared these approaches, and can you advise on their speed? Regards.
Re: Circumventing Hadoop's data placement policy
As a hack, you could tunnel NN traffic from the GridFTP clients through a different machine (by changing fs.default.name). Alternately, these clients could use a SOCKS proxy. The amount of traffic to the NN is not much, and tunneling should not affect performance. Raghu. Brian Bockelman wrote: Hey all, Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers which are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS. They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance, because data comes in at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (200MB/s). Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node? Brian
Re: Could only be replicated to 0 nodes, instead of 1
I think you should file a jira on this. Most likely this is what is happening:
* two out of 3 dns cannot take any more blocks.
* While picking nodes for a new block, the NN mostly skips the third dn as well, since '# active writes' on it is larger than '2 * avg'.
* Even if there is only one other block being written on the 3rd, 1 is still greater than (2 * 1/3).
To test this, if you write just one block to an idle cluster it should succeed. Writing from the client on the 3rd dn succeeds since the local node is always favored. This particular problem is not that severe on a large cluster, but HDFS should do the sensible thing. Raghu. Stas Oskin wrote: Hi. I'm testing Hadoop in our lab, and started getting the following message when trying to copy a file: Could only be replicated to 0 nodes, instead of 1 I have the following setup:
* 3 machines, 2 of them with only 80GB of space, and 1 with 1.5GB
* Two clients are copying files all the time (one of them is the 1.5GB machine)
* The replication is set to 2
* I let the space on the 2 smaller machines run out, to test the behavior
Now, one of the clients (the one located on the 1.5GB machine) works fine, and the other one - the external one - is unable to copy and displays the error + the exception below. Any idea if this is expected in my scenario? Or how it can be solved? Thanks in advance.
09/05/21 10:51:03 WARN dfs.DFSClient: NotReplicatedYetException sleeping /test/test.bin retries left 1
09/05/21 10:51:06 WARN dfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /test/test.bin could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922)
09/05/21 10:51:06 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0]
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2153)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1745)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1899)
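Raghu's arithmetic can be written down directly. The sketch below is only an illustration of the described heuristic (skip a node whose active-write count exceeds twice the cluster-wide average), not the actual NameNode code; names are mine:

```java
public class PlacementCheck {
    // Heuristic from the explanation above: the NN skips a datanode whose
    // '# active writes' is larger than '2 * avg' across the cluster.
    public static boolean skipped(int activeWritesOnNode, int totalActiveWrites, int numDatanodes) {
        double avg = (double) totalActiveWrites / numDatanodes;
        return activeWritesOnNode > 2 * avg;
    }
}
```

With 3 datanodes and a single write in flight on the only node with space, 1 > 2 * (1/3), so that node gets skipped too; on an idle cluster (0 active writes) the first block succeeds, which matches Raghu's suggested test.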
Re: Could only be replicated to 0 nodes, instead of 1
Brian Bockelman wrote: On May 21, 2009, at 2:01 PM, Raghu Angadi wrote: I think you should file a jira on this. Most likely this is what is happening:
* two out of 3 dns cannot take any more blocks.
* While picking nodes for a new block, the NN mostly skips the third dn as well, since '# active writes' on it is larger than '2 * avg'.
* Even if there is only one other block being written on the 3rd, 1 is still greater than (2 * 1/3).
To test this, if you write just one block to an idle cluster it should succeed. Writing from the client on the 3rd dn succeeds since the local node is always favored. This particular problem is not that severe on a large cluster, but HDFS should do the sensible thing. Hey Raghu, If this analysis is right, I would add it can happen even on large clusters! I've seen this error on our cluster when we're very full (97%) and very few nodes have any empty space. This usually happens because we have two very large nodes (10x bigger than the rest of the cluster), and HDFS tends to distribute writes randomly -- meaning the smaller nodes fill up quickly, until the balancer can catch up. Yes. This would bite whenever a large portion of the nodes cannot accept blocks; in general, it can happen whenever fewer than half the nodes have any space left. Raghu.
Re: Could only be replicated to 0 nodes, instead of 1
Stas Oskin wrote: I think you should file a jira on this. Most likely this is what is happening: Here it is - hope it's ok: https://issues.apache.org/jira/browse/HADOOP-5886 Looks good. I will add my earlier post as a comment. You could update the jira with any more tests. Next time, it would be better to include larger stack traces, logs, etc. in subsequent comments rather than in the description. Thanks, Raghu.
Re: How to replace the storage on a datanode without formatting the namenode?
jason hadoop wrote: Raghu, your technique will only work well if you can complete steps 1-4 in less than the datanode timeout interval, which may be valid for Alexandra. I believe the timeout is 10 minutes. Correct, I should have mentioned it as well: the process should finish within 10 minutes. Alternately, the user could stop the NameNode or the entire cluster during this time, or swap steps (2) and (3). Raghu. If you pass the timeout interval the namenode will start to rebalance the blocks, and when the datanode comes back it will delete all of the blocks it has rebalanced. On Thu, May 14, 2009 at 11:35 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Along these lines, an even simpler approach I would think is:
1) set data.dir to local and create the data.
2) stop the datanode.
3) rsync local_dir network_dir.
4) start the datanode with data.dir set to network_dir.
There is no need to format or rebalance. This way you can switch between local and network multiple times (without needing to rsync the data, if no changes are made in the tests). Raghu. Alexandra Alecu wrote: Another possibility I am thinking about now, which is suitable for me as I do not actually have much data stored in the cluster when I want to perform this switch, is to set the replication level really high and then simply remove the local storage locations and restart the cluster. With a bit of luck the high level of replication will allow a full recovery of the cluster on restart. Is this something that you would advise? Many thanks, Alexandra.
Re: How to replace the storage on a datanode without formatting the namenode?
Along these lines, an even simpler approach I would think is:
1) set data.dir to local and create the data.
2) stop the datanode.
3) rsync local_dir network_dir.
4) start the datanode with data.dir set to network_dir.
There is no need to format or rebalance. This way you can switch between local and network multiple times (without needing to rsync the data, if no changes are made in the tests). Raghu. Alexandra Alecu wrote: Another possibility I am thinking about now, which is suitable for me as I do not actually have much data stored in the cluster when I want to perform this switch, is to set the replication level really high and then simply remove the local storage locations and restart the cluster. With a bit of luck the high level of replication will allow a full recovery of the cluster on restart. Is this something that you would advise? Many thanks, Alexandra.
Re: public IP for datanode on EC2
Philip Zeyliger wrote: You could use ssh to set up a SOCKS proxy between your machine and EC2, and set org.apache.hadoop.net.SocksSocketFactory to be the socket factory. http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/ has more information. Very useful write-up. Regarding the problem with reverse DNS mentioned there (that's why you had to add a DNS record for the internal ip): it is fixed in https://issues.apache.org/jira/browse/HADOOP-5191 (for HDFS access at least). Some mapred parts are still affected (HADOOP-5610). Depending on reverse DNS should be avoided. Ideally, setting fs.default.name to the internal ip should just work for clients, both internally and externally (through proxies). Raghu.
Re: Suggestions for making writing faster? DFSClient waiting while writing chunk
It should not be waiting unnecessarily. But the client has to, if any of the datanodes in the pipeline is not able to receive the data as fast as the client is writing. IOW, writing goes only as fast as the slowest of the nodes involved in the pipeline (1 client and 3 datanodes). But depending on what your case is, you could probably benefit by increasing the buffer (the number of unacked packets); it would depend on where the DataStreamer thread is blocked. Raghu. stack wrote: Writing a file, our application spends a load of time here:
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2964)
        - locked 0x7f11054c2b68 (a java.util.LinkedList)
        - locked 0x7f11054c24c0 (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        - locked 0x7f11054c24c0 (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
        - locked 0x7f11054c24c0 (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
        - locked 0x7f11054c24c0 (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        - locked 0x7f1105694f28 (a org.apache.hadoop.fs.FSDataOutputStream)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1020)
        - locked 0x7f1105694e98 (a org.apache.hadoop.io.SequenceFile$Writer)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:984)
Here is the code from around line 2964 in writeChunk.
// If queue is full, then wait till we can create enough space
while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) {
  try {
    dataQueue.wait();
  } catch (InterruptedException e) {
  }
}
The queue of packets is full and we're waiting for it to be cleared. Any suggestions for how I might get the DataStreamer to act more promptly clearing the packet queue? This is the hadoop 0.20 branch. It's a small cluster but relatively lightly loaded (so says ganglia). Thanks, St.Ack
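The wait Raghu points at is back-pressure: writeChunk blocks while dataQueue plus ackQueue hold more than maxPackets packets, and the DataStreamer frees slots as it sends and acks packets. Here is a self-contained sketch of that shape; it is not the actual DFSClient code, the names merely mirror the snippet above:

```java
import java.util.LinkedList;

// Minimal sketch of the producer side of DFSOutputStream's packet queue:
// the writer blocks while the queues are over maxPackets, and is woken
// when the (streamer-side) dequeue frees a slot.
public class PacketQueue {
    private final LinkedList<byte[]> dataQueue = new LinkedList<>();
    private final LinkedList<byte[]> ackQueue = new LinkedList<>();
    private final int maxPackets;
    private boolean closed = false;

    public PacketQueue(int maxPackets) {
        this.maxPackets = maxPackets;
    }

    // Writer side: same condition as the snippet above.
    public synchronized void enqueue(byte[] packet) throws InterruptedException {
        while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) {
            wait();
        }
        dataQueue.addLast(packet);
    }

    // Streamer side: taking a packet frees a slot and wakes the writer.
    public synchronized byte[] dequeue() {
        byte[] p = dataQueue.pollFirst();
        notifyAll();
        return p;
    }

    public synchronized int size() {
        return dataQueue.size();
    }
}
```

This is also why increasing the buffer (maxPackets, i.e. the number of unacked packets) gives the writer more slack before it blocks; it helps only if the DataStreamer is keeping up on average.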
Re: Huge DataNode Virtual Memory Usage
what do 'jmap' and 'jmap -histo:live' show? Raghu. Stefan Will wrote: Chris, Thanks for the tip ... However, I'm already running 1.6.0_10: java version 1.6.0_10 Java(TM) SE Runtime Environment (build 1.6.0_10-b33) Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode) Do you know of a specific bug # in the JDK bug database that addresses this? Cheers, Stefan From: Chris Collins ch...@scoutlabs.com Reply-To: core-user@hadoop.apache.org Date: Fri, 8 May 2009 20:34:21 -0700 To: core-user@hadoop.apache.org Subject: Re: Huge DataNode Virtual Memory Usage Stefan, there was a nasty memory leak in 1.6.x before 1.6.0_10. It manifested itself during major GC. We saw this on Linux and Solaris, and it dramatically improved with an upgrade. C On May 8, 2009, at 6:12 PM, Stefan Will wrote: Hi, I just ran into something rather scary: one of my datanode processes that I'm running with -Xmx256M, and a maximum number of Xceiver threads of 4095, had a virtual memory size of over 7GB (!). I know that the VM size on Linux isn't necessarily equal to the actual memory used, but I wouldn't expect it to be an order of magnitude higher either. I ran pmap on the process, and it showed around 1000 thread stack blocks of roughly 1MB each (which is the default size on the 64-bit JDK). The largest block was 3GB in size, which I can't figure out what it is for. Does anyone have any insights into this? Anything that can be done to prevent this other than to restart the DFS regularly? -- Stefan
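A back-of-the-envelope check on the pmap numbers: roughly 1000 thread stacks at the 64-bit default of 1 MB each already account for about 1 GB of virtual size, before the heap and the unexplained 3 GB block. A trivial sketch of that arithmetic (names mine):

```java
public class VmEstimate {
    // Virtual-size contribution of thread stacks alone:
    // thread count times the per-thread stack reservation.
    public static long threadStackBytes(int threads, long stackBytesPerThread) {
        return threads * stackBytesPerThread;
    }
}
```

With the 4095 Xceiver-thread cap mentioned in the post, the stack reservation alone could in principle reach ~4 GB of virtual size, which is one reason a small-heap datanode can still show a huge VSZ.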
Re: Is HDFS protocol written from scratch?
Philip Zeyliger wrote: It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense is that it's a custom protocol because Hadoop's IPC mechanism isn't optimized for large messages. Yes, and job classes are not distributed using this. It is a very simple protocol used to read and write raw data to DataNodes. -- Philip On Thu, May 7, 2009 at 9:11 AM, Foss User foss...@gmail.com wrote: I understand that the blocks are transferred between various nodes using the HDFS protocol. I believe even the job classes are distributed as files using the same HDFS protocol. Is this protocol written over TCP/IP from scratch, or is it a protocol that works on top of some other protocol like HTTP, etc.?
Re: java.io.EOFException: while trying to read 65557 bytes
Albert Sunwoo wrote: Thanks for the info! I was hoping to get some more specific information though. In short: we need more info. There are typically 4 machines/processes involved in a write: the client and the 3 datanodes writing the replicas. To see what really happened, you need to provide the error message(s) for this block on these other parts (at least the 3 datanodes should be useful). This particular error just implies that this datanode is the 2nd of the 3 datanodes (assuming replication of 3) in the write pipeline, and that its connection from the 1st datanode was closed. To deduce more we need more info, starting with what happened to that block on the first datanode. Also, the 3rd datanode is 10.102.0.106, and the block you should grep for in the other logs is blk_-7056150840276493498, etc. You should try to work out what would be useful information for others to diagnose the problem... more than likely you will find the cause yourself in the process. Raghu. We are seeing these occur during every run, and as such it's not leaving some folks in our organization with a good feeling about the reliability of HDFS. Do these occur as a result of resources being unavailable? Perhaps the nodes are too busy and can no longer service reads from other nodes? Or are the jobs causing too much network traffic? At first glance the machines do not seem to be pinned, however I am wondering if sudden bursts of jobs could be causing these as well. If so, does anyone have configuration recommendations to minimize or remove these errors under any of these circumstances, or perhaps there is another explanation? Thanks, Albert On 5/5/09 11:34 AM, Raghu Angadi rang...@yahoo-inc.com wrote: This can happen for example when a client is killed while it has some files open for write. In that case it is an expected error (the log should really be at WARN or INFO level). Raghu.
Albert Sunwoo wrote: Hello Everyone, I know there's been some chatter about this before, but I am seeing the errors below on just about every one of our nodes. Is there a definitive reason why these are occurring? Is there something that we can do to prevent these?
2009-05-04 21:35:11,764 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.102.0.105:50010, storageID=DS-991582569-127.0.0.1-50010-1240886381606, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
Followed by:
2009-05-04 21:35:20,891 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7056150840276493498_10885 1 Exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.102.0.105:37293 remote=/10.102.0.106:50010]. 59756 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:277)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853)
        at java.lang.Thread.run(Thread.java:619)
Thanks, Albert
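Raghu's suggestion amounts to a grep for the block id across the client log and all three datanode logs, then reading the hits in time order. A trivial self-contained sketch of that filter (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockGrep {
    // Collect every log line that mentions the given block id, so the
    // block's history can be reconstructed across client and datanode logs.
    public static List<String> linesForBlock(List<String> logLines, String blockId) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains(blockId)) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

In practice the same thing is done with grep over the log files on each machine; the point is to gather the per-machine views of one block (here blk_-7056150840276493498) before guessing at a cause.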
Re: Namenode failed to start with FSNamesystem initialization failed error
Tamir Kamara wrote: Hi Raghu, The thread you posted is my original post, written when this problem first happened on my cluster. I can file a JIRA, but I wouldn't be able to provide information other than what I already posted, and I don't have the logs from that time. Should I still file? Yes. Jira is a better place for tracking and fixing bugs. I am pretty sure what you saw is a bug (either already filed or one that needs to be fixed). Raghu. Thanks, Tamir On Tue, May 5, 2009 at 9:14 PM, Raghu Angadi rang...@yahoo-inc.com wrote: Tamir, Please file a jira on the problem you are seeing with 'saveLeases'. In the past there have been multiple fixes in this area (HADOOP-3418, HADOOP-3724, and more mentioned in HADOOP-3724). Also refer to the thread you started: http://www.mail-archive.com/core-user@hadoop.apache.org/msg09397.html I think another user reported the same problem recently. These are indeed very serious and very annoying bugs. Raghu. Tamir Kamara wrote: I didn't have a space problem which led to it (I think). The corruption started after I bounced the cluster. At the time, I tried to investigate what led to the corruption, but didn't find anything useful in the logs besides this line: saveLeases found path /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_02_0/part-2 but no matching entry in namespace I also tried to recover from the secondary namenode files, but the corruption was too wide-spread and I had to format. Tamir On Mon, May 4, 2009 at 4:48 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. Same conditions - where the space ran out and the fs got corrupted? Or did it get corrupted by itself (which is even more worrying)? Regards. 2009/5/4 Tamir Kamara tamirkam...@gmail.com I had the same problem a couple of weeks ago with 0.19.1. Had to reformat the cluster too... On Mon, May 4, 2009 at 3:50 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. After rebooting the NameNode server, I found out that the NameNode doesn't start anymore. The logs contained this error: FSNamesystem initialization failed I suspected filesystem corruption, so I tried to recover from the SecondaryNameNode. Problem is, it was completely empty! I had an issue that might have caused this - the root mount ran out of space. But both the NameNode and the SecondaryNameNode directories were on another mount point with plenty of space - so it's very strange that they were impacted in any way. Perhaps the logs, which were located on the root mount and as a result could not be written, have caused this? To get HDFS running again, I had to format the HDFS (including manually erasing the files from the DataNodes). While this is reasonable in a test environment, production-wise it would be very bad. Any idea why it happened, and what can be done to prevent it in the future? I'm using the stable 0.18.3 version of Hadoop. Thanks in advance!
Re: Namenode failed to start with FSNamesystem initialization failed error
Stas, This is indeed a serious issue. Did you happen to store the corrupt image? Can this be reproduced using the image? Usually you can recover manually from a corrupt or truncated image. But more importantly, we want to find how it got into this state. Raghu. Stas Oskin wrote: Hi. This is quite a worrying issue. Can anyone advise on this? I'm really concerned it could appear in production and cause a huge data loss. Is there any way to recover from this? Regards.
Re: Namenode failed to start with FSNamesystem initialization failed error
Tamir, Please file a jira on the problem you are seeing with 'saveLeases'. In the past there have been multiple fixes in this area (HADOOP-3418, HADOOP-3724, and more mentioned in HADOOP-3724). Also refer to the thread you started: http://www.mail-archive.com/core-user@hadoop.apache.org/msg09397.html I think another user reported the same problem recently. These are indeed very serious and very annoying bugs. Raghu.
Re: java.io.EOFException: while trying to read 65557 bytes
This can happen for example when a client is killed while it has some files open for write. In that case it is an expected error (the log should really be at WARN or INFO level). Raghu. Albert Sunwoo wrote: Hello Everyone, I know there's been some chatter about this before, but I am seeing the errors below on just about every one of our nodes. Is there a definitive reason why these are occurring? Is there something that we can do to prevent these?
2009-05-04 21:35:11,764 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.102.0.105:50010, storageID=DS-991582569-127.0.0.1-50010-1240886381606, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
Followed by:
2009-05-04 21:35:20,891 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7056150840276493498_10885 1 Exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.102.0.105:37293 remote=/10.102.0.106:50010]. 59756 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:277)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853)
        at java.lang.Thread.run(Thread.java:619)
Thanks, Albert
Re: Namenode failed to start with FSNamesystem initialization failed error
Stas Oskin wrote: Actually, we discovered today an annoying bug in our test-app, which might have moved some of the HDFS files on the cluster, including the metadata files. Oops! Presumably it could have removed the image file itself. I presume it could be the possible reason for such behavior? :) Certainly. It could lead to many different failures. If you had the stack trace of the exception, it would be clearer what the error was this time. Raghu. 2009/5/5 Stas Oskin stas.os...@gmail.com Hi Raghu. The only lead I have is that my root mount filled up completely. This in itself should not have caused the metadata corruption, as the metadata was stored on another mount point, which had plenty of space. But perhaps the fact that the NameNode/SecondaryNameNode didn't have enough space for logs caused this? Unfortunately I was pressed for time to get the cluster up and running, and didn't preserve the logs or the image. If this happens again - I will surely do so. Regards.
Re: Namenode failed to start with FSNamesystem initialization failed error
The image is stored in two files: fsimage and edits (under namenode-directory/current/). Stas Oskin wrote: Well, it definitely caused the SecondaryNameNode to crash, and also seems to have triggered some strange issues today as well. By the way, how is the image file named?
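For reference, the name directory of 0.18/0.19-era Hadoop looks roughly like this (a sketch; the exact set of files can vary by release):

```text
${dfs.name.dir}/
  current/
    fsimage   <- checkpointed namespace image
    edits     <- transaction log replayed on top of fsimage at startup
    fstime    <- timestamp of the last checkpoint
    VERSION   <- layoutVersion, namespaceID, etc.
```

Copying `current/` aside while the NameNode is down is a cheap way to preserve a (possibly corrupt) image for later analysis.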
Re: No route to host prevents from storing files to HDFS
There is some mismatch here. What is the expected ip address of this machine (or does it have multiple interfaces and proper routing)? Looking at the Receiving Block message, the DN thinks its address is 192.168.253.20 but the NN thinks it is 253.32 (and the client is able to connect using 253.32). If you want to find the destination ip that this DN is unable to connect to, you can check the client's log for this block number. Stas Oskin wrote: Hi. 2009/4/22 jason hadoop jason.had...@gmail.com Most likely that machine is affected by some firewall somewhere that prevents traffic on port 50075. The no route to host is a strong indicator, particularly if the DataNode registered with the namenode. Yes, this was my first thought as well. But there is no firewall, and the port can be connected to via netcat from any other machine. Any other idea? Thanks.
Re: No route to host prevents from storing files to HDFS
Stas Oskin wrote: Tried in step 3 to telnet both the 50010 and the 8010 ports of the problematic datanode - both worked. Shouldn't you be testing connecting _from_ the datanode? The error you posted occurs while this DN is trying to connect to another DN. Raghu. I agree there is indeed an interesting problem :). Question is how it can be solved. Thanks.
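To test connectivity in the direction that matters here, a quick probe can be run *from* the DataNode host against the peer address and port taken from the client's log. This is a minimal plain-Java sketch (host and port on the command line are placeholders), not a Hadoop utility:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal TCP reachability probe; run it from the DataNode that logs
// "No route to host", against the peer DataNode's ip:port.
public class ProbePort {
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            // NoRouteToHostException, ConnectException, timeout, ...
            return false;
        }
    }

    public static void main(String[] args) {
        // e.g. java ProbePort 192.168.253.20 50010
        System.out.println(canConnect(args[0], Integer.parseInt(args[1]), 3000));
    }
}
```

A "no route to host" from this probe, when netcat from other machines succeeds, would point at routing or interface binding on the DataNode itself rather than a firewall on the target.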
Re: More Replication on dfs
Aseem, Regarding over-replication, it is mostly an app-related issue as Alex mentioned. But if you are concerned about under-replicated blocks in the fsck output: These blocks should not stay under-replicated if you have enough nodes and enough space on them (check the NameNode webui). Try grep-ing for one of the blocks in the NameNode log (and datanode logs as well, since you have just 3 nodes). Raghu. Puri, Aseem wrote: Alex, Output of $ bin/hadoop fsck / command after running HBase data insert command in a table is: . . . . . /hbase/test/903188508/tags/info/4897652949308499876: Under replicated blk_-5193695109439554521_3133. Target Replicas is 3 but found 1 replica(s). . /hbase/test/903188508/tags/mapfiles/4897652949308499876/data: Under replicated blk_-1213602857020415242_3132. Target Replicas is 3 but found 1 replica(s). . /hbase/test/903188508/tags/mapfiles/4897652949308499876/index: Under replicated blk_3934493034551838567_3132. Target Replicas is 3 but found 1 replica(s). . /user/HadoopAdmin/hbase table.doc: Under replicated blk_4339521803948458144_1031. Target Replicas is 3 but found 2 replica(s). . /user/HadoopAdmin/input/bin.doc: Under replicated blk_-3661765932004150973_1030. Target Replicas is 3 but found 2 replica(s). . /user/HadoopAdmin/input/file01.txt: Under replicated blk_2744169131466786624_1001. Target Replicas is 3 but found 2 replica(s). . /user/HadoopAdmin/input/file02.txt: Under replicated blk_2021956984317789924_1002. Target Replicas is 3 but found 2 replica(s). . /user/HadoopAdmin/input/test.txt: Under replicated blk_-3062256167060082648_1004. Target Replicas is 3 but found 2 replica(s). ... /user/HadoopAdmin/output/part-0: Under replicated blk_8908973033976428484_1010. Target Replicas is 3 but found 2 replica(s). Status: HEALTHY Total size: 48510226 B Total dirs: 492 Total files: 439 (Files currently being written: 2) Total blocks (validated): 401 (avg. 
block size 120973 B) (Total open file blocks (not validated): 2) Minimally replicated blocks: 401 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 399 (99.50124 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 2 Average block replication: 1.3117207 Corrupt blocks: 0 Missing replicas: 675 (128.327 %) Number of data-nodes: 2 Number of racks: 1 The filesystem under path '/' is HEALTHY Please tell me what is wrong. Aseem -Original Message- From: Alex Loddengaard [mailto:a...@cloudera.com] Sent: Friday, April 10, 2009 11:04 PM To: core-user@hadoop.apache.org Subject: Re: More Replication on dfs Aseem, How are you verifying that blocks are not being replicated? Have you run fsck? *bin/hadoop fsck /* I'd be surprised if replication really wasn't happening. Can you run fsck and pay attention to Under-replicated blocks and Mis-replicated blocks? In fact, can you just copy-paste the output of fsck? Alex On Thu, Apr 9, 2009 at 11:23 PM, Puri, Aseem aseem.p...@honeywell.comwrote: Hi I also tried the command $ bin/hadoop balancer. But still the same problem. Aseem -Original Message- From: Puri, Aseem [mailto:aseem.p...@honeywell.com] Sent: Friday, April 10, 2009 11:18 AM To: core-user@hadoop.apache.org Subject: RE: More Replication on dfs Hi Alex, Thanks for sharing your knowledge. So far I have three machines, and I have to check the behavior of Hadoop, so I want the replication factor to be 2. I started my Hadoop server with replication factor 3. After that I uploaded 3 files to implement the word count program. But as all my files are stored on one machine and replicated to the other datanodes also, my map reduce program takes input from one Datanode only. I want my files to be on different data nodes so as to check the functionality of map reduce properly. Also, before starting my Hadoop server again with replication factor 2, I formatted all Datanodes and deleted all the old data manually. Please suggest what I should do now. 
Regards, Aseem Puri -Original Message- From: Mithila Nagendra [mailto:mnage...@asu.edu] Sent: Friday, April 10, 2009 10:56 AM To: core-user@hadoop.apache.org Subject: Re: More Replication on dfs To add to the question, how does one decide the optimal replication factor for a cluster? For instance, what would be the appropriate replication factor for a cluster consisting of 5 nodes? Mithila On Fri, Apr 10, 2009 at 8:20 AM, Alex Loddengaard a...@cloudera.com wrote: Did you load any files when replication was set to 3? If so, you'll have to rebalance: http://hadoop.apache.org/core/docs/r0.19.1/commands_manual.html#balancer http://hadoop.apache.org/core/docs/r0.19.1/hdfs_user_guide.html#Rebalancer Note that most people run HDFS with a replication factor of 3. There have been cases when clusters running with a replication of 2 discovered new bugs, because replication
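For a concrete version of the fix being discussed: the default replication for *new* files is set on the client side in hadoop-site.xml (the value 2 below only mirrors the setup in this thread). Files already written with a target of 3 keep that target until it is changed explicitly, e.g. with `bin/hadoop dfs -setrep -R 2 /`.

```xml
<!-- hadoop-site.xml (client side); applies only to files created afterwards -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

With only 2-3 datanodes, a target of 3 can never be satisfied, which is exactly why fsck keeps reporting those blocks as under-replicated.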
Re: DataXceiver Errors in 0.19.1
It need not be anything to worry about. Do you see anything at the user level (task, job, copy, or script) fail because of this? On a distributed system with many nodes, there will be some errors on some of the nodes for various reasons (load, hardware, reboot, etc). HDFS usually should work around them (because of multiple replicas). In this particular case, the client is trying to write some data and one of the DataNodes writing a replica might have gone down. HDFS should recover from it and write to the rest of the nodes. Please check if the write actually succeeded. Raghu. Tamir Kamara wrote: Hi, I've recently upgraded to 0.19.1 and now there are some DataXceiver errors in the datanode logs. There are also messages about interruption while waiting for IO. Both messages are below. Can I do something to fix it? Thanks, Tamir 2009-04-13 09:57:20,334 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.14.3:50010, storageID=DS-727246419-127.0.0.1-50010-1234873914501, infoPort=50075, ipcPort=50020):DataXceiver java.io.EOFException: while trying to read 65557 bytes at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103) at java.lang.Thread.run(Unknown Source) 2009-04-13 09:57:20,333 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_8486030874928774495_54856 1 Exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.14.3:50439 remote=/192.168.14.7:50010]. 
58972 millis timeout left. at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:277) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) at java.io.DataInputStream.readFully(Unknown Source) at java.io.DataInputStream.readLong(Unknown Source) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853) at java.lang.Thread.run(Unknown Source)
Re: Changing block size of hadoop
Aaron Kimball wrote: Blocks already written to HDFS will remain at their current size. Blocks are immutable objects. That procedure would set the size used for all subsequently-written blocks. I don't think you can change the block size while the cluster is running, because that would require the NameNode and DataNodes to re-read their configurations, which they only do at startup. - Aaron Block size is a client-side configuration. The NameNode and DataNodes don't need to restart. In this particular case, even if the client's config is changed, MR may not use the new config for a partial set of maps or reducers. Raghu. On Sun, Apr 12, 2009 at 6:08 AM, Rakhi Khatwani rakhi.khatw...@gmail.comwrote: Hi, I would like to know if it is feasible to change the block size of Hadoop while map reduce jobs are executing? And if not, would the following work? 1. stop map-reduce 2. stop hbase 3. stop hadoop 4. change hadoop-site.xml to reduce the block size 5. restart all Will the data in the hbase tables be safe and automatically split after changing the block size of hadoop? Thanks, Raakhi
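Since the block size is read from the client's configuration at file-creation time, changing it means updating the client's hadoop-site.xml (or the `Configuration` object of the writing process); existing blocks keep their size, and no daemon restart is needed. In the 0.19-era configs the property is `dfs.block.size`, in bytes:

```xml
<!-- hadoop-site.xml on the *client*; affects files created after the change -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB; must be a multiple of io.bytes.per.checksum -->
</property>
```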
Re: swap hard drives between datanodes
IP Address mismatch should not matter. What is the actual error you saw? The mismatch might be unintentional. Raghu. Mike Andrews wrote: I tried swapping two hot-swap sata drives between two nodes in a cluster, but it didn't work: after restart, one of the datanodes shut down since the namenode said it reported a block belonging to another node, which I guess the namenode thinks is a fatal error. Is this caused by the hadoop/datanode/current/VERSION file having the IP address and other ID information of the datanode hard-coded? It'd be great to be able to do a manual gross cluster rebalance by just physically swapping hard drives, but it seems this is not possible in the current version 0.18.3.
Re: swap hard drives between datanodes
Raghu Angadi wrote: IP Address mismatch should not matter. What is the actual error you saw? The mismatch might be unintentional. The reason I say the ip address should not matter is that if you change the ip address of a datanode, it should still work correctly. Raghu. Mike Andrews wrote: I tried swapping two hot-swap sata drives between two nodes in a cluster, but it didn't work: after restart, one of the datanodes shut down since the namenode said it reported a block belonging to another node, which I guess the namenode thinks is a fatal error. Is this caused by the hadoop/datanode/current/VERSION file having the IP address and other ID information of the datanode hard-coded? It'd be great to be able to do a manual gross cluster rebalance by just physically swapping hard drives, but it seems this is not possible in the current version 0.18.3.
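For context, the identity Mike asks about lives in `current/VERSION` under each storage directory. The fields below are from the 0.18-era layout with made-up values; note that the `storageID` embeds the ip and port the node registered with, which is what the namenode uses to tie block reports to a storage:

```text
# <dfs.data.dir>/current/VERSION (values hypothetical)
namespaceID=123456789
storageID=DS-727246419-192.168.0.5-50010-1234873914501
cTime=0
storageType=DATA_NODE
layoutVersion=-16
```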
Re: Socket closed Exception
If it is the NameNode, then there is probably a log message about closing the socket around that time. Raghu. lohit wrote: Recently we are seeing a lot of Socket closed exceptions in our cluster. Many tasks' open/create/getFileInfo calls get back 'SocketException' with message 'Socket closed'. We seem to see many tasks fail with the same error around the same time. There are no warning or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS 0.15) Are there cases where the NameNode closes a socket due to heavy load or contention for resources of any kind? Thanks, Lohit
Re: about dfsadmin -report
stchu wrote: But when the web-ui shows the node dead, -report still shows it in service and the living nodes=3 (in web-ui: living=2 dead=1). Please file a jira and describe how to reproduce it in as much detail as you can in a comment. Thanks, Raghu. stchu 2009/3/26 Raghu Angadi rang...@yahoo-inc.com stchu wrote: Hi, I did a test with a datanode crash. I stopped the networking on one of the datanodes. The web app and fsck reported that datanode dead after 10 mins. But dfsadmin -report did not report that even after 25 mins. Is this correct? Nope. Both the web-ui and '-report' use the same source of info. They should be consistent. Raghu. Thanks for your guide. stchu
Re: about dfsadmin -report
stchu wrote: Hi, I did a test with a datanode crash. I stopped the networking on one of the datanodes. The web app and fsck reported that datanode dead after 10 mins. But dfsadmin -report did not report that even after 25 mins. Is this correct? Nope. Both the web-ui and '-report' use the same source of info. They should be consistent. Raghu. Thanks for your guide. stchu
Re: hadoop need help please suggest
What is the scale you are thinking of (10s, 100s or more nodes)? The memory for metadata at the NameNode you mentioned is the main issue with small files. There are multiple alternatives for dealing with that; this issue has been discussed many times here. Also, please use the core-user@ id alone for asking for help; you don't need to send to core-devel@. Raghu. snehal nagmote wrote: Hello Sir, I have some doubts, please help me. We have a requirement for a scalable storage system. We have developed an agro-advisory system in which farmers will send crop pictures, particularly in sequential manner - some 6-7 photos of 3-4 KB each would be stored in the storage server, and these photos would be read sequentially by a scientist to detect the problem; writing to the images would not be done. So for storing these images we are using the hadoop file system - is it feasible to use the hadoop file system for this purpose? Also, as the images are only 3-4 KB and hadoop reads data in blocks of size 64 MB, how can we increase the performance? What are the tricks and tweaks that should be done to use hadoop for such a purpose? The next problem: as hadoop stores all the metadata in memory, can we use some mechanism to store the files in blocks of some greater size? Because the files would be of small size, it will store lots of metadata and will overflow the main memory. Please suggest what could be done. regards, Snehal
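The scale of the metadata problem is easy to ballpark with the usual rule of thumb of very roughly 150 bytes of NameNode heap per namespace object (one object per file plus one per block). That figure is an approximation, not a documented constant. A sketch:

```java
// Back-of-envelope NameNode heap estimate for many small files.
// Assumes ~150 bytes per namespace object (file or block) -- a commonly
// quoted approximation, not an exact figure.
public class NameNodeMemoryEstimate {
    static final long BYTES_PER_OBJECT = 150;

    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // inodes + block objects
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million 4 KB photos, one block each: ~3 GB of heap just for
        // metadata, versus only ~40 GB of actual image data.
        System.out.println(estimateBytes(10_000_000L, 1));
    }
}
```

This is why packing many small images into larger container files (e.g. SequenceFiles or archives) is the standard remedy: it divides the object count, and hence the heap cost, by the packing factor.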
Re: hadoop configuration problem hostname-ip address
You need https://issues.apache.org/jira/browse/HADOOP-5191 I don't know why there has been no response to the simple patch I attached. Alternately, you could use the hostname that it expects instead of the ip address. Raghu. snehal nagmote wrote: Hi, I am using Hadoop version 0.19. I set up a hadoop cluster for testing purposes with 1 master and one slave. I set up keyless ssh between the master node and all the slave nodes, as well as modified /etc/hosts on all nodes so hostname lookup works. I am using ip addresses for the namenode and datanode rather than hostnames, as with hostnames the datanode was not coming up. But when I try to execute any job or program, it gives me the following exception: FAILED Error initializing java.lang.IllegalArgumentException: Wrong FS: hdfs://172.16.6.102:21011/user/root/test expected: hdfs://namnodemc:21011 Can you please help me out - from where is it taking the hostname? Regards, Snehal Nagmote IIIT-H
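The 'Wrong FS' check compares the authority (host and port) of the path's URI against `fs.default.name` textually, so an ip in one place and a hostname in the other triggers exactly this exception. Until the patch above is applied, keeping every config on the same form - here the hostname from the error message - is the safer setup:

```xml
<!-- hadoop-site.xml, identical on clients and cluster nodes;
     hostname/port taken from the error message in this thread -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namnodemc:21011</value>
</property>
```

Any fully-qualified paths used by jobs then need to use `hdfs://namnodemc:21011/...` as well, not the ip.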
Re: DataNode stops cleaning disk?
Igor, A few things you could do (may be better to file a jira, with a short description and more details in follow-up comments): 1) Pick one of the block ids from the open files and grep for it in the DataNode and NameNode logs (there is one log file for each day). 2) Pick one of the over-replicated blocks (if the above block is not one of them) and trace it in the NameNode log. 3) Take a jstack of the datanode in this state. Since you still have over-replicated blocks, you probably have more datanodes in early stages of this problem. Igor Bolotin wrote: Caught this issue again on one of the clusters. DF and DU sizes match very closely with the information reported by the dfsadmin command. If the DN reports the space properly, then the original problem you reported - that the DataNode runs out of disk - should not happen. You can follow up with more investigation if you are interested. The other alternative is to upgrade to the latest 0.19.x to see if the problem persists. There have been many fixes since 0.19.0. Raghu. Lsof reports some 1000 open files in the DFS data directories on the problematic datanode, but the total size of the open files is only about 10GB. I can't really track the space usage down to individual files - there are way too many files/blocks for detailed analysis. Here is something interesting - fsck before the datanode restart reports a very significant number of over-replicated blocks (~10% of blocks are over-replicated): Status: HEALTHY Total size: 1472758591906 B (Total open files size: 29050588133 B) Total dirs: 58431 Total files: 375703 (Files currently being written: 418) Total blocks (validated): 387205 (avg. 
block size 3803562 B) (Total open file blocks (not validated): 595) Minimally replicated blocks: 387205 (100.0 %) Over-replicated blocks: 38782 (10.015883 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 Average block replication: 3.1003888 Corrupt blocks: 0 Missing replicas: 0 (0.0 %) Number of data-nodes: 7 Number of racks: 1 After datanode restart - the over-replicated blocks are practically gone: Status: HEALTHY Total size: 1310669475298 B (Total open files size: 29535016933 B) Total dirs: 59431 Total files: 377177 (Files currently being written: 387) Total blocks (validated): 386661 (avg. block size 3389712 B) (Total open file blocks (not validated): 607) Minimally replicated blocks: 386661 (100.0 %) Over-replicated blocks: 272 (0.070345856 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 Average block replication: 3.0007036 Corrupt blocks: 0 Missing replicas: 0 (0.0 %) Number of data-nodes: 7 Number of racks: 1 What might be the cause of the over-replication? Best regards, Igor -Original Message- From: Igor Bolotin Sent: Monday, March 09, 2009 2:50 PM To: core-user@hadoop.apache.org Subject: RE: DataNode stops cleaning disk? My mistake about the 'current' directory - that's the one that consumes all the disk space, and 'du' on that directory matches exactly with the namenode web ui reported size. I'm waiting for the next time this happens to collect more details, but ever since I wrote the first email - everything works perfectly well (another application of Murphy's law). Thanks, Igor -Original Message- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Thursday, March 05, 2009 12:06 PM To: core-user@hadoop.apache.org Subject: Re: DataNode stops cleaning disk? Igor Bolotin wrote: That's what I saw just yesterday on one of the data nodes with this situation (will confirm also next time it happens): - Tmp and current were either empty or almost empty last time I checked. 
- du on the entire data directory matched exactly with the reported used space in the NameNode web UI, and it did report that it uses most of the available disk space. - Nothing else was using disk space (actually - it's a dedicated DFS cluster). If the 'du' command (you can run it in the shell) counts properly, then you should be able to see which files are taking space. If 'du' can't, but 'df' reports much less space available, then it is possible (though never seen) that the datanode is keeping a lot of these files open. 'ls -l /proc/datanode-pid/fd' lists these files. If it is not the datanode, then check lsof to find who is holding these files. Hope this helps. Raghu. Thank you for help! Igor -Original Message- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Thursday, March 05, 2009 11:05 AM To: core-user@hadoop.apache.org Subject: Re: DataNode stops cleaning disk? This is unexpected unless some other process is eating up space. A couple of things to collect next time (along with the log): - All the contents under datanode-directory/ (especially including 'tmp
Re: Not a host:port pair when running balancer
Doug Cutting wrote: Konstantin Shvachko wrote: Clarifying: the port # is missing in your configuration; it should be <property> <name>fs.default.name</name> <value>hdfs://hvcwydev0601:8020</value> </property> where 8020 is your port number. That's the work-around, but it's a bug. One should not need to specify the default port number (8020). Please file an issue in Jira. Yes, Balancer should use NameNode.getAddress(conf) to get the NameNode address. Raghu.
Re: Error while putting data onto hdfs
Amandeep Khurana wrote: My dfs.datanode.socket.write.timeout is set to 0. This had to be done to get Hbase to work. ah.. I see, we should fix that. Not sure how others haven't seen it till now. Affects only those with write.timeout set to 0 on the clients. Since setting it to 0 is itself a workaround, please change it to some extremely large value for now. Raghu. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 11, 2009 at 10:23 AM, Raghu Angadi rang...@yahoo-inc.comwrote: Did you change dfs.datanode.socket.write.timeout to 5 seconds? The exception message says so. It is extremely small. The default is 8 minutes and is intentionally pretty high. Its purpose is mainly to catch extremely unresponsive datanodes and other network issues. Raghu. Amandeep Khurana wrote: I was trying to put a 1 gig file onto HDFS and I got the following error: 09/03/10 18:23:16 WARN hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/171.69.102.53:34414 remote=/ 171.69.102.51:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162) at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) at java.io.BufferedOutputStream.write(Unknown Source) at java.io.DataOutputStream.write(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209) 09/03/10 18:23:16 WARN hdfs.DFSClient: Error Recovery for block blk_2971879428934911606_36678 bad datanode[0] 171.69.102.51:50010 put: All datanodes 171.69.102.51:50010 are bad. Aborting... 
Exception closing file /user/amkhuran/221rawdata/1g java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243) at org.apache.hadoop.fs.FsShell.close(FsShell.java:1842) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1856) Whats going wrong? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: HDFS is corrupt, need to salvage the data.
Mayuran, It takes very long for a lot of iterations if we have to go through each debugging step one at a time. Maybe a jira is a good place. - Run fsck with the blocks option. - Check if those ids match the ids in the file names found by 'find'. - Check which directory these files are in, and verify that it matches the datanode's configured directory. You are saying there is nothing wrong in the log files, but does that imply that the datanode sees those 157 missing blocks? Maybe you should post the log or verify that yourself. If the DN is working correctly according to you, then you should not have 100% of blocks missing. There are many possibilities; it is not easy for me to list the right one in your case without more info, or to list all possible conditions. Raghu. Mayuran Yogarajah wrote: Mayuran Yogarajah wrote: Raghu Angadi wrote: The block files usually don't disappear easily. Check on the datanode if you find any files starting with blk. Also check the datanode log to see what happened there... maybe you started on a different directory or something like that. Raghu. There are indeed blk files: find -name 'blk*' | wc -l 158 I didn't see anything out of the ordinary in the datanode log. At this point is there anything I can do to recover the files? Or do I need to reformat the data node and load the data in again? thanks Sorry to resend this but I didn't receive a response and wanted to know how to proceed. Is it possible to recover the data at this stage? Or is it gone? thanks
Re: Error while putting data onto hdfs
Amandeep Khurana wrote: What happens if you set it to 0? How is it a workaround? HBase needs it in pre-0.19.0 (related story: http://www.nabble.com/Datanode-Xceivers-td21372227.html). It should not matter if you move to 0.19.0 or newer. And how would it matter if I change it to a large value? A very large value like 100 years is the same as setting it to 0 (for all practical purposes). Raghu. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 11, 2009 at 12:00 PM, Raghu Angadi rang...@yahoo-inc.comwrote: Amandeep Khurana wrote: My dfs.datanode.socket.write.timeout is set to 0. This had to be done to get Hbase to work. ah.. I see, we should fix that. Not sure how others haven't seen it till now. Affects only those with write.timeout set to 0 on the clients. Since setting it to 0 is itself a workaround, please change it to some extremely large value for now. Raghu. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 11, 2009 at 10:23 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Did you change dfs.datanode.socket.write.timeout to 5 seconds? The exception message says so. It is extremely small. The default is 8 minutes and is intentionally pretty high. Its purpose is mainly to catch extremely unresponsive datanodes and other network issues. Raghu. Amandeep Khurana wrote: I was trying to put a 1 gig file onto HDFS and I got the following error: 09/03/10 18:23:16 WARN hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for write. 
ch : java.nio.channels.SocketChannel[connected local=/171.69.102.53:34414 remote=/ 171.69.102.51:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162) at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) at java.io.BufferedOutputStream.write(Unknown Source) at java.io.DataOutputStream.write(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209) 09/03/10 18:23:16 WARN hdfs.DFSClient: Error Recovery for block blk_2971879428934911606_36678 bad datanode[0] 171.69.102.51:50010 put: All datanodes 171.69.102.51:50010 are bad. Aborting... Exception closing file /user/amkhuran/221rawdata/1g java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243) at org.apache.hadoop.fs.FsShell.close(FsShell.java:1842) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1856) Whats going wrong? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Error while putting data onto hdfs
Raghu Angadi wrote: Amandeep Khurana wrote: My dfs.datanode.socket.write.timeout is set to 0. This had to be done to get Hbase to work. ah.. I see, we should fix that. Not sure how others haven't seen it till now. Affects only those with write.timeout set to 0 on the clients. Filed: https://issues.apache.org/jira/browse/HADOOP-5464 Since setting it to 0 is itself a workaround, please change it to some extremely large value for now. Raghu.
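As a concrete version of the workaround: the value is in milliseconds, so "extremely large" can be something like a day. 0 disables the timeout entirely, which is what trips the bug tracked above; a huge finite value behaves the same in practice without hitting it:

```xml
<!-- hadoop-site.xml on the client; dfs.datanode.socket.write.timeout is in ms.
     0 means "no timeout" (the problematic setting); a very large finite
     value is effectively equivalent until HADOOP-5464 is fixed. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>86400000</value> <!-- 24 hours -->
</property>
```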
Re: HDFS is corrupt, need to salvage the data.
Mayuran Yogarajah wrote: lohit wrote: How many Datanodes do you have? From the output it looks like at the point when you ran fsck, you had only one datanode connected to your NameNode. Did you have others? Also, I see that your default replication is set to 1. Can you check if your datanodes are up and running? Lohit There is only one data node at the moment. Does this mean the data is not recoverable? The HD on the machine seems fine, so I'm a little confused as to what caused the HDFS to become corrupted. The block files usually don't disappear easily. Check on the datanode if you find any files starting with blk. Also check the datanode log to see what happened there... maybe you started on a different directory or something like that. Raghu.
Re: DataNode stops cleaning disk?
This is unexpected unless some other process is eating up space. A couple of things to collect next time (along with the log): - All the contents under datanode-directory/ (especially including 'tmp' and 'current') - Does 'du' of this directory match what is reported to the NameNode (shown on the webui) by this DataNode? - Is there anything else taking disk space on the machine? Raghu. Igor Bolotin wrote: Normally I dislike writing about problems without being able to provide some more information, but unfortunately in this case I just can't find anything. Here is the situation - DFS cluster running Hadoop version 0.19.0. The cluster is running on multiple servers with practically identical hardware. Everything works perfectly well, except for one thing - from time to time one of the data nodes (every time it's a different node) starts to consume more and more disk space. The node keeps going and if we don't do anything - it runs out of space completely (ignoring the 20GB reserved space setting). Once restarted - it cleans the disk rapidly and goes back to approximately the same utilization as the rest of the data nodes in the cluster. Scanning datanode and namenode logs and comparing thread dumps (stacks) from nodes experiencing the problem and those that run normally didn't produce any clues. Running the balancer tool didn't help at all. FSCK shows that everything is healthy and the number of over-replicated blocks is not significant. To me - it just looks like at some point the data node stops cleaning invalidated/deleted blocks, but keeps reporting the space consumed by these blocks as not used. But I'm not familiar enough with the internals and just plain don't have enough free time to start digging deeper. Anyone has an idea what is wrong, or what else we can do to find out what's wrong, or maybe where to start looking in the code? Thanks, Igor
Re: Hadoop Write Performance
What is the Hadoop version? You could check the log on a datanode around that time and post any suspicious errors. For e.g. you can trace a particular block in the client and datanode logs. Most likely it is not a NameNode issue, but you can check the NameNode log as well. Raghu. Xavier Stevens wrote: Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this: java.io.IOException: Could not get block locations. Aborting... at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818) Which is perhaps indicating that the namenode is overwhelmed? Thanks, -Xavier
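The one-stream-per-thread pattern Xavier describes is itself sound. The sketch below shows its shape with `java.nio.file` standing in for `FileSystem.create()`/`FSDataOutputStream` so it stays self-contained; with HDFS each task would open its own stream from a shared `FileSystem` instance the same way, and the key invariant is the same: never share one output stream between threads.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Each worker thread owns exactly one output file and its stream.
public class ParallelWriters {
    public static void writeAll(Path dir, int nFiles) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < nFiles; i++) {
            final Path file = dir.resolve("part-" + i);
            pool.submit(() -> {
                try {
                    // stands in for: FSDataOutputStream out = fs.create(path);
                    Files.write(file, "record\n".getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("writers");
        writeAll(dir, 8);
        System.out.println("wrote 8 files under " + dir);
    }
}
```

Against HDFS, capping the pool size also bounds the number of concurrent write pipelines, which keeps pressure off the datanodes when the thread count would otherwise grow unbounded.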
Re: Too many open files in 0.18.3
Sean, A few things in your messages are not clear to me. Currently this is what I make out of it: 1) with a 1k limit, you do see the problem. 2) with a 16k limit - (?) not clear if you see the problem 3) with 8k you don't see the problem 3a) with or without the patch, I don't know. But if you do use the patch and things do improve, please let us know. Raghu. Sean Knapp wrote: Raghu, Thanks for the quick response. I've been beating up on the cluster for a while now and so far so good. I'm still at 8k... what should I expect to find with 16k versus 1k? The 8k didn't appear to be affecting things to begin with. Regards, Sean On Thu, Feb 12, 2009 at 2:07 PM, Raghu Angadi rang...@yahoo-inc.com wrote: You are most likely hit by https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back-ported. There is a 0.18 patch posted there. btw, does 16k help in your case? Ideally 1k should be enough (with a small number of clients). Please try the above patch with a 1k limit. Raghu. Sean Knapp wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below.
I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Re: Too many open files in 0.18.3
Sean Knapp wrote: Raghu, Apologies for the confusion. I was seeing the problem with any setting for dfs.datanode.max.xcievers... 1k, 2k and 8k. Likewise, I was also seeing the problem with different open file settings, all the way up to 32k. Since I installed the patch, HDFS has been performing much better. The current settings that work for me are 16k max open files with dfs.datanode.max.xcievers=8k, though under heavy balancer load I do start to hit the 16k max. Thanks. That clarifies many things. Even with the patch, if you have a lot of simultaneous clients, the DataNode does need a lot of file descriptors (something like 6 times the number of active 'xceivers'). In your case you seem to have a lot of simultaneous clients. I suggest increasing the file limit to something much higher (such as 64k). Raghu. Regards, Sean 2009/2/13 Raghu Angadi rang...@yahoo-inc.com
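The arithmetic behind the suggestion above is quick to check. A minimal sketch, where the factor of 6 descriptors per active xceiver is the rule of thumb from the message and the xceiver counts are the example figures from this thread:

```java
public class FdEstimate {
    // Rule of thumb from the message above: each active xceiver can hold
    // on the order of 6 file descriptors (block file, meta file, sockets, etc.).
    static final int FDS_PER_XCEIVER = 6;

    static int estimateFdLimit(int maxXceivers) {
        return FDS_PER_XCEIVER * maxXceivers;
    }

    public static void main(String[] args) {
        int xceivers = 8 * 1024;                 // dfs.datanode.max.xcievers = 8k
        int needed = estimateFdLimit(xceivers);  // roughly 48k descriptors
        System.out.println("estimated descriptors: " + needed);
        // A 16k ulimit is well below this estimate, which is consistent with
        // the suggestion of a much higher limit such as 64k.
        System.out.println("64k limit sufficient: " + (needed < 64 * 1024));
    }
}
```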
Re: Too many open files in 0.18.3
You are most likely hit by https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back ported. There is a 0.18 patch posted there. btw, does 16k help in your case? Ideally 1k should be enough (with small number of clients). Please try the above patch with 1k limit. Raghu. Sean Knapp wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Re: stable version
Vadim Zaliva wrote: The particular problem I am having is this one: https://issues.apache.org/jira/browse/HADOOP-2669 I am observing it in version 19. Could anybody confirm that it has been fixed in 18, as Jira claims? I am wondering why the fix for this problem might have been committed to the 18 branch but not 19. If it was committed to both, then perhaps the problem was not completely solved and downgrading to 18 will not help me. If you read through the comments, you will see that the root cause was never found. The patch just fixes one of the suspects. If you are still seeing this, please file another jira and link it to HADOOP-2669. How easy is it for you to reproduce this? I guess one of the reasons for the incomplete diagnosis is that it is not simple to reproduce. Raghu. Vadim On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote: Yes, version 18.3 is the most stable one. It has additional patches, without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Re: can't read the SequenceFile correctly
+1 on something like getValidBytes(). Just the existence of this would warn many programmers about getBytes(). Raghu. Owen O'Malley wrote: On Feb 6, 2009, at 8:52 AM, Bhupesh Bansal wrote: Hey Tom, I also got burned by this. Why does BytesWritable.getBytes() return invalid bytes? Or should we add a BytesWritable.getValidBytes() kind of function? It does it because continually resizing the array to the valid length is very expensive. It would be a reasonable patch to add a getValidBytes, but most methods in Java's libraries are aware of this and let you pass in byte[], offset, and length. So once you realize what the problem is, you can work around it. -- Owen
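The gotcha under discussion can be reproduced with a small stand-in class. This is an illustrative sketch of the BytesWritable growth pattern, not Hadoop's actual implementation; the class name and growth factor here are made up:

```java
import java.util.Arrays;

// Minimal stand-in for the BytesWritable pattern discussed above: the backing
// array is over-allocated and never shrunk, so getBytes() may return trailing
// garbage past the valid length.
public class GrowableBytes {
    private byte[] bytes = new byte[0];
    private int size = 0;

    public void set(byte[] data, int offset, int length) {
        if (bytes.length < length) {
            // Over-allocate (as BytesWritable does) to avoid resizing on every set().
            bytes = new byte[Math.max(length, bytes.length * 3 / 2)];
        }
        System.arraycopy(data, offset, bytes, 0, length);
        size = length;
    }

    public byte[] getBytes() { return bytes; }  // may be longer than the valid data!
    public int getLength()   { return size; }

    // The proposed getValidBytes(): correct, but costs a copy on every call.
    public byte[] getValidBytes() {
        return Arrays.copyOf(bytes, size);
    }
}
```

The safe zero-copy pattern, as Owen notes, is to pass `getBytes()`, `0`, and `getLength()` to APIs that accept an offset and length, e.g. `new String(w.getBytes(), 0, w.getLength())`.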
Re: Connect to namenode
I don't think it is intentional. Please file a jira with all the details about how to reproduce it (with actual configuration files). Thanks, Raghu. Habermaas, William wrote: After creation and startup of the hadoop namenode, you can only connect to the namenode via hostname and not IP. E.g. the hostname for the box is sunsystem07, the IP is 10.120.16.99. If you use the following URL, hdfs://10.120.16.99, to connect to the namenode, the following message will be printed: Wrong FS: hdfs://10.120.16.99:9000/, expected: hdfs://sunsystem07:9000 You will only be able to connect successfully if hdfs://sunsystem07:9000 is used. It seems reasonable to allow connection either by IP or name. Is there a reason for this behavior or is it a bug?
Re: problem with completion notification from block movement
Karl Kleinpaste wrote: On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote: The DataNodes use multiple threads with locking, and one of the assumptions is that the block report (once per hour by default) takes little time. The datanode will pause while the block report is running, and if it happens to take a while, weird things start to happen. Thank you for responding, this is very informative for us. Having looked through the source code with a co-worker regarding the periodic scan and then checking the logs once again, we find reports of this sort: BlockReport of 1158499 blocks got processed in 308860 msecs BlockReport of 1159840 blocks got processed in 237925 msecs BlockReport of 1161274 blocks got processed in 177853 msecs BlockReport of 1162408 blocks got processed in 285094 msecs BlockReport of 1164194 blocks got processed in 184478 msecs BlockReport of 1165673 blocks got processed in 226401 msecs The 3rd of these exactly straddles the particular example timeline I discussed in my original email about this question. I suspect I'll find more of the same as I look through other related errors. You could ask for a complete fix in https://issues.apache.org/jira/browse/HADOOP-4584 . I don't think the current patch there fixes your problem. Raghu. --karl
Re: Question about HDFS capacity and remaining
Doug Cutting wrote: Ext2 by default reserves 5% of the drive for use by root only. That'd be ~45GB of your 907GB capacity, which would account for most of the discrepancy. You can adjust this with tune2fs. plus, I think DataNode reports only 98% of the space by default. Raghu. Doug Bryan Duxbury wrote: There are no non-dfs files on the partitions in question. df -h indicates that there is 907GB capacity, but only 853GB remaining, with 200M used. The only thing I can think of is the filesystem overhead. -Bryan On Jan 29, 2009, at 4:06 PM, Hairong Kuang wrote: It's taken by non-dfs files. Hairong On 1/29/09 3:23 PM, Bryan Duxbury br...@rapleaf.com wrote: Hey all, I'm currently installing a new cluster, and noticed something a little confusing. My DFS is *completely* empty - 0 files in DFS. However, in the namenode web interface, the reported capacity is 3.49 TB, but the remaining is 3.25TB. Where'd that .24TB go? There are literally zero other files on the partitions hosting the DFS data directories. Where am I losing 240GB? -Bryan
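The reservation Doug mentions is simple arithmetic. A sketch using the figures from this thread (the 5% value comes from the message above; actual ext2 tuning may differ):

```java
public class DfsCapacityEstimate {
    // Space reserved for root by ext2's default 5% reservation, in GB.
    static double reservedGb(double diskGb, double reservedFraction) {
        return diskGb * reservedFraction;
    }

    public static void main(String[] args) {
        double diskGb = 907.0;                       // raw partition size from df -h
        double reserved = reservedGb(diskGb, 0.05);  // ext2 root-reserved blocks
        System.out.printf("ext2 reserved: %.1f GB%n", reserved);
        // Remaining after the reserved blocks; in the same ballpark as the
        // 853 GB df reports on the partition discussed above.
        System.out.printf("usable: %.1f GB%n", diskGb - reserved);
    }
}
```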
Re: tools for scrubbing HDFS data nodes?
Owen O'Malley wrote: On Jan 28, 2009, at 6:16 PM, Sriram Rao wrote: By scrub I mean, have a tool that reads every block on a given data node. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find it. The datanode already has a thread that checks the blocks periodically for exactly that purpose. This has been there since Hadoop 0.16.0; it scans all the blocks every 3 weeks (by default; the interval can be changed). Raghu.
Re: HDFS - millions of files in one directory?
Mark Kerzner wrote: Raghu, if I write all files only once, is the cost the same in one directory, or do I need to find the optimal directory size and, when full, start another bucket? If you write only once, then writing won't be much of an issue. You can write them in lexical order to help with buffer copies. These are all implementation details that a user should not depend on. That said, the rest of the discussion in this thread is going in the right direction: to get you to use fewer files by combining a lot of these small files. A large number of small files has overhead in many places in HDFS: strain on DataNodes, NameNode memory, etc. Raghu.
Re: Zeroconf for hadoop
Nitay wrote: Why not use the distributed coordination service ZooKeeper? When nodes come up they write some ephemeral file in a known ZooKeeper directory, and anyone who's interested, i.e. the NameNode, can put a watch on the directory and get notified when new children come up. The NameNode does not do active discovery. It is the DataNodes that contact the NameNode about their presence. So with ZooKeeper or zeroconf, the DataNodes should be able to discover who their NN is and connect to it. Raghu. On Mon, Jan 26, 2009 at 10:59 AM, Allen Wittenauer a...@yahoo-inc.com wrote: On 1/25/09 8:45 AM, nitesh bhatia niteshbhatia...@gmail.com wrote: Apple provides an opensource discovery service called Bonjour (zeroconf). Is it possible to integrate Zeroconf with Hadoop so that discovery of nodes becomes automatic? Presently for setting up a multi-node cluster we need to add IPs manually. Integrating it with Bonjour can make this process automatic. How do you deal with multiple grids? How do you deal with security?
Re: Zeroconf for hadoop
nitesh bhatia wrote: Hi Apple provides an opensource discovery service called Bonjour (zeroconf). Is it possible to integrate Zeroconf with Hadoop so that discovery of nodes becomes automatic? Presently for setting up a multi-node cluster we need to add IPs manually. Integrating it with Bonjour can make this process automatic. It would be nice to have. Note that it is the slaves (DataNodes, TaskTrackers) that need to do the discovery. The NameNode just passively accepts the DataNodes that want to be part of the cluster. In that sense, the NN should announce itself and the DNs try to find where the NN is. It would be nice to have the zeroconf feature in some form, and we might discover many more uses for it. Of course, a cluster should not require it. Raghu.
Re: Hadoop 0.19 over OS X : dfs error
nitesh bhatia wrote: Thanks. It worked. :) In hadoop-env.sh it is required to write the exact path for the Java framework. I changed it to export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home and it started. In hadoop 0.18.2 export JAVA_HOME=/Library/Java/Home is working fine. I am confused why we need to give the exact path in the 0.19 version. The most likely reason is that your /Library/Java/Home somehow ends up using JDK 1.5. 0.19 and up require JDK 1.6.x. Raghu. Thankyou --nitesh On Sun, Jan 25, 2009 at 1:52 PM, Joerg Rieger joerg.rie...@mni.fh-giessen.de wrote: Hello, what path did you set in conf/hadoop-env.sh? Before Hadoop 0.19 I had in hadoop-env.sh: export JAVA_HOME=/Library/Java/Home But that path, despite using Java Preferences to change Java versions, still uses the Java 1.5 version, e.g.: $ /Library/Java/Home/bin/java -version java version 1.5.0_16 Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b06-284) Java HotSpot(TM) Client VM (build 1.5.0_16-133, mixed mode, sharing) You have to change the setting to: export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home Joerg On 25.01.2009, at 00:16, nitesh bhatia wrote: Hi My current default settings are for java 1.6 nMac:hadoop-0.19.0 Aryan$ $JAVA_HOME/bin/java -version java version 1.6.0_07 Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode) The system is working fine with Hadoop 0.18.2. --nitesh On Sun, Jan 25, 2009 at 4:15 AM, Craig Macdonald cra...@dcs.gla.ac.uk wrote: Hi, I guess that the java on your PATH is different from the setting of your $JAVA_HOME env variable. Try $JAVA_HOME/bin/java -version? Also, there is a program called Java Preferences on each system for changing the default java version used. Craig nitesh bhatia wrote: Hi I am trying to setup Hadoop 0.19 on OS X.
Current Java Version is java version 1.6.0_07 Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode) When I am trying to format dfs using bin/hadoop dfs -format command. I am getting following errors: nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format Exception in thread main java.lang.UnsupportedClassVersionError: Bad version number in .class file at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:675) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124) at java.net.URLClassLoader.defineClass(URLClassLoader.java:260) at java.net.URLClassLoader.access$100(URLClassLoader.java:56) at java.net.URLClassLoader$1.run(URLClassLoader.java:195) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:316) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374) Exception in thread main java.lang.UnsupportedClassVersionError: Bad version number in .class file at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:675) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124) at java.net.URLClassLoader.defineClass(URLClassLoader.java:260) at java.net.URLClassLoader.access$100(URLClassLoader.java:56) at java.net.URLClassLoader$1.run(URLClassLoader.java:195) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:316) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374) I 
am not sure why this error is coming. I am having latest Java version. Can anyone help me out with this? Thanks Nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat --
Re: HDFS loosing blocks or connection error
It seems hdfs isn't so robust or reliable as the website says and/or I have a configuration issue. quite possible. How robust does the website say it is? I agree debugging failures like the following is pretty hard for casual users. You need to look at the logs for the block, or run 'bin/hadoop fsck /stats.txt', etc. The reason could be as simple as no live datanodes, or as complex as strange network behavior triggering a bug in DFSClient. You can start by looking at or attaching the client log around the lines that contain the block id. Also, note the version you are running. Raghu. Zak, Richard [USA] wrote: Might there be a reason for why this seems to routinely happen to me when using Hadoop 0.19.0 on Amazon EC2? 09/01/23 11:45:52 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block 09/01/23 11:45:55 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block 09/01/23 11:45:58 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block 09/01/23 11:46:01 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1757733438820764312_6736 file=/stats.txt Richard J. Zak
Re: HDFS - millions of files in one directory?
If you are adding and deleting files in the directory, you might notice CPU penalty (for many loads, higher CPU on NN is not an issue). This is mainly because HDFS does a binary search on files in a directory each time it inserts a new file. If the directory is relatively idle, then there is no penalty. Raghu. Mark Kerzner wrote: Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance a directory tree if I want to store millions of files, or can I put them all in the same directory? Thank you, Mark
Re: HDFS - millions of files in one directory?
Raghu Angadi wrote: If you are adding and deleting files in the directory, you might notice a CPU penalty (for many loads, higher CPU on the NN is not an issue). This is mainly because HDFS does a binary search on files in a directory each time it inserts a new file. I should add that an equal or even bigger cost is the memmove that ArrayList does when you add or delete entries. An ArrayList, rather than a map, is used mainly to save memory, the most precious resource for the NameNode. Raghu. If the directory is relatively idle, then there is no penalty. Raghu. Mark Kerzner wrote: Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance a directory tree if I want to store millions of files, or can I put them all in the same directory? Thank you, Mark
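The two costs described above — the binary search to find the slot, and the memmove ArrayList performs on insert and delete — can be modeled with plain java.util collections. This is an illustrative sketch of the behavior, not the NameNode's actual directory code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Models the NameNode directory structure described above: children kept
// sorted in an ArrayList to save memory, at the cost of an O(log n) search
// plus an O(n) element shift (the memmove) on every insert or delete.
public class SortedChildren {
    private final List<String> names = new ArrayList<>();

    public boolean add(String name) {
        int pos = Collections.binarySearch(names, name);   // O(log n)
        if (pos >= 0) return false;                        // already present
        names.add(-pos - 1, name);                         // O(n): shifts the tail
        return true;
    }

    public boolean remove(String name) {
        int pos = Collections.binarySearch(names, name);
        if (pos < 0) return false;
        names.remove(pos);                                 // O(n) shift again
        return true;
    }

    public int size() { return names.size(); }
}
```

Note that inserting names in lexical order always appends at the end, so the shift moves nothing — consistent with the advice earlier in this thread to write files in lexical order; random-order inserts move half the list on average.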
Re: 0.18.1 datanode psuedo deadlock problem
Sagar Naik wrote: Hi Raghu, The periodic 'du' and block report threads thrash the disk (a block report takes about 21 minutes on average), and I think all the datanode threads are not able to do much and freeze. Yes, that is the known problem we talked about in the earlier mails in this thread. When you have millions of blocks, one hour for the du and block report intervals is too often for you. Maybe you could increase it to something like 6 or 12 hours. It still does not fix the block report problem, since the DataNode does the scan in-line. As I mentioned in earlier mails, we should really fix the block report problem. A simple fix would scan the directories in the background (very slowly, unlike du). Even after fixing block reports, you should be aware that an excessive number of blocks does impact performance. No system can guarantee performance when overloaded. What we want to do is to make Hadoop degrade gracefully rather than have DNs get killed. Raghu.
Re: 0.18.1 datanode psuedo deadlock problem
The scan required for each block report is a well known issue and it can be fixed. It was discussed multiple times (e.g. https://issues.apache.org/jira/browse/HADOOP-3232?focusedCommentId=12587795#action_12587795 ). Earlier, inline 'du' on datanodes used to cause the same problem, and they were moved to a separate thread (HADOOP-3232). Block reports can do the same... Though 2M blocks on a DN is very large, there is no reason block reports should break things. Once we fix block reports, something else might break, but that is a different issue. Raghu. Jason Venner wrote: The problem we are having is that datanodes periodically stall for 10-15 minutes and drop off the active list and then come back. What is going on is that a long operation set is holding the lock on FSDataset.volumes, and all of the other block service requests stall behind this lock. DataNode: [/data/dfs-video-18/dfs/data] daemon prio=10 tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0] java.lang.Thread.State: RUNNABLE at java.lang.String.lastIndexOf(String.java:1628) at java.io.File.getName(File.java:399) at org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148) at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181) at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412) at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511) - locked 0x551e8d48 (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet) at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053) at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708) at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890) at java.lang.Thread.run(Thread.java:619) This is basically taking a stat on every hdfs block on the datanode, which in our case is ~2 million, and can take 10+ minutes (we may be experiencing problems with our raid controller but have no visibility into it); at the OS level the file system seems fine and
operations eventually finish. It appears that a couple of different data structures are being locked with the single object FSDataset$Volume. Then this happens: org.apache.hadoop.dfs.datanode$dataxcei...@1bcee17 daemon prio=10 tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473) - waiting to lock 0x551e8d48 (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet) at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934) - locked 0x54e550e0 (a org.apache.hadoop.dfs.FSDataset) at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:2322) at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045) at java.lang.Thread.run(Thread.java:619) which locks the FSDataset while waiting on the volume object and now all of the Datanode operations stall waiting on the FSDataset object. -- Our particular installation doesn't use multiple directories for hdfs, so a first simple hack for a local fix would be to modify getNextVolume to just return the single volume and not be synchronized A richer alternative would be to make the locking more fine grained on FSDataset$FSVolumeSet. Of course we are also trying to fix the file system performance and dfs block loading that results in the block report taking a long time. Any suggestions or warnings? Thanks.
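Beyond finer-grained locking, the usual fix for the pattern described above is to copy the shared state under the lock and do the slow per-block work outside it, so writers are only blocked for the duration of the copy. A simplified, self-contained sketch of that idea (not the actual FSDataset code; statBlock() stands in for the disk stat):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "don't hold the lock for the whole scan" fix discussed above.
public class BlockReporter {
    private final Object lock = new Object();
    private final List<Long> blocks = new ArrayList<>();

    public void addBlock(long id) {
        synchronized (lock) { blocks.add(id); }
    }

    // Bad: holds the lock while stat-ing every block — minutes for ~2M blocks,
    // during which every writer (like writeToBlock) stalls behind the lock.
    public List<Long> reportHoldingLock() {
        synchronized (lock) {
            List<Long> report = new ArrayList<>();
            for (long id : blocks) report.add(statBlock(id));
            return report;
        }
    }

    // Better: snapshot the block list under the lock, then do the slow
    // per-block work outside it.
    public List<Long> reportWithSnapshot() {
        List<Long> snapshot;
        synchronized (lock) { snapshot = new ArrayList<>(blocks); }
        List<Long> report = new ArrayList<>();
        for (long id : snapshot) report.add(statBlock(id));
        return report;
    }

    private long statBlock(long id) { return id; } // stands in for a disk stat
}
```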
Re: 0.18.1 datanode psuedo deadlock problem
2M files is excessive, but there is no reason block reports should break. My preference is to make block reports handle this better. DNs dropping in and out of the cluster cause too many other problems. Raghu. Konstantin Shvachko wrote: Hi Jason, 2 million blocks per data-node is not going to work. There were discussions about it previously, please check the mail archives. This means you have a lot of very small files, which HDFS is not designed to support. A general recommendation is to group small files into large ones, introducing some kind of record structure delimiting those small files, and control it on the application level. Thanks, --Konstantin
Re: xceiverCount limit reason
Jean-Adrien wrote: Is it the responsibility of the hadoop client to manage its connection pool with the server? In which case the problem would be an HBase problem? Anyway, I found my problem; it is not a matter of performance. Essentially, yes. The client has to close the file to relinquish connections, if clients are using the common read/write interface. Currently if a client keeps many hdfs files open, it results in many threads held at the DataNodes. As you noticed, the timeout at DNs helps. Various solutions are possible at different levels: application (hbase), Client API, HDFS, etc. https://issues.apache.org/jira/browse/HADOOP-3856 is a proposal at the HDFS level. Raghu.
Re: Question about the Namenode edit log and syncing the edit log to disk. 0.19.0
Did you look at FSEditLog.EditLogFileOutputStream.flushAndSync()? This code was re-organized some time back, but the guarantees it provides should be exactly the same as before. Please let us know otherwise. Raghu. Jason Venner wrote: I have always assumed (which is clearly my error) that edit log writes were flushed to storage to ensure that the edit log was consistent during machine crash recovery. I have been working through FSEditLog.java and I don't see any calls of force(true) on the file channel or sync on the file descriptor, and the edit log is not opened with an 's' or 'd', i.e. the open flags are rw and not rws or rwd. The only thing I see in the code is that the space in the file where the updates will be written is preallocated. Have I missed the mechanism by which the edit log data is flushed to the disk? Is the edit log data not forcibly flushed to the disk, instead relying on the host operating system to perform the physical writes at a later date? Thanks -- Jason
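For reference, the JDK primitives in question are FileChannel.force(boolean) and the "rws"/"rwd" open modes of RandomAccessFile. A minimal stand-alone illustration of the force() path (the record text is made up; this is not the FSEditLog code):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class EditLogSyncDemo {
    // Writes one record and forces it to the device; returns bytes on disk.
    static long writeAndSync(byte[] record) throws IOException {
        File log = File.createTempFile("editlog", ".bin");
        log.deleteOnExit();
        // Opened "rw": writes land in the OS page cache, so an explicit
        // force() is required to push them to stable storage. Opening with
        // "rwd" (data) or "rws" (data + metadata) would sync on every write.
        try (RandomAccessFile raf = new RandomAccessFile(log, "rw")) {
            FileChannel ch = raf.getChannel();
            ch.write(ByteBuffer.wrap(record));
            ch.force(true);   // flush data *and* file metadata
        }
        return log.length();
    }

    public static void main(String[] args) throws IOException {
        long n = writeAndSync("OP_ADD /some/file".getBytes());
        System.out.println("synced " + n + " bytes");
    }
}
```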
Re: cannot allocate memory error
Your OS is running out of memory. Usually this is a sign of too many processes (or threads) on the machine. Check what else is happening on the system. Raghu. sagar arlekar wrote: Hello, I am new to hadoop. I am running hadoop 0.17 in a Eucalyptus cloud instance (it's a CentOS image on Xen). bin/hadoop dfs -ls / gives the following Exception: 08/12/31 08:58:10 WARN fs.FileSystem: localhost:9000 is a deprecated filesystem name. Use hdfs://localhost:9000/ instead. 08/12/31 08:58:10 WARN fs.FileSystem: uri=hdfs://localhost:9000 javax.security.auth.login.LoginException: Login failed: Cannot run program whoami: java.io.IOException: error=12, Cannot allocate memory at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257) at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67) at org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1353) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1289) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108) at org.apache.hadoop.fs.FsShell.init(FsShell.java:87) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1717) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1866) Bad connection to FS. command aborted. Running the command again gives: bin/hadoop dfs -ls / Error occurred during initialization of VM Could not reserve enough space for object heap Changing the value of the 'mapred.child.java.opts' property in hadoop-site.xml did not help. Kindly help me. What could I do to give more memory to hadoop? BTW is there a way to search through the mail archive?
I only saw the mails listed according to year and months. Regards, Sagar
Re: Performance testing
Sandeep Dhawan wrote: Hi, I am trying to create a hadoop cluster which can handle 2000 write requests per second. In each write request I would be writing a line of size 1KB in a file. This is essentially a matter of deciding how many datanodes (with the given configuration) you need to write 3*2000*2 files per second (assuming each 1KB is a separate HDFS file). You can test this on a single datanode. For example, if your datanode supports 1000 of the 1KB files per second (even with multiple processes creating at the same time), then you need 12 datanodes (+ any factor of safety you want to add). How many nodes or disks do you have approximately? Raghu. I would be using a machine having the following configuration: Platform: Red Hat Linux 9.0 CPU: 2.07 GHz RAM: 1GB Can anyone help in giving me some pointers/guidelines as to how to go about setting up such a cluster? What are the configuration parameters in hadoop which we can tweak to enhance the performance of the hadoop cluster? Thanks, Sandeep
Re: Performance testing
I should add that your test should both create and delete files. Raghu. Raghu Angadi wrote: Sandeep Dhawan wrote: Hi, I am trying to create a hadoop cluster which can handle 2000 write requests per second. In each write request I would be writing a line of size 1KB in a file. This is essentially a matter of deciding how many datanodes (with the given configuration) you need to write 3*2000*2 files per second (assuming each 1KB is a separate HDFS file). You can test this on a single datanode. For example, if your datanode supports 1000 of the 1KB files per second (even with multiple processes creating at the same time), then you need 12 datanodes (+ any factor of safety you want to add). How many nodes or disks do you have approximately? Raghu. I would be using a machine having the following configuration: Platform: Red Hat Linux 9.0 CPU: 2.07 GHz RAM: 1GB Can anyone help in giving me some pointers/guidelines as to how to go about setting up such a cluster? What are the configuration parameters in hadoop which we can tweak to enhance the performance of the hadoop cluster? Thanks, Sandeep
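The sizing rule in the reply above reduces to ceiling division. A sketch where the per-node rate of 1000 creates per second is the example figure from the message, not a measured number:

```java
public class ClusterSizing {
    // Files the cluster must absorb per second: replication copies per file,
    // times the write rate, times a factor of 2 so the test covers deletes too.
    static int requiredDatanodes(int writesPerSec, int replication,
                                 int createDeleteFactor, int perNodeRate) {
        int total = replication * writesPerSec * createDeleteFactor;
        return (total + perNodeRate - 1) / perNodeRate;  // ceiling division
    }

    public static void main(String[] args) {
        // With the figures from the message: 3 * 2000 * 2 / 1000 = 12 datanodes,
        // before adding any factor of safety.
        System.out.println(requiredDatanodes(2000, 3, 2, 1000));
    }
}
```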
Re: Map-Reduce job is using external IP instead of internal
Your configuration for the task tracker and job tracker might be using external hostnames. Essentially, any hostnames in the configuration files should resolve to internal IPs. Raghu. Genady wrote: Hi, We're using a Hadoop 0.18.2/Hbase 0.18.1 four-node cluster on CentOS Linux. In /etc/hosts the following mappings were added to make hadoop use internal IPs (fast Ethernet 1GB card): 10.1.0.56 master.hostname master 10.1.0.55 slave1.hostname slave1 10.1.0.53 slave3.hostname slave3 10.1.0.50 slave2.hostname slave2 If I do a local copy to hadoop dfs it works perfectly; hadoop uses only the internal IP to copy data. But when a Map-Reduce job starts, it's clear (from monitoring network card data) that hadoop is using the external IP in addition to the internal one, at pretty big rates, ~5Mb/s on each node. Is there something I should add to the hadoop configuration to force hadoop to use only the internal (LAN) IP? Genady Gilin
Re: DFS replication and Error Recovery on failure
Konstantin Shvachko wrote: 1) If I set the value of dfs.replication to 3 only in hadoop-site.xml of the namenode (master) and then restart the cluster, will this take effect, or do I have to change hadoop-site.xml on all slaves? dfs.replication is the name-node parameter, so you need to restart only the name-node in order to reset the value. Actually this is a client parameter; the NameNode strictly does not need to be restarted. All the files created by clients using the new value will have the new replication. Raghu.
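Since dfs.replication is read on the client side at file-creation time, it is enough to set it in the hadoop-site.xml seen by the writing clients; files created afterwards pick up the new value. A minimal config sketch:

```xml
<!-- hadoop-site.xml on the client machines; applies to files created from now on -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```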
Re: Does datanode acts as readonly in case of DiskFull ?
Sagar Naik wrote: Hi, I would like to know what happens in case of DiskFull on a datanode. Does the datanode act as a block server only? Yes. I think so. Does it reject any more block creation requests, or does the Namenode not list it for new blocks? Yes. The NN will not allocate it any more blocks. Did you notice anything different? (which is quite possible). Raghu. Hadoop 18 -Sagar
Re: Hit a roadbump in solving truncated block issue
Brian Bockelman wrote: Hey, I hit a bit of a roadbump in solving the truncated block issue at our site: namely, some of the blocks appear perfectly valid to the datanode. The block verifies, but it is still the wrong size (it appears that the metadata is too small too). What's the best way to proceed? It appears that either (a) the block scanner needs to report to the datanode the size of the block it just verified, which is possibly a scaling issue, or (b) the metadata file needs to save the correct block size, which is a pretty major modification, as it requires a change of the on-disk format. This should be detected by the NameNode, i.e. it should detect that this replica is shorter (either compared to other replicas or to the expected size). There are various fixes (recent or being worked on) in this area of the NameNode, and it is mostly covered by one of those or should be soon. Raghu. Ideas? Brian
Re: File loss at Nebraska
Brian Bockelman wrote: On Dec 9, 2008, at 4:58 PM, Edward Capriolo wrote: Also it might be useful to strongly word hadoop-default.conf, as many people might not know a downside exists for using 2 rather than 3 as the replication factor. Before reading this thread I would have thought 2 to be sufficient. I think 2 should be sufficient, but running with 2 replicas instead of 3 exposes some namenode bugs which are harder to trigger. Whether 2 is sufficient or not, I completely agree with the latter part. We should treat this as what I think it fundamentally is: fixing the Namenode. I guess lately some of these bugs either got more likely or some similar bugs crept in. Sticking with 3 is very good advice for maximizing reliability.. but from an opportunistic developer point of view, a big cluster running with replication of 2 is a great test case :-).. overall I think it is a good thing for Hadoop. Raghu.
Re: Hadoop datanode crashed - SIGBUS
FYI: the Datanode does not run any user code and does not link with any native/JNI code. Raghu. Chris Collins wrote: Was there anything mentioned as part of the tombstone message about a problematic frame? What Java are you using? There are a few reasons for SIGBUS errors; one is illegal address alignment, but from Java that's very unlikely. There were some issues with the native zip library in older VMs. As Brian pointed out, sometimes this points to a hardware issue. C On Dec 1, 2008, at 1:32 PM, Sagar Naik wrote: Brian Bockelman wrote: Hardware/memory problems? I'm not sure. SIGBUS is relatively rare; it sometimes indicates a hardware error in the memory system, depending on your arch. *uname -a:* Linux hdimg53 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 2006 i686 i686 i386 GNU/Linux *top's top* Cpu(s): 0.1% us, 1.1% sy, 0.0% ni, 98.0% id, 0.8% wa, 0.0% hi, 0.0% si Mem: 8288280k total, 1575680k used, 6712600k free, 5392k buffers Swap: 16386292k total, 68k used, 16386224k free, 522408k cached 8 core, xeon 2GHz Brian On Dec 1, 2008, at 3:00 PM, Sagar Naik wrote: A couple of the datanodes crashed with the following error. The /tmp is 15% occupied # # An unexpected error has been detected by Java Runtime Environment: # # SIGBUS (0x7) at pc=0xb4edcb6a, pid=10111, tid=1212181408 # [Too many errors, abort] Please suggest how I should go about debugging this particular problem -Sagar Thanks to Brian -Sagar
Re: Namenode BlocksMap on Disk
Dennis Kubes wrote: From time to time a message pops up on the mailing list about OOM errors for the namenode because of too many files. Most recently there was a 1.7 million file installation that was failing. I know the simple solution to this is to have a larger java heap for the namenode. But the non-simple way would be to convert the BlocksMap for the NameNode to be stored on disk and then queried and updated for operations. This would eliminate memory problems for large file installations but also might degrade performance slightly. Questions: 1) Is there any current work to allow the namenode to store on disk versus in memory? This could be a configurable option. 2) Besides possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk? As Doug mentioned, the main worry is that this would drastically reduce performance. Part of the reason is that a large chunk of the work on the NameNode happens under a single global lock, so if there is a disk seek under this lock, it affects everything else. One good long-term fix for this is to make it easy to split the namespace between multiple namenodes.. There was some work done on supporting volumes. Also, the fact that HDFS now supports symbolic links might make it easier for someone adventurous to use that as a quick hack to get around this. If you have a rough prototype implementation, I am sure there will be a lot of interest in evaluating it. If Java has any disk-based or memory-mapped data structures, that might be the quickest way to test its effects. Raghu.
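On the last point: Java's java.nio memory-mapped buffers are probably the quickest way to prototype such an on-disk structure. The sketch below is purely hypothetical and not Hadoop code; the fixed-size record layout and all names are made up for illustration:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Treats a mapped file as a flat array of fixed-size records, the kind of
// layout a disk-backed BlocksMap prototype could start from.
public class MappedRecordStore {
    static final int RECORD_SIZE = 16; // e.g. 8-byte block id + 8-byte value

    // Writes one record and reads it back through the mapping.
    static long roundTrip() throws Exception {
        File f = File.createTempFile("blocksmap", ".dat");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            MappedByteBuffer buf = raf.getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, RECORD_SIZE * 1000);
            buf.putLong(42 * RECORD_SIZE, 123456789L);  // block id slot
            buf.putLong(42 * RECORD_SIZE + 8, 7L);      // per-block value
            return buf.getLong(42 * RECORD_SIZE);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip()); // prints 123456789
    }
}
```

The OS page cache does the buffering here, so hot records stay in memory while cold ones spill to disk, which is roughly the memory/performance trade-off discussed above.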
NN JVM process takes a lot more memory than assigned
There is one instance of the NN where the JVM process takes 40GB of memory though the JVM is started with 24GB. The Java heap is still 24GB; it looks like the process ends up taking a lot of memory outside the heap. There are a lot of entries in pmap similar to the ones below that account for the difference. Does anyone know what this might be? From 'pmap -d':
2ab50800 64308 rwx-- 2ab50800 000:0 [ anon ]
2ab50becd0001228 - 2ab50becd000 000:0 [ anon ]
2ab50c00 64328 rwx-- 2ab50c00 000:0 [ anon ]
2ab50fed20001208 - 2ab50fed2000 000:0 [ anon ]
2ab51000 64720 rwx-- 2ab51000 000:0 [ anon ]
2ab513f34000 816 - 2ab513f34000 000:0 [ anon ]
2ab51400 64552 rwx-- 2ab51400 000:0 [ anon ]
Raghu.
Re: Hadoop 18.1 ls stalls
Give an example stack trace of a thread that is blocked on the callQ and a handler thread that is not. A trace of all the threads would be even better. If your handlers are waiting on the callQ, then they are waiting for work. Raghu. Sagar Naik wrote: *Problem:* The ls is taking a noticeable time to respond. *System:* I have about 1.6 million files and the namenode is pretty much full with heap (2400MB). I have configured dfs.handler.count to 100, and all IPC server threads are locked onto a Queue (I think the callQueue). In the namenode logs, I occasionally see an OutOfMemory exception. Any suggestions to improve the response time of the system? I can increase dfs.handler.count to 256, but I guess I will see more OOM exceptions.
Re: Hadoop 18.1 ls stalls
Sagar Naik wrote: Thanks Raghu, *datapoints:* - So when I use the FsShell client, it gets into retry mode for the getFilesInfo() call and takes a long time. What does retry mode mean? - Also, when I do an ls operation, it takes a few (4-5) seconds. - 1.6 million files and the namenode is mostly full with heap (2400M) (from the UI). When you say 'ls', how many entries does it return? (i.e. ls of one file, or -lsr of thousands of files, etc.) None of the IPC threads in your stack trace is doing any work.
Re: ipc problems afer upgrading to hadoop 0.18
Johannes Zillmann wrote: OK, that an exception is thrown was right (this happened in an ipc-server-not-reachable test). I just missed the fact that a different exception is now thrown compared to previous versions. Previous versions threw exceptions like ConnectException or SocketTimeoutException; now java.io.IOException: Call failed on local exception is thrown. It will soon be fixed: https://issues.apache.org/jira/browse/HADOOP-4659 Raghu. That's it! thanks Johannes On Nov 18, 2008, at 7:58 PM, Johannes Zillmann wrote: Hi there, we have a custom client-server based on hadoop's ipc. Now after upgrading to hadoop 0.18.1 (from 0.17) I get the following exception: ... Caused by: java.io.IOException: Call failed on local exception at org.apache.hadoop.ipc.Client.call(Client.java:718) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) ... Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:358) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441) Any ideas? Johannes ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com
Re: hadoop 0.18.2 Checksum ok was sent and should not be sent again
Rong-en Fan wrote: I believe it was for debugging purposes and was removed after 0.18.2 was released. Yes. It is fixed in 0.18.3 (HADOOP-4499). Raghu. On Mon, Nov 17, 2008 at 8:57 PM, Alexander Aristov [EMAIL PROTECTED] wrote: Hi all, I upgraded hadoop to the 0.18.2 version and tried to run a test job, a distcp from S3 to HDFS. I got a lot of info-level errors, although the job finished successfully. Any ideas? Can I simply suppress INFOs in log4j and forget about the error?
Re: Large number of deletes takes out DataNodes
This is a long-known issue.. deleting files takes a lot of time and the datanode does not heartbeat during that time. Please file a jira so that the issue percolates up :) There are more of these cases that result in a datanode being marked dead. As a workaround, you can double or triple heartbeat.recheck.interval (default 5000 milliseconds) in the config. Raghu. Jonathan Gray wrote: In many of my jobs I create intermediate data in HDFS. I keep this around for a number of days for inspection until I delete it in large batches. If I attempt to delete all of it at once, the flood of delete messages to the datanodes seems to cause starvation, as they do not seem to respond to the namenode heartbeat. There is about 5% cpu utilization and the logs just show the deletion of blocks. In the worst case (if I'm deleting a few terabytes, about 50% of total capacity across 10 nodes) this causes the master to expire the datanode leases. Once the datanode finishes deletions, it reports back to the master and is added back to the cluster. At this point, its blocks have already started to be reassigned, so it then starts as an empty node. In one run, this happened to 8 out of 10 nodes before getting back to a steady state. There were a couple of moments during that run when a number of the blocks had replication 1. Obviously I can handle this by deleting less at any one time, but it seems like there might be something wrong. With no CPU utilization, why does the datanode not respond to the namenode? Thanks. Jonathan Gray
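As a config sketch of the workaround above (this goes in hadoop-site.xml; 15000 is simply triple the 5000 ms default quoted in the thread):

```xml
<property>
  <name>heartbeat.recheck.interval</name>
  <value>15000</value>
  <!-- tripled from the quoted 5000 ms default, as suggested above -->
</property>
```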
Re: Datanode block scans
Brian Bockelman wrote: Hey all, I noticed that the maximum throttle for the datanode block scanner is hardcoded at 8MB/s. I think this is insufficient; on a fully loaded Sun Thumper, a full scan at 8MB/s would take something like 70 days. Is it possible to make this throttle a bit smarter? At the very least, would anyone object to a patch which exposed this throttle as a config option? Alternately, a smarter idea would be to throttle the block scanner at (8MB/s) * (# of volumes), under the assumption that there is at least 1 disk per volume. Making the max configurable seems useful. Either of the above options is fine, though the first one might be simpler for configuration. 8MB/s is calculated for around 4TB of data on a node: given 80k seconds a day, that is around 6-7 days. 8-10 MB/s is not too bad a load on a 2-4 disk machine. Hm... on second thought, however trivial the resulting disk I/O would be, on the Thumper example the maximum throttle would be 3Gbps: that's a nontrivial load on the bus. How do other big sites handle this? We're currently at 110TB raw, are considering converting ~240TB over from another file system, and are planning to grow to 800TB during 2009. A quick calculation shows that to do a weekly scan at that size, we're talking ~10Gbps of sustained reads. You have 110TB on a single datanode and are moving to 800TB nodes? Note that this rate applies to the amount of data on a single datanode. Raghu. I still worry that the rate is too low; if we have a suspicious node, or users report a problematic file, waiting a week for a full scan is too long. I've asked a student to implement a tool which can trigger a full block scan of a path (the idea would be to be able to do hadoop fsck /path/to/file -deep). What would be the best approach for him to take to initiate a high-rate full-volume or full-datanode scan?
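The scan-time arithmetic in this thread is easy to check: days = bytes-on-the-datanode / scan-rate / 86400. A small sketch reproducing the numbers quoted above:

```java
public class ScanTime {
    // How many days a full block scan takes at a given throttle rate.
    static double scanDays(double terabytes, double mbPerSec) {
        double seconds = terabytes * 1024 * 1024 / mbPerSec; // TB -> MB, then /rate
        return seconds / 86400.0;
    }

    public static void main(String[] args) {
        System.out.printf("4TB  @ 8MB/s: %.1f days%n", scanDays(4, 8));  // ~6 days
        System.out.printf("48TB @ 8MB/s: %.1f days%n", scanDays(48, 8)); // ~73 days
    }
}
```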
Re: Datanode block scans
How often is safe depends on what probabilities you are willing to accept. I just checked on one of our clusters with 4PB of data; the scanner fixes about 1 block a day. Assuming an avg size of 64MB per block (pretty high), the probability that all 3 replicas of one block go bad in 3 weeks is on the order of 1e-12. In reality it is likely 2-3 orders of magnitude less probable. Raghu. Brian Bockelman wrote: On Nov 13, 2008, at 11:32 AM, Raghu Angadi wrote: Brian Bockelman wrote: Hey all, I noticed that the maximum throttle for the datanode block scanner is hardcoded at 8MB/s. I think this is insufficient; on a fully loaded Sun Thumper, a full scan at 8MB/s would take something like 70 days. Is it possible to make this throttle a bit smarter? At the very least, would anyone object to a patch which exposed this throttle as a config option? Alternately, a smarter idea would be to throttle the block scanner at (8MB/s) * (# of volumes), under the assumption that there is at least 1 disk per volume. Making the max configurable seems useful. Either of the above options is fine, though the first one might be simpler for configuration. 8MB/s is calculated for around 4TB of data on a node: given 80k seconds a day, that is around 6-7 days. 8-10 MB/s is not too bad a load on a 2-4 disk machine. Hm... on second thought, however trivial the resulting disk I/O would be, on the Thumper example the maximum throttle would be 3Gbps: that's a nontrivial load on the bus. How do other big sites handle this? We're currently at 110TB raw, are considering converting ~240TB over from another file system, and are planning to grow to 800TB during 2009. A quick calculation shows that to do a weekly scan at that size, we're talking ~10Gbps of sustained reads. You have 110TB on a single datanode and are moving to 800TB nodes? Note that this rate applies to the amount of data on a single datanode. Nah - 110TB total in the system (200 datanodes), and we will move to 800TB total (probably 250 datanodes). 
However, we do have some larger nodes (we range from 80GB to 48TB per node); recent and planned purchases are in the 4-8TB per node range, but I'd sure hate to throw away 48TB of disks :) On the 48TB node, a scan at 8MB/s would take 70 days. I'd have to run at a rate of 80MB/s to scan through in 7 days. While 80MB/s over 48 disks is not much, I was curious how the rest of the system would perform (the node is in production on a different file system right now, so borrowing it is not easy...); 80MB/s sounds like an awful lot for background noise. Do any other large sites run such large nodes? How long a period between block scans do sites use in order to feel safe? Brian Raghu. I still worry that the rate is too low; if we have a suspicious node, or users report a problematic file, waiting a week for a full scan is too long. I've asked a student to implement a tool which can trigger a full block scan of a path (the idea would be to be able to do hadoop fsck /path/to/file -deep). What would be the best approach for him to take to initiate a high-rate full-volume or full-datanode scan?
Re: File Descriptors not cleaned up
Jason Venner wrote: We have just realized one reason for the '/no live node contains block/' error from /DFSClient/: it is an indication that the /DFSClient/ was unable to open a connection due to insufficient available file descriptors. FsShell is particularly bad about consuming descriptors and leaving the containing objects for the Garbage Collector to reclaim the descriptors. We will submit a patch in a few days. Please do. We know more now than when I last replied on July 31st: Hadoop itself does not have any finalizers that depend on GC to close fds. If GC is affecting the number of fds, you are likely a victim of HADOOP-4346. thanks, Raghu.
Re: LeaseExpiredException and too many xceiver
The config on most Y! clusters sets dfs.datanode.max.xcievers to a large value, something like 1k to 2k. You could try that. Raghu. Nathan Marz wrote: Looks like the exception on the datanode got truncated a little bit. Here's the full exception: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.115-50010-1225485937590, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256 at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) On Oct 31, 2008, at 2:49 PM, Nathan Marz wrote: Hello, We are seeing some really bad errors on our hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with Could not find block locations... errors. In the namenode logs, we see a ton of errors like: 2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$ org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/dustintmp/shredded_dataunits/_temporary/_attempt_200810311418_0002_m_23_0$ at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166) at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) In the datanode logs, we see a ton of errors like: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$ of concurrent xcievers 256 at 
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) Anyone have any ideas on what may be wrong? Thanks, Nathan Marz Rapleaf
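For reference, the setting Raghu mentions goes in hadoop-site.xml on the datanodes. Note that the historical misspelling "xcievers" is the actual key name; 2048 is the upper end of the range he suggests:

```xml
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
```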
Re: TaskTrackers disengaging from JobTracker
Devaraj Das wrote: I wrote a patch to address the NPE in JobTracker.killJob() and compiled it against TRUNK. I've put this on the cluster and it's now been holding steady for the last hour or so.. so that plus whatever other differences there are between 18.1 and TRUNK may have fixed things. (I'll submit the patch to the JIRA as soon as it finishes cranking through the JUnit tests.) Aaron, I don't think this is a solution to the problem you are seeing. The IPC handlers are tolerant of exceptions. In particular, they must not die in the event of an exception during RPC processing. Could you please get a stack trace of the JobTracker threads (without your patch) when the TTs are unable to talk to it. Access the URL http://jt-host:jt-info-port/stacks That will tell us what the handlers are up to. Devaraj forwarded the stacks that Aaron sent. As he suspected, there is a deadlock in the RPC server. I will file a blocker for 0.18 and above. This deadlock is more likely on a busy network. Raghu.
Re: Datanode not detecting full disk
Stefan Will wrote: Hi Raghu, Each DN machine has 3 partitions, e.g.: Filesystem Size Used Avail Use% Mounted on /dev/sda1 20G 8.0G 11G 44% / /dev/sda3 1.4T 756G 508G 60% /data tmpfs 3.9G 0 3.9G 0% /dev/shm All of the paths in hadoop-site.xml point to /data, which is the partition that filled up to 100% (I deleted a bunch of files from HDFS since then). So I guess the question is whether the DN looks at just the partition its data directory is on, or at all partitions, when it determines disk usage. The Datanode checks df on /data alone. What is dfs.df.interval set to? Also, if you set multiple paths in dfs.data.dir, the available space for each of these adds up; that would be wrong in your case since all of them are under one partition. Raghu.
Re: TaskTrackers disengaging from JobTracker
Raghu Angadi wrote: Devaraj fwded the stacks that Aaron sent. As he suspected there is a deadlock in RPC server. I will file a blocker for 0.18 and above. This deadlock is more likely on a busy network. Aaron, Could you try the patch attached to https://issues.apache.org/jira/browse/HADOOP-4552 ? Thanks, Raghu.
Re: Could not obtain block error
If you have only one copy of the block and it is corrupted, the Namenode itself cannot correct it. Of course, the DFSClient should not print errors in an infinite loop. I think there was an old bug where the crc file got overwritten by a 0-length file. One workaround for you is to go to the datanode and remove the .crc file for this block (find /datanodedir -name blk_5994030096182059653\*). Be careful not to remove the block file itself. Longer-term fix: upgrade to a more recent version. Raghu. murali krishna wrote: Hi, When I try to read one of the files from dfs, I get the following error in an infinite loop (using 0.15.3): “08/10/28 23:43:15 INFO fs.DFSClient: Could not obtain block blk_5994030096182059653 from any node: java.io.IOException: No live nodes contain current block” Fsck showed that the file is HEALTHY but under-replicated (1 instead of the configured 2). I checked the datanode log where the only replica for that block exists, and I can see repeated errors while serving that block. 2008-10-22 23:55:39,378 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_5994030096182059653 to 68.142.212.228:50010 got java.io.EOFException at java.io.DataInputStream.readShort(DataInputStream.java:298) at org.apache.hadoop.dfs.DataNode$BlockSender.init(DataNode.java:1061) at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1446) at java.lang.Thread.run(Thread.java:619) Any idea what is going on and how I can fix this? Thanks, Murali