DataNode/TaskTracker memory constraints.
Hi. All Hadoop components are started with -Xmx1000M by default. I am planning to throw in some data/task nodes here and there in my architecture. However, most machines have only 4G of physical RAM, so allocating 2G plus overhead (~2.5G) to Hadoop is a little risky, since the boxes could very well become inaccessible if Hadoop has to compete with other processes for RAM. I have experienced this many times with Java processes going haywire when I run other services in parallel.

Anyway, I would like to understand the reasoning behind allocating 1G per process. I figure the DataNode could survive with a little less, as could the TaskTracker if the jobs running in it do not consume much memory. Of course each process would like even more than 1G, but if I need to cut down I would like to know which to cut and what I lose by doing so.

Any thoughts? Trial and error is of course an option, but I would like to hear the basic thinking about how memory should be allocated to get the most out of the boxes.

Kindly
//Marcus

--
Marcus Herou
CTO and co-founder, Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
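(Not an authoritative answer, just a sketch of where the knobs live, assuming a 0.18/0.19-style conf/hadoop-env.sh.) The 1000 MB figure is simply the stock HADOOP_HEAPSIZE. Roughly speaking, the DataNode's heap need grows with the number of blocks it stores, the TaskTracker itself is fairly modest, and the per-task child JVMs are sized separately through mapred.child.java.opts (default -Xmx200m), so they are not covered by these settings:

# conf/hadoop-env.sh -- illustrative values only, not a tested recommendation
# Lower the default heap for all Hadoop daemons started on this box (in MB).
export HADOOP_HEAPSIZE=512

# Optional per-daemon trims: these opts are appended after the global -Xmx,
# and on Sun's JVM the last -Xmx on the command line wins.
export HADOOP_DATANODE_OPTS="-Xmx384m $HADOOP_DATANODE_OPTS"
export HADOOP_TASKTRACKER_OPTS="-Xmx384m $HADOOP_TASKTRACKER_OPTS"

Whether something like 384-512 MB is actually enough depends on block count and job mix, so watch the daemons' GC behaviour after trimming.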
concurrent modification of input files
Hi. I have a scenario where, while a map/reduce job is working on a file, the input file may get deleted and replaced with a new version of the file. All my files are compressed, and hence each file is worked on by a single node. I tried simulating the scenario of deleting the file before the map/reduce was over, and map/reduce went ahead without complaining. Can I assume this will always be the case?

Many thanks in advance
Sandhya
Re: The error occurred when a lot of files created use fuse-dfs
Hey,

What version of Hadoop are you running? Have you taken a look at HADOOP-4775? https://issues.apache.org/jira/browse/HADOOP-4775

Basically, fuse-dfs is not usable on Hadoop 0.19.0 without a patch.

Brian

On Dec 15, 2008, at 12:24 AM, zhuweimin wrote:

Dear fuse-dfs users,

I copied 1000 files into Hadoop from local disk using fuse-dfs. The following errors appeared around the 600th file:

cp: cannot create regular file `/mnt/dfs/user/hadoop/fuse3/10m/10m_33.dat': Input/output error
cp: cannot create regular file `/mnt/dfs/user/hadoop/fuse3/10m/10m_34.dat': Input/output error
cp: cannot create regular file `/mnt/dfs/user/hadoop/fuse3/10m/10m_35.dat': Input/output error
cp: cannot create regular file `/mnt/dfs/user/hadoop/fuse3/10m/10m_36.dat': Input/output error
cp: cannot create regular file `/mnt/dfs/user/hadoop/fuse3/10m/10m_37.dat': Input/output error
cp: cannot create regular file `/mnt/dfs/user/hadoop/fuse3/10m/10m_38.dat': Input/output error
...

After that it is necessary to remount fuse-dfs. Do you have any thoughts about this error?

Thanks
Re: Suggestions of proper usage of key parameter ?
To expand a bit on Owen's remarks: it should be pointed out that in the case of a single MapReduce pass from an input dataset to an output dataset, the keys into the mapper and out of the reducer may not be particularly interesting to you. However, more complicated algorithms often involve multiple MapReduce jobs, where an input dataset has an MR pass over it, yielding an intermediate dataset which undergoes yet another MR pass, yielding a final dataset. In such cases, this provides continuity of interface, where every stage has both key and value components. Very often the dataset gets partitioned or organized in such a way that it makes sense to stamp a key onto values early in the process, and continue to let the keys flow through the system between passes. This interface makes that much more convenient. For example, you may have an input record which is joined against multiple other datasets. Each join may involve a separate MapReduce pass, but the primary key will be the same the whole time.

As for determining the input key types: the default TextInputFormat gives you a line offset and a line of text. This offset may not be particularly useful to your application. On the other hand, KeyValueTextInputFormat will read each line of text from the input file and split it into a key Text object and a value Text object based on the first tab character in the line. This matches the formatting of output files written by the default TextOutputFormat. Chained MR jobs should set the input format to this one explicitly, as Hadoop won't know that this is your intended use case. If your intermediate reducer outputs more complicated datatypes, you may want to use SequenceFileOutputFormat, which marshals your data types into a binary file format. SequenceFileInputFormat will then read the data back in and unmarshal it into the same data types you had encoded. (Your final pass may want to use TextOutputFormat to get everything back into a human-readable form.)

- Aaron

On Sun, Dec 14, 2008 at 11:39 PM, Owen O'Malley omal...@apache.org wrote:

On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:

Yes, I am referring to the key INPUT INTO the map() function and the key EMITTED FROM the reduce() function. Can someone explain why we need a key in these cases and what the proper use of it is?

It was a design choice. It could have been done as

  R1 -> map -> (K,V) -> reduce -> R2

instead of

  (K1,V1) -> map -> (K2,V2) -> reduce -> (K3,V3)

but since the input of the reduce is sorted on K2, the output of the reduce is also typically sorted and therefore keyed. Since jobs are often chained together, it makes sense to make the reduce output match the map input. Of course, everything you could do with the first option is possible with the second, using either K1 = R1 or V1 = R1. Having the keys is often convenient.

Who determines what the key should be? (the corresponding InputFormat implementation class?)

The InputFormat makes the choice.

In this case, what is the key in the map() call? (the name of the input file?)

TextInputFormat uses the byte offset as the key and the line as the value.

What if the reduce() function emits multiple key/value entries, or does not emit any entry at all? Is this considered OK?

Yes.

What if the reduce() function emits a key/value entry whose key is not the same as the input key parameter to the reduce() function? Is this OK?

Yes, although the reduce output is not re-sorted, so the results won't be sorted unless K3 is a subset of K2.

If there are two Map/Reduce cycles chained together, is the key input into the 2nd-round map() function determined by the key emitted from the 1st-round reduce() function?

Yes.

-- Owen
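To make the chaining point concrete, here is a minimal driver sketch against the 0.18-era org.apache.hadoop.mapred API (paths, job names, and the identity mapper/reducer stand-ins are just placeholders; a real job would plug in its own classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedPassesDriver {
  public static void main(String[] args) throws Exception {
    // Pass 1: tab-separated text in, (Text, Text) pairs out, stored as a
    // SequenceFile so the key/value types survive intact into the next pass.
    JobConf pass1 = new JobConf(ChainedPassesDriver.class);
    pass1.setJobName("pass-1");
    pass1.setMapperClass(IdentityMapper.class);     // stand-in for a real mapper
    pass1.setReducerClass(IdentityReducer.class);   // stand-in for a real reducer
    pass1.setOutputKeyClass(Text.class);
    pass1.setOutputValueClass(Text.class);
    pass1.setInputFormat(KeyValueTextInputFormat.class);
    pass1.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(pass1, new Path("input"));
    FileOutputFormat.setOutputPath(pass1, new Path("intermediate"));
    JobClient.runJob(pass1);

    // Pass 2: the InputFormat must match what pass 1 wrote, hence
    // SequenceFileInputFormat; the final pass switches back to
    // TextOutputFormat for human-readable results.
    JobConf pass2 = new JobConf(ChainedPassesDriver.class);
    pass2.setJobName("pass-2");
    pass2.setMapperClass(IdentityMapper.class);
    pass2.setReducerClass(IdentityReducer.class);
    pass2.setOutputKeyClass(Text.class);
    pass2.setOutputValueClass(Text.class);
    pass2.setInputFormat(SequenceFileInputFormat.class);
    pass2.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(pass2, new Path("intermediate"));
    FileOutputFormat.setOutputPath(pass2, new Path("final"));
    JobClient.runJob(pass2);
  }
}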
RE: Suggestions of proper usage of key parameter ?
Thanks for the detailed explanations in Aaron's and Owen's responses. One key takeaway is that the OutputFormat of a previous job needs to be compatible with the InputFormat of its subsequent job; otherwise the key/value demarcation will get screwed up.

It is still not very clear to me when to emit multiple entries versus a single entry in the reduce() function. Here is a typical case where a reduce() function counts the number of times a particular word appears in a particular document. There are multiple possible choices in how the reduce() function can be structured, and which one to choose is not clear to me ...

Choice 1: Emit one entry in the reduce(), using doc_name as key

# k2 is doc_name, v2_list is a list of [word, count]
reduce(k2, v2_list) {
  count_holder = Hash.new
  for v in v2_list {
    word = v[0]
    count = v[1]
    count_holder[word] += count
  }
  arr = Array.new
  for e in count_holder.entries {
    word = e.key
    count = e.value
    arr.add([word, count])
  }
  doc_name = k2
  emit(doc_name, arr)
}

In this case, the output will be a single entry per invocation of reduce() ...

{ key:file1, value:[[apple, 25], [orange, 16]] }

Choice 2: Emit multiple entries in the reduce(), using [doc_name, word] as key

reduce(k2, v2_list) {
  count_holder = Hash.new
  for v in v2_list {
    word = v[0]
    count = v[1]
    count_holder[word] += count
  }
  doc_name = k2
  for e in count_holder.entries {
    word = e.key
    count = e.value
    emit([doc_name, word], count)
  }
}

In this case, the output will have multiple entries per invocation of reduce() ...

{ key:[file1, apple], value:25 }
{ key:[file1, orange], value:16 }

Choice 3: Emit one entry in the reduce(), using a null key

reduce(k2, v2_list) {
  doc_name = k2
  count_holder = Hash.new
  for v in v2_list {
    word = v[0]
    count = v[1]
    count_holder[word] += count
  }
  arr = Array.new
  for e in count_holder.entries {
    word = e.key
    count = e.value
    arr.add([doc_name, word, count])
  }
  emit(null, arr)
}

In this case, the output will be a single entry per invocation of reduce() ...

{ key:null, value:[[file1, apple, 25], [file1, orange, 16]] }

Can you compare the above options and shed some light?

Rgds, Ricky

-----Original Message-----
From: Aaron Kimball [mailto:aa...@cloudera.com]
Sent: Monday, December 15, 2008 4:13 AM
To: core-user@hadoop.apache.org
Subject: Re: Suggestions of proper usage of key parameter ?
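For concreteness, Choice 2 translates roughly as follows against the 0.18 Java API. The class name, and the assumption that each incoming value is a Text of the form "word<TAB>count", are purely illustrative; how composite values are actually encoded is up to the job:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Choice 2 as a 0.18-era Reducer: one reduce() call per document, many
// emitted records. Assumes each incoming value is "word<TAB>count".
public class PerWordCountReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, IntWritable> {

  public void reduce(Text docName, Iterator<Text> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    while (values.hasNext()) {
      String[] parts = values.next().toString().split("\t");
      String word = parts[0];
      int count = Integer.parseInt(parts[1]);
      Integer prev = counts.get(word);
      counts.put(word, prev == null ? count : prev + count);
    }
    // One output record per (document, word) pair; the framework accepts
    // zero, one, or many collect() calls per reduce() invocation.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      output.collect(new Text(docName.toString() + "\t" + e.getKey()),
          new IntWritable(e.getValue()));
    }
  }
}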
Re: concurrent modification of input files
On Dec 15, 2008, at 8:08 AM, Sandhya E wrote:

I have a scenario where, while a map/reduce job is working on a file, the input file may get deleted and replaced with a new version of the file. All my files are compressed, and hence each file is worked on by a single node. I tried simulating the scenario of deleting the file before the map/reduce was over, and map/reduce went ahead without complaining. Can I assume this will always be the case?

No, the results will be completely non-deterministic. Don't do this. That said, the thing that will save you in micro-tests of this is that if the file is missing at some point, the task will fail and retry.

-- Owen
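One way to sidestep the race entirely, sketched below purely as an illustration (the directory layout and class name are assumptions, not anything from this thread): have producers always write new versions under fresh names into an "incoming" directory, and move the current files into a job-private staging directory with an HDFS rename before submitting the job, so nothing can replace them underneath the running tasks.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageInputs {
  // Move everything currently in 'incoming' into a job-private 'staging'
  // directory. rename() is a metadata-only operation in HDFS, so this is
  // cheap even for large files; the job then reads only from 'staging'.
  public static Path stage(Configuration conf, Path incoming, Path staging)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(staging);
    FileStatus[] files = fs.listStatus(incoming);
    if (files != null) {
      for (FileStatus f : files) {
        fs.rename(f.getPath(), new Path(staging, f.getPath().getName()));
      }
    }
    return staging;
  }
}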
NameNode fatal crash - 0.18.1
I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase RegionServer. At the time of this crash, the NameNode and SecondaryNameNode were both hosted on the same master. We do a nightly backup, and about 95% of the way through, HDFS crashed with...

NameNode shows:

2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:

Every single DataNode shows:

2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call failed on local exception
at org.apache.hadoop.ipc.Client.call(Client.java:718)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

This is virtually all of the information I have. At the same time as the backup, we have normal HBase traffic and our hourly batch MR jobs, so the slave nodes were pretty heavily loaded, but I don't see anything in the DN logs besides this "Call failed". There are no space issues or anything else. Ganglia shows high CPU load around this time, which has been typical every night, but I don't see any issues in the DNs or NN about expired leases, missed heartbeats, etc.

Is there a way to prevent this failure from happening in the first place? I guess just reduce total load across the cluster?

The second question is about how to recover once the NameNode does fail... When trying to bring HDFS back up, we get hundreds of:

2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths

And then:

2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed.

Is there a way to recover from this? At the time of this crash, we had the SecondaryNameNode on the same node. We are moving it to another node with sufficient memory now, but would that even prevent this kind of FS botching?

Also, my SecondaryNameNode is telling me it cannot connect when trying to do a checkpoint:

2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)

I changed my masters file to contain just the hostname of the SecondaryNameNode; this seems to have properly started the NameNode on the node where I launched ./bin/start-dfs.sh, and started the SecondaryNameNode on the correct node as well. But it seems to be unable to connect back to the primary. I have hadoop-site.xml pointing fs.default.name at the primary, but otherwise there are no links back. Where would I specify to the secondary where the primary is located?

We're also upgrading to Hadoop 0.19.0 at this time. Thank you for any help.

Jonathan Gray
Re: occasional createBlockException in Hadoop .18.1
Hi,

Some data points on this issue:

1) du runs for 20-30 secs.
2) After some time, I don't see any activity in the datanode logs.
3) I can't even jstack the datanode (forced it, got a DebuggerException; double-checked the pid), and datanode:50075/stacks takes forever to respond. I can telnet to datanode:50010.

I think the disk is bad or something. Please suggest some pointers to analyze this problem.

-Sagar

Sagar Naik wrote:

CLIENT EXCEPTION:

2008-12-14 08:41:46,919 [Thread-90] INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.50.80.133:54045 remote=/10.50.80.108:50010]
2008-12-14 08:41:46,919 [Thread-90] INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-7364265396616885025_5870078
2008-12-14 08:41:46,920 [Thread-90] INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.50.80.108:50010

DATANODE:

2008-12-14 08:40:39,215 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-7364265396616885025_5870078 src: /10.50.80.133:54045 dest: /10.50.80.133:50010
...

I occasionally see the datanode listed as a dead node. When the datanode is dead, I see the du forked from the datanode; the du is in D state. Any pointers to debug this would help me.

-Sagar
RE: NameNode fatal crash - 0.18.1
I have fixed the issue with the SecondaryNameNode not contacting the primary with the 'dfs.http.address' config option. Other issues still unsolved.

-----Original Message-----
From: Jonathan Gray [mailto:jl...@streamy.com]
Sent: Monday, December 15, 2008 10:55 AM
To: core-user@hadoop.apache.org
Subject: NameNode fatal crash - 0.18.1
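For anyone hitting the same checkpoint problem: the dfs.http.address fix boils down to an entry like the following in hadoop-site.xml on the SecondaryNameNode host (hostname and port are placeholders for your primary NameNode's HTTP address):

<property>
  <name>dfs.http.address</name>
  <value>namenode.example.com:50070</value>
  <description>Address of the primary NameNode's HTTP server; the
  SecondaryNameNode uses this to fetch the image and edits for
  checkpointing.</description>
</property>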
Re: 5 node Hadoop Cluster!!! Some Doubts...
Sid,

For such a small cluster, just put the JobTracker and NameNode on the same machine, and the TaskTrackers and DataNodes in pairs on the other machines. I can't think of anything else that would have an impact on performance for you.

J-D

On Thu, Dec 11, 2008 at 6:20 PM, Siddharth Malhotra sid86.malho...@gmail.com wrote:

Hey,

I am a student working on a project using Hadoop. I have successfully implemented the project on a single node and in Pseudo-Distributed Mode. I would now be implementing it on a 5-node cluster, but I wanted to know whether there is any specific way I should set up the TaskTracker, JobTracker, NameNode, and DataNode. For instance, should I set up the NameNode and JobTracker on the same node, or should I put them on different nodes? Similarly, can I also set up a DataNode and TaskTracker on the master node, or only on the slave nodes? Does any specific configuration help to improve performance?

Awaiting a reply soon.

Thanks
Sid

--
Siddharth Malhotra
Mobile: +41 76 275 3991
Mail: sid86.malho...@gmail.com
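Concretely (and just as a sketch; the hostnames below are placeholders), J-D's layout maps onto the standard conf files like this. Running bin/start-all.sh on the master then starts the NameNode and JobTracker locally, a DataNode/TaskTracker pair on every host listed in conf/slaves, and a SecondaryNameNode on whatever conf/masters names:

# conf/slaves on the master node -- the four worker machines
node1.example.com
node2.example.com
node3.example.com
node4.example.com

# conf/masters -- despite the name, this only controls where the
# SecondaryNameNode is started (here, one of the workers)
node1.example.com

All hosts should share the same hadoop-site.xml, with fs.default.name and mapred.job.tracker both pointing at the master.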