Re: HDFS is not loading evenly across all nodes.
Yes. If the machine where you issue the dfs -put command is running a datanode, the data will be kept on that machine; otherwise a random datanode is chosen to store the blocks.

On Fri, Jun 19, 2009 at 10:41 AM, Rajeev Gupta graj...@in.ibm.com wrote: "If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets." Does that mean that if the replication factor is 1, the whole file will be kept on one node only? Thanks and regards. -Rajeev Gupta

On 06/19/2009 01:56 AM, Aaron Kimball aa...@cloudera.com wrote: Did you run the dfs put commands from the master node? If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. For more balanced loading, you should use an off-cluster machine as the point of origin. If you do end up with uneven block distribution, you should also rebalance the cluster periodically by running bin/start-balancer.sh; it works in the background, moving blocks from heavily-laden nodes to underutilized ones. - Aaron

On Thu, Jun 18, 2009 at 12:57 PM, openresearch qiming...@openresearchinc.com wrote: Hi all, I dfs -put a large dataset onto a 10-node cluster. When I observe the Hadoop progress (via web:50070) and each local file system (via df -k), I notice that my master node is hit 5-10 times harder than the others, so its hard drive fills up much faster; during last night's load it actually crashed when the hard drive became full. To my understanding, data should be spread across all nodes evenly (in a round-robin fashion, using 64 MB blocks as the unit). Is this the expected behavior of Hadoop? Can anyone suggest a good way to troubleshoot it? Thanks
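For reference, a hedged sketch of the two remedies described above (hostnames and paths are made up):

    # Run the put from a client machine that is not a datanode, so blocks are
    # scattered instead of always landing on the local node first.
    bin/hadoop dfs -put /local/dataset /user/qiming/dataset

    # Rebalance existing blocks; -threshold is the allowed deviation (percent)
    # of each datanode's utilization from the cluster average.
    bin/start-balancer.sh -threshold 10

    # Stop the balancer later if needed:
    bin/stop-balancer.sh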
Re: input/output error while setting up superblock
I don't think HDFS is a good place to store your Xen image file, since the image will likely be updated frequently with small writes. Given the way HDFS is designed, you can't use it like a regular filesystem (e.g. one that supports frequent small appends and in-place updates to files). My suggestion is to keep the image on shared storage such as NAS or SAN instead. /Taeho

2009/5/22 신승엽 mikas...@naver.com: Hi, I have a problem using HDFS. I mounted HDFS using fuse-dfs. I created a dummy file for Xen in HDFS and then tried to format the dummy file using mke2fs, but the operation failed. The error message is as follows:

[r...@localhost hdfs]# mke2fs -j -F ./file_dumy
mke2fs 1.40.2 (12-Jul-2007)
./file_dumy: Input/output error while setting up superblock

I also copied a Xen image file into HDFS, but Xen could not use the image file there.

[r...@localhost hdfs]# fdisk -l fedora6_demo.img
last_lba(): I don't know how to handle files with mode 81a4
You must set cylinders. You can do this from the extra functions menu.
Disk fedora6_demo.img: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
    Device Boot      Start      End      Blocks   Id  System
fedora6_demo.img1   *              1      156     1253038+  83  Linux

Could you advise me on this problem? Thank you.
Reduce won't start until Map stage reaches 100%?
Dear All, With Hadoop 0.19.0, the Reduce stage does not start until the Map stage reaches 100% completion. Has anyone faced a similar situation?

...
- map 90% reduce 0%
- map 91% reduce 0%
- map 92% reduce 0%
- map 93% reduce 0%
- map 94% reduce 0%
- map 95% reduce 0%
- map 96% reduce 0%
- map 97% reduce 0%
- map 98% reduce 0%
- map 99% reduce 0%
- map 100% reduce 0%
- map 100% reduce 1%
- map 100% reduce 2%
- map 100% reduce 3%
- map 100% reduce 4%
- map 100% reduce 5%
- map 100% reduce 6%
- map 100% reduce 7%
- map 100% reduce 8%
- map 100% reduce 9%

Thank you all in advance, /Taeho
Re: Transferring data between different Hadoop clusters
Thanks for your prompt reply. When using the command ./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path - Should this command be run on cluster1? - What does port 50070 specify? Is it the one in fs.default.name, or dfs.http.address? /Taeho

On Mon, Feb 2, 2009 at 12:40 PM, Mark Chadwick mchadw...@invitemedia.com wrote: Taeho, The distcp command is perfect for this. If you're copying between two clusters running the same version of Hadoop, you can do something like: ./bin/hadoop distcp hdfs://cluster1/path hdfs://cluster2/path. If you're copying between 0.18 and 0.19, the command will look like: ./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path. Hope that helps, -Mark

On Sun, Feb 1, 2009 at 9:48 PM, Taeho Kang tka...@gmail.com wrote: Dear all, There have been times when I needed to transfer some big data from one Hadoop cluster to another running a different version (e.g. from a Hadoop 0.18 cluster to a 0.19 cluster). Other than copying the files from one cluster to a local file system and uploading them to the other, is there a tool that does this? Thanks in advance, Regards, /Taeho
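A hedged example of the cross-version case (hostnames and paths are made up): when the two clusters run different Hadoop versions, distcp is usually run on the destination cluster and reads the source over HFTP, and the port in the hftp:// URL is the source namenode's dfs.http.address (its web UI port, 50070 by default), not the RPC port from fs.default.name.

    bin/hadoop distcp \
        hftp://namenode18.example.com:50070/user/taeho/data \
        hdfs://namenode19.example.com:9000/user/taeho/data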
Datanode log for errors
Hi, I have encountered some IOExceptions in the Datanode while intermediate/temporary map-reduce data is written to HDFS:

2008-11-25 18:27:08,070 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_-460494523413678075 received exception java.io.IOException: Block blk_-460494523413678075 is valid, and cannot be written to.
2008-11-25 18:27:08,070 ERROR org.apache.hadoop.dfs.DataNode: 10.31.xx.xxx:50010:DataXceiver: java.io.IOException: Block blk_-460494523413678075 is valid, and cannot be written to.
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:616)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1995)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
        at java.lang.Thread.run(Thread.java:619)

It looks like one of the HDD partitions has a problem being written to, but the log doesn't show which partition. Is there a way to find out? (Or it could be a new feature for the next version...) Thanks in advance, /Taeho
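One hedged way to locate the partition holding the offending block (the dfs.data.dir paths below are made up; use the ones from your hadoop-site.xml):

    # Search every configured data directory for the block's files; the match
    # reveals which partition (i.e. which dfs.data.dir entry) the block lives on.
    for d in /data1/dfs/data /data2/dfs/data /data3/dfs/data; do
        find "$d" -name 'blk_-460494523413678075*' -print
    done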
Re: Question on opening file info from namenode in DFSClient
Hi, thanks for your reply Dhruba. One of my co-workers is writing a BigTable-like application that could be used for online, near-real-time services. Since the application could be hooked into online services, there would be times when a large number of users (e.g. 1000 users) request access to a few files in a very short time. In a batch-processing job this is a rare case, but for online services it's quite common. I think HBase developers would have run into similar issues as well. Is this enough explanation? Thanks in advance, Taeho

On Tue, Nov 4, 2008 at 3:12 AM, Dhruba Borthakur [EMAIL PROTECTED] wrote: In the current code, details about block locations of a file are cached on the client when the file is opened. This cache remains with the client until the file is closed. If the same file is re-opened by the same DFSClient, it re-contacts the namenode and refetches the block locations. This works OK for most map-reduce apps because it is rare that the same DFSClient re-opens the same file again. Can you please explain your use case? thanks, dhruba

On Sun, Nov 2, 2008 at 10:57 PM, Taeho Kang [EMAIL PROTECTED] wrote: Dear Hadoop Users and Developers, I was wondering if there's a plan to add a file info cache to DFSClient? It could eliminate the network round trips to the Namenode, and I think it would greatly improve DFSClient's performance. The code I was looking at was this:

--- DFSClient.java
  /**
   * Grab the open-file info from namenode
   */
  synchronized void openInfo() throws IOException {
    /* Maybe we could add a file info cache here! */
    LocatedBlocks newInfo = callGetBlockLocations(src, 0, prefetchSize);
    if (newInfo == null) {
      throw new IOException("Cannot open filename " + src);
    }
    if (locatedBlocks != null) {
      Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
      Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
      while (oldIter.hasNext() && newIter.hasNext()) {
        if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
          throw new IOException("Blocklist for " + src + " has changed!");
        }
      }
    }
    this.locatedBlocks = newInfo;
    this.currentNode = null;
  }
---

Does anybody have an opinion on this matter? Thank you in advance, Taeho
Question on opening file info from namenode in DFSClient
Dear Hadoop Users and Developers, I was wondering if there's a plan to add a file info cache to DFSClient? It could eliminate the network round trips to the Namenode, and I think it would greatly improve DFSClient's performance. The code I was looking at was this:

--- DFSClient.java
  /**
   * Grab the open-file info from namenode
   */
  synchronized void openInfo() throws IOException {
    /* Maybe we could add a file info cache here! */
    LocatedBlocks newInfo = callGetBlockLocations(src, 0, prefetchSize);
    if (newInfo == null) {
      throw new IOException("Cannot open filename " + src);
    }
    if (locatedBlocks != null) {
      Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
      Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
      while (oldIter.hasNext() && newIter.hasNext()) {
        if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
          throw new IOException("Blocklist for " + src + " has changed!");
        }
      }
    }
    this.locatedBlocks = newInfo;
    this.currentNode = null;
  }
---

Does anybody have an opinion on this matter? Thank you in advance, Taeho
Re: Add new data directory during runtime
Since the configuration file is loaded when a datanode starts up, it's not possible to have a change to dfs.data.dir applied at runtime. Please let me know if I'm wrong.

On Fri, Oct 17, 2008 at 10:08 AM, Jinyeon Lee [EMAIL PROTECTED] wrote: Is it possible to add more data directories by changing the configuration property dfs.data.dir during runtime? Regards, Lee, Jin Yeon
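A hedged sketch of the restart-based alternative (directory names are made up): add the new directory to dfs.data.dir on that datanode and bounce only the datanode daemon; blocks already stored in the old directories stay where they are.

    # On the datanode, edit conf/hadoop-site.xml so dfs.data.dir lists the new
    # directory too, e.g. /data1/dfs/data,/data2/dfs/data, then restart just that daemon:
    bin/hadoop-daemon.sh stop datanode
    bin/hadoop-daemon.sh start datanode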
Re: dual core configuration
First of all, mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are both set to 2 in the hadoop-default.xml file; this file is read before hadoop-site.xml, so any property not set in hadoop-site.xml falls back to the value in hadoop-default.xml.

As for the question of why only one core is utilized... I think it really depends on the process scheduling of the underlying OS. It's not as if two tasks (two JVM subprocesses spawned by the tasktracker) will always run on separate cores, since other processes also need CPU time. By the way, what tools did you use to find out which tasks (or processes) use which cores? /Taeho

On Wed, Oct 8, 2008 at 1:01 PM, Alex Loddengaard [EMAIL PROTECTED] wrote: Taeho, I was going to suggest this change as well, but it's documented that mapred.tasktracker.map.tasks.maximum defaults to 2. Can you explain why Elia is only having one core utilized when this config option is set to 2? Here is the documentation I'm referring to: http://hadoop.apache.org/core/docs/r0.18.1/cluster_setup.html Alex

On Tue, Oct 7, 2008 at 8:27 PM, Taeho Kang [EMAIL PROTECTED] wrote: You can have your node (tasktracker) run more than one task simultaneously. Set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties in the hadoop-site.xml file. You should change hadoop-site.xml on all your slave nodes according to how many cores each slave has; for example, you don't really want 8 tasks running at once on a 2-core machine. /Taeho

On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi [EMAIL PROTECTED] wrote: Hello, I have some dual-core nodes, and I've noticed Hadoop is only running one task instance, so it is only using one of the CPUs on each node. Is there a configuration to tell it to run more than one, or do I need to turn each machine into two nodes? Thanks.
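On the tooling question, one hedged, Linux-specific way to spot-check which core each task JVM is on (standard procps fields, nothing Hadoop-specific):

    # PSR is the processor each process last ran on; check whether the task JVMs
    # are actually spread across both cores.
    ps -eo pid,psr,pcpu,comm | grep java

    # Or watch per-core utilization interactively: run top, then press '1'.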
Re: dual core configuration
You can have your node (tasktracker) run more than one task simultaneously. Set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties in the hadoop-site.xml file. You should change hadoop-site.xml on all your slave nodes according to how many cores each slave has; for example, you don't really want 8 tasks running at once on a 2-core machine. /Taeho

On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi [EMAIL PROTECTED] wrote: Hello, I have some dual-core nodes, and I've noticed Hadoop is only running one task instance, so it is only using one of the CPUs on each node. Is there a configuration to tell it to run more than one, or do I need to turn each machine into two nodes? Thanks.
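A hedged sketch of the settings mentioned above (the slot counts are just an example for a 2-core box); the fragment must end up inside the <configuration> element of hadoop-site.xml on every slave, and the tasktrackers need a restart to pick it up.

    # Write the fragment, then merge it by hand into the <configuration> section
    # of conf/hadoop-site.xml on each slave node.
    cat > /tmp/task-slots.xml <<'EOF'
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    EOF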
Re: Add jar file via -libjars - giving errors
Adding your jar files to the $HADOOP_HOME/lib folder works, but you have to restart all your tasktrackers for the jars to be loaded. Repackaging your map-reduce jar (e.g. hadoop-0.18.0-examples.jar) together with the extra jar and running your job with the newly repackaged jar would work too.

On Tue, Oct 7, 2008 at 6:55 AM, Tarandeep Singh [EMAIL PROTECTED] wrote: Thanks Mahadev for the reply. So that means I have to copy my jar file into the $HADOOP_HOME/lib folder on all slave machines like before. One more question: I am adding a conf file (just like hadoop-site.xml) via the -conf option and I am able to query its parameters in my mappers/reducers. But is there a way I can query the parameters in my job driver class?

public class jobDriver extends Configured {
  someMethod() {
    ToolRunner.run(new MyJob(), commandLineArgs);
    // I want to query parameters present in my conf file here
  }
}

public class MyJob extends Configured implements Tool {
  ...
}

Thanks, Taran

On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED] wrote: Hi Tarandeep, the -libjars option does not add the jar on the client side. There is an open JIRA for that (I don't remember which one). You have to add the jar to HADOOP_CLASSPATH on the client side so that it gets picked up there as well. mahadev

On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by my mappers and reducers) to the classpath. Initially I copied the jar file into the $HADOOP_HOME/lib directory on all the slave nodes and it was working fine. However, when I tried the -libjars option to add the jar file - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar - I got this error: java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder. Can someone please tell me what needs to be fixed here? Thanks, Taran
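A hedged sketch combining both pieces of advice (the jdom.jar path is made up; the job command mirrors the one in the thread): put the jar on the client-side classpath for the driver JVM, and pass it with -libjars so it is shipped to the map/reduce tasks.

    # Make the jar visible to the client-side JVM (the job driver); this can also
    # be set in conf/hadoop-env.sh instead of the shell.
    export HADOOP_CLASSPATH=/home/taran/lib/jdom.jar

    # Ship the same jar to the tasks with -libjars.
    $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars /home/taran/lib/jdom.jar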
Re: nagios to monitor hadoop datanodes!
The easiest approach I can think of is to write a simple Nagios plugin that checks whether the datanode JVM process is alive. Or you could write a Nagios plugin that checks for error or warning messages in the datanode logs (I'm sure you can find quite a few log-checking Nagios plugins on nagiosplugin.org). If you are unsure how to write a Nagios plugin, I suggest reading "Leverage Nagios with plug-ins you write" (http://www.ibm.com/developerworks/aix/library/au-nagios/), which has good explanations and examples. Or, if you've got time to burn, you might read the Nagios documentation too. Let me know if you need help on this. /Taeho

On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez [EMAIL PROTECTED] wrote: Hi Everyone! I would like to implement Nagios health monitoring for a Hadoop grid. Do any of you have experience here - is there an approach or advice I could use? So far I've only been playing with the JSP pages that Hadoop has built in, so I'm not sure whether it's a good idea to have Nagios request info from those JSPs. Thanks in advance! -- Gerardo
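A minimal sketch of the first suggestion (a process-liveness check); the exit codes follow the Nagios plugin convention (0 = OK, 2 = CRITICAL), and jps is assumed to be on the PATH of the monitored host.

    #!/bin/sh
    # check_hadoop_datanode.sh - report CRITICAL if no DataNode JVM is running.
    if jps 2>/dev/null | grep -q DataNode; then
        echo "OK - DataNode process is running"
        exit 0
    else
        echo "CRITICAL - DataNode process not found"
        exit 2
    fi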
Questions on dfs.datanode.du.reserved
Dear All, I have a few questions about the dfs.datanode.du.reserved property in hadoop-site.xml. Assume I have dfs.datanode.du.reserved = 10GB and the partition assigned to HDFS has already been filled to its capacity (in this case, total disk size minus 10GB). What happens if I change dfs.datanode.du.reserved to something greater than 10GB, like 20GB? Will HDFS remove or move blocks to meet that setting? Also, is it possible to set dfs.datanode.du.reserved separately for each partition (e.g. reserve 30GB for the /data1 partition and 100GB for the /data2 partition)? Many thanks, Taeho
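For reference, a hedged sketch of how the property is set (the value is in bytes, so 20 GB is written out in full). As far as I know, the reservation applies to every dfs.data.dir volume alike rather than per partition, and raising it does not delete or move existing blocks; it only stops new blocks from being allocated to volumes that no longer have room.

    # Merge into the <configuration> section of conf/hadoop-site.xml on the
    # datanodes, then restart them.
    cat > /tmp/du-reserved.xml <<'EOF'
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>21474836480</value>   <!-- 20 GB, in bytes -->
    </property>
    EOF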
Re: How to order all the output file if I use more than one reduce node?
You may want to write a partitioner that partitions the map output in a way that matches your definition of sorted data (e.g. all keys in part-00001 are greater than those in part-00000). Once you've done that, simply concatenating the reduce outputs from 0 to N gives you a single sorted result file.

On Thu, Aug 7, 2008 at 10:26 AM, Kevin [EMAIL PROTECTED] wrote: I suppose you meant to sort the result globally across files. AFAIK this is not currently supported unless you have only one reducer. It is said that version 0.19 will introduce such a capability. -Kevin

On Wed, Aug 6, 2008 at 6:01 PM, Xing [EMAIL PROTECTED] wrote: If I use one node for reduce, Hadoop can sort the result. If I use 30 nodes for reduce, the result is part-00000 ~ part-00029. How can I make all 30 parts sorted globally, so that all the keys in part-00001 are greater than those in part-00000? Thanks a lot, Xing
Re: Are lines broken in dfs and/or in InputSplit
I guess a quick way to find an answer to your question is to look at the sizes of the block files stored on the datanodes. If they are all the same (e.g. 64MB), then you can conclude that lines are NOT preserved at the block level - DFS simply cuts the original file into exact 64MB pieces. They are indeed almost all the same, by the way, except for a few blocks that represent files smaller than 64MB, or the last block of a file. /Taeho

On Thu, Aug 7, 2008 at 9:23 AM, Kevin [EMAIL PROTECTED] wrote: Hi, I guess this thread is old, but I eventually need to raise the question again as I am more into DFS now. Would a line be broken between adjacent blocks in DFS? Can lines be preserved at the block level? -Kevin

On Wed, Jul 16, 2008 at 4:57 PM, Chris Douglas [EMAIL PROTECTED] wrote: InputFormats don't have a concept of blocks; each FileSplit contains a list of locations that advise the framework where it should prefer to schedule the map (i.e. on the node that contains most of the data; in practice, IIRC, this is the location of the first byte of the block, which may not actually contain the bulk of the data). For LineRecordReader, this means that it will open a stream, seek to its start position, read (opening a connection to the node that contains that block, with luck a local read) to the first record delimiter, then return lines as Text records to the map until the end of that split precedes the start offset at the beginning of a read (i.e. the end of split A and the start of split B will likely fall in the middle of a record, so A will emit that record and B will start from the end of it). I think it's fair to say that blocks and records are orthogonal abstractions in HDFS and map/reduce. -C

On Jul 15, 2008, at 5:07 PM, Kevin wrote: Hi, I was trying to parse text input with line-based information in my mapper and this problem became an issue. I wonder if lines are preserved or broken when a file is cut into blocks by DFS. Also, it looks like although TextInputFormat breaks a file into line records, the InputSplit passed to the InputFormat may not preserve lines. If this is the case, is it possible to restore the lines for mapper input, or do I have to drop broken lines? Thank you. Best, -Kevin
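A hedged way to do the quick check suggested above (the dfs.data.dir path is made up): list block file sizes on a datanode and see how many are exactly the block size.

    # Most block files will be exactly 67108864 bytes (64 MB), i.e. files are cut
    # purely by byte offset, without regard to line boundaries.
    ls -l /data1/dfs/data/current/blk_* | grep -v '\.meta' \
        | awk '{print $5}' | sort -n | uniq -c | tail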
Re: java.io.IOException: Cannot allocate memory
Are you using Hadoop Streaming? If so, a subprocess created by a streaming job can take as much memory as it wants. In that case the system can run out of memory, and other processes (e.g. the TaskTracker) may not be able to run properly, or may even be killed by the OS. /Taeho

On Fri, Aug 1, 2008 at 2:24 AM, Xavier Stevens [EMAIL PROTECTED] wrote: We're currently running jobs on machines with around 16GB of memory with 8 map tasks per machine. We used to run with max heap set to 2048m. Since we started using version 0.17.1 we've been getting a lot of these errors:

task_200807251330_0042_m_000146_0: Caused by: java.io.IOException: java.io.IOException: Cannot allocate memory
task_200807251330_0042_m_000146_0:     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
task_200807251330_0042_m_000146_0:     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
task_200807251330_0042_m_000146_0:     at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.util.Shell.run(Shell.java:134)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

We haven't changed our heap sizes at all. Has anyone else experienced this? Is there a way around it other than reducing heap sizes excessively low? I've tried all the way down to 1024m max heap and I still get this error. -Xavier
Re: Name node heap space problem
Check how much memory is allocated for the JVM running the namenode. In the file HADOOP_INSTALL/conf/hadoop-env.sh, change the line that starts with "export HADOOP_HEAPSIZE=1000" - it's set to 1GB by default.

On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer [EMAIL PROTECTED] wrote: Update on this one... I put some more memory into the machine running the name node. Now fsck is running; unfortunately ls fails with a time-out. I identified one directory that causes the trouble: I can run fsck on it but not ls. What could be the problem? Gert

Gert Pfeifer schrieb: Hi, I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Heap Size max is 3.47 GB. My problem is that I produce many small files, so I have a cron job which runs daily across the new files, copies them into bigger files, and deletes the small ones. Apart from this program, even an fsck kills the cluster. The problem is that, as soon as I start this program, the heap usage of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17. Here is some additional data about the DFS: Capacity: 2 TB, DFS Remaining: 1.19 TB, DFS Used: 719.35 GB, DFS Used%: 35.16%. Thanks for hints, Gert
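A hedged sketch of the change suggested above (the 4000 MB figure is just an example; pick a value that fits the machine's RAM). HADOOP_HEAPSIZE is given in megabytes and takes effect only after the namenode is restarted.

    # In HADOOP_INSTALL/conf/hadoop-env.sh on the namenode machine:
    export HADOOP_HEAPSIZE=4000

    # Then restart the namenode so the new heap size is used:
    bin/hadoop-daemon.sh stop namenode
    bin/hadoop-daemon.sh start namenode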
Re: more than one reducer?
I don't know whether there is any built-in mechanism for what you're looking for. However, you could write a partitioner that distributes data so that lower keys go to lower-numbered reducers and higher keys go to higher-numbered reducers (e.g. keys starting with 'A'~'D' go to part-00000, 'E'~'H' to part-00001, and so on). If you know beforehand how the keys are distributed, you can also spread the data quite evenly across the reducers. When the job is done, simply download the result files and concatenate them in order, and you have a sorted output.

On Tue, Jul 22, 2008 at 9:08 AM, Mori Bellamy [EMAIL PROTECTED] wrote: Hey all, I was wondering if it's possible to split up the reduce task amongst more than one machine. I figured it might be possible for the map output to be copied to multiple machines; then each reducer could sort its keys and then combine them into one big sorted output (a la mergesort). Does anybody know if there is an in-place mechanism for this?
Re: Timeouts when running balancer
Setting dfs.balance.bandwidthPerSec to 1GB/sec means each datanode is allowed to use up to 1GB/sec for block balancing. That seems too high - even gigabit ethernet can't move that much data per second. When you get timeouts, it probably means your network is saturated; maybe a big map-reduce job that required lots of data transfer among the nodes was running at the time. Try setting it to 10~30MB/sec and see what happens.

On Sat, Jul 19, 2008 at 1:56 AM, David J. O'Dell [EMAIL PROTECTED] wrote: I'm trying to rebalance my cluster as I've added two more nodes. When I run the balancer with the default threshold I am seeing timeouts in the logs:

2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Decided to move block -8432927406854991437 with a length of 128 MB bytes from 10.11.6.234:50010 to 10.11.6.235:50010 using proxy source 10.11.6.234:50010
2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Starting Block mover for -8432927406854991437 from 10.11.6.234:50010 to 10.11.6.235:50010
2008-07-18 09:52:46,826 WARN org.apache.hadoop.dfs.Balancer: Timeout moving block -8432927406854991437 from 10.11.6.234:50010 to 10.11.6.235:50010 through 10.11.6.234:50010

I read in the balancer guide (http://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2) that the default transfer rate is 1MB/sec. I tried increasing this to 1GB/sec but I'm still seeing the timeouts. All of the nodes have gigE NICs and are on the same switch. -- David O'Dell, Director, Operations, e: [EMAIL PROTECTED], t: (415) 738-5152, 180 Townsend St., Third Floor, San Francisco, CA 94107
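A hedged fragment for the suggested 10~30 MB/sec range (here roughly 20 MB/sec; the value is in bytes per second). It belongs inside the <configuration> element of hadoop-site.xml on the datanodes, which then need a restart before re-running the balancer.

    cat > /tmp/balancer-bandwidth.xml <<'EOF'
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>20971520</value>   <!-- ~20 MB/sec per datanode -->
    </property>
    EOF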
Re: newbie in streaming: How to execute a single executable
1. You will have to modify your C++ binary (or any other binary) so that it reads its input from stdin and writes its output to stdout. 2. If you run your job as a mapper-only job, you'll get as many result files as the number of mappers created.

On Fri, Jul 11, 2008 at 4:14 AM, Charan Thota [EMAIL PROTECTED] wrote: Hi, I'm a newbie to streaming in Hadoop. I want to know how to run a single C++ executable. Should it be a mapper-only job? The executable clusters a set of points present in a file, so it cannot really be said to be a mapper or a reducer. Also, there is no source code available, only the executable. Please tell me how to run this on Hadoop. Is there any other way (apart from streaming) to do this? Thank you, Charan T.
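A hedged sketch of a mapper-only streaming run (all paths, including the streaming jar version, are made up); it assumes the binary has been adapted to read points on stdin and write results on stdout, as described in point 1.

    bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar \
        -input  /user/charan/points \
        -output /user/charan/clusters \
        -mapper ./cluster_points \
        -file   cluster_points \
        -jobconf mapred.reduce.tasks=0   # mapper-only: map output goes straight to HDFS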
MapReduce with multi-languages
Dear Hadoop User Group, What are elegant ways to run mapred jobs on text data encoded in something other than UTF-8? It looks like Hadoop assumes text data is always UTF-8 and handles it that way - encoding with UTF-8 and decoding with UTF-8 - and whenever the data is not UTF-8 encoded, problems arise. Here is what I'm thinking of doing to clear up the situation; correct and advise me if my approaches look bad! (1) Re-encode the original data in UTF-8? (2) Replace the parts of the source code where the UTF-8 encoder and decoder are used? Or has anyone of you had trouble running a map-red job on data in multiple languages? Any suggestions/advice are welcome and appreciated! Regards, Taeho
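A hedged example of approach (1) (the file names and the EUC-KR source encoding are assumptions): convert the data to UTF-8 before loading it into HDFS, so the stock Text/TextInputFormat handling works unchanged.

    # Re-encode locally, then upload the UTF-8 copy.
    iconv -f EUC-KR -t UTF-8 corpus.euckr.txt > corpus.utf8.txt
    bin/hadoop dfs -put corpus.utf8.txt /user/taeho/input/corpus.utf8.txt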
Re: Inconsistency in namenode's and datanode's namespaceID
No, I don't think it's a bug. Your datanode's data partition/directory was probably used in another HDFS setup and therefore carried that setup's namespaceID. Alternatively, you could have used a different partition/directory for your new HDFS setup by setting a different value for dfs.data.dir on your datanode - but in that case you can't access your old HDFS's data.

On Thu, Jul 3, 2008 at 4:21 AM, Xuan Dzung Doan [EMAIL PROTECTED] wrote: I was following the quickstart guide to run pseudo-distributed operation with Hadoop 0.16.4. I got it to work successfully the first time, but I failed to repeat the steps (I tried to re-do everything from re-formatting the HDFS). By looking at the daemons' log files, I found that the datanode failed to start because its namespaceID didn't match the namenode's. I then found that the namespaceID is stored in the text file VERSION, under dfs/data/current for the datanode and dfs/name/current for the namenode. The reformatting step changes the namespaceID of the namenode but not of the datanode, and that's the cause of the inconsistency. So after reformatting, if I manually update the datanode's namespaceID, things work fine again. I guess others have probably had this same experience. Is it a bug in Hadoop 0.16.4? If so, has it been taken care of in later versions? Thanks, David.
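A hedged sketch of the manual fix described above (paths assume the default quickstart layout under hadoop.tmp.dir; adjust to your dfs.name.dir/dfs.data.dir): copy the namenode's namespaceID into each datanode's VERSION file while the datanode is stopped, then restart it.

    # On the namenode - note the current namespaceID:
    grep namespaceID /tmp/hadoop-${USER}/dfs/name/current/VERSION

    # On each datanode - set the same namespaceID in its VERSION file,
    # e.g. namespaceID=123456789 (the value printed above):
    vi /tmp/hadoop-${USER}/dfs/data/current/VERSION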
Question on HadoopStreaming and Memory Usage
Dear All, I've got a question about Hadoop Streaming and its memory management. Does Hadoop Streaming have a mechanism to prevent over-use of memory by its subprocesses (the map or reduce commands)? Say a binary used in the reduce phase allocates so much memory that it starves other important processes such as the Datanode or TaskTracker. Does Hadoop Streaming prevent such cases? Thank you in advance, Taeho
Re: Questions on how to use DistributedCache
Thank you for your clarification! One more question: the API doc says "DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications." My question is: is it also possible to distribute binary files (to be executed on the slave nodes during a MapReduce job)? P.S. I have tried it and it hasn't been successful - is this normal? /Taeho

On Thu, May 22, 2008 at 7:15 PM, Devaraj Das [EMAIL PROTECTED] wrote:
> From: Taeho Kang, Thursday, May 22, 2008 3:41 PM: Thanks for your reply. Just one more thing to ask: from what I see in the source code, it looks like the files/jars registered with DistributedCache get uploaded to DFS and then downloaded to the slave nodes. Is there a way I can specify the path on the slave nodes where the files/jars get downloaded to? /Taeho
No, that is not possible. They get localized to specific directories (as per mapred.local.dir). The files are optionally symlinked into the current working directory of the task.

On Thu, May 22, 2008 at 4:20 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
> On May 21, 2008, at 10:45 PM, Taeho Kang wrote: Dear all, I am trying to use the DistributedCache class to distribute files required for running my jobs. While the API documentation provides good guidelines, are there any tips or usage examples (e.g. sample code)?
See http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0 Arun
> If you could share your experience with me, I would really appreciate it. Thank you in advance, /Taeho
Questions on how to use DistributedCache
Dear all, I am trying to use the DistributedCache class to distribute files required for running my jobs. While the API documentation provides good guidelines, are there any tips or usage examples (e.g. sample code)? If you could share your experience with me, I would really appreciate it. Thank you in advance, /Taeho
Re: Trash option in hadoop-site.xml configuration.
Thank you for the clarification. Here is another question: if two different clients order a move to trash with different intervals (e.g. client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval = 120), what happens? Does the namenode keep track of all this? /Taeho

On 3/20/08, Dhruba Borthakur [EMAIL PROTECTED] wrote: The trash feature is a client-side option and depends on the client configuration file. If the client's configuration specifies that trash is enabled, then the HDFS client invokes a rename to Trash instead of a delete. Then, if trash is enabled on the Namenode, the Namenode periodically removes contents from the Trash directory. This design might be confusing to some users, but it provides the flexibility that different clients in the cluster can have trash either enabled or disabled. Thanks, dhruba

On Wednesday, March 19, 2008 3:13 AM, Taeho Kang wrote: Hello, I have two machines that act as clients to HDFS. Node #1 has the trash option enabled (fs.trash.interval set to 60) and Node #2 has it off (fs.trash.interval set to 0). When I order file deletion from Node #2, the file gets deleted right away, while the file gets moved to trash when I do the same from Node #1. This was a bit of a surprise to me, because I thought the trash option I set in the master node's config file applied to everyone who connects to / uses the HDFS. Was there a reason why the trash option was implemented this way? Thank you in advance, /Taeho
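A hedged fragment showing where the client-side setting lives (the 60-minute value comes from this thread; as far as I know the interval is given in minutes). It goes into the hadoop-site.xml used by the client issuing the delete, and a corresponding non-zero value on the namenode is what drives the periodic emptying of the Trash directory.

    cat > /tmp/trash.xml <<'EOF'
    <property>
      <name>fs.trash.interval</name>
      <value>60</value>   <!-- minutes; 0 disables the trash feature -->
    </property>
    EOF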
Trash option in hadoop-site.xml configuration.
Hello, I have two machines that act as clients to HDFS. Node #1 has the trash option enabled (fs.trash.interval set to 60) and Node #2 has it off (fs.trash.interval set to 0). When I order file deletion from Node #2, the file gets deleted right away, while the file gets moved to trash when I do the same from Node #1. This was a bit of a surprise to me, because I thought the trash option I set in the master node's config file applied to everyone who connects to / uses the HDFS. Was there a reason why the trash option was implemented this way? Thank you in advance, /Taeho
Upgrade Hadoop from 0.12 to 0.16 - don't do it!!
Hello all, I wanted to share my experience with those of you who want to upgrade Hadoop from 0.12 (or earlier) to a more recent version like 0.16. After installing 0.16 and running start-dfs.sh, the Namenode gave me an exception saying I had to use the -upgrade option. I gave the -upgrade option and the Namenode and Datanodes came up all right. But when I tried finalizing the upgrade, it didn't work, nor did the -rollback option. From there on, the only way I could get the cluster up and running was to keep using the -upgrade option. So here is my advice: move to 0.13 first and then do the upgrade from there. Following the steps found in the wiki, I was able to upgrade from 0.12 to 0.13 without trouble. I hope it won't be too painful upgrading from 0.13 to 0.14 or beyond, using the -upgrade / -rollback / -finalize options :-) Also, if anybody wants to share any good or painful experiences, I would really appreciate it! /Taeho
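For reference, a hedged sketch of the usual upgrade sequence the wiki describes (the dfsadmin subcommands are the standard ones; whether finalize succeeds is, as the experience above shows, another matter):

    # Bring HDFS up with the new binaries in upgrade mode:
    bin/start-dfs.sh -upgrade

    # Watch until the upgrade is reported as complete:
    bin/hadoop dfsadmin -upgradeProgress status

    # Once satisfied with the new version, make it permanent
    # (after this, -rollback is no longer possible):
    bin/hadoop dfsadmin -finalizeUpgrade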