Re: Hadoop 0.20 - Class Not Found exception
Amandeep Khurana wrote: I'm getting the following error while starting a MR job:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
... 21 more
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
... 21 more

Interestingly, the relevant jar is bundled into the MR job jar, and it's also in the $HADOOP_HOME/lib directory. Exactly the same thing worked with 0.19. Not sure what changed, or what I broke, to cause this error...

Could be classloader hierarchy; the JDBC driver needs to be loaded at the right level. Try preheating the driver by loading it in your own code -then jdbc: URLs might work- and take it out of the MR job JAR.
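A minimal sketch of that preheating -forcing the driver class to load (and register itself with DriverManager) in your own setup code, before DBInputFormat asks for a connection:

    try {
        // JDBC drivers self-register in their static initializer
        Class.forName("oracle.jdbc.driver.OracleDriver");
    } catch (ClassNotFoundException e) {
        throw new RuntimeException("Oracle JDBC driver not on the classpath", e);
    }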
Re: FYI, Large-scale graph computing at Google
Edward J. Yoon wrote: I just made a wiki page -- http://wiki.apache.org/hadoop/Hambrug -- Let's discuss the graph computing framework named Hambrug.

OK, first question: why "Hambrug"? To me that's just Hamburg typed wrong, which is going to cause lots of confusion. What about something more graphy, like "descartes"?
Re: FYI, Large-scale graph computing at Google
Patterson, Josh wrote: Steve, I'm a little lost here; is this a replacement for M/R, or is it some new code that sits on top of M/R and runs an iteration over some sort of graph's vertices? My quick scan of Google's article didn't seem to yield a distinction. Either way, I'd say that for our data, a graph processing lib for M/R would be interesting.

I'm thinking of graph algorithms that get implemented as MR jobs, working with HDFS, HBase, etc.
Re: FYI, Large-scale graph computing at Google
mike anderson wrote: This would be really useful for my current projects. I'd be more than happy to help out if needed.

Well, the first bit of code to play with is this: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/extras/citerank/ The standalone.xml file is the one you want to build and run with; the other would require you to check out and build two levels up, but gives you the ability to bring up local or remote clusters to test. Call run-local to run it locally, which should give you some stats like this:

[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Counters: 11
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: File Systems
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Local bytes read=209445683448
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Local bytes written=173943642259
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map-Reduce Framework
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Reduce input groups=9985124
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Combine output records=34
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map input records=24383448
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Reduce output records=16494967
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map output bytes=1243216870
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map input bytes=1528854187
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Combine input records=4528655
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map output records=41958636
[java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Reduce input records=37430015
== Exiting project "citerank" ==
BUILD SUCCESSFUL - at 25/06/09 17:09
Total time: 9 minutes 1 second

-- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: FYI, Large-scale graph computing at Google
Edward J. Yoon wrote: What do you think about another new computation framework on HDFS? On Mon, Jun 22, 2009 at 3:50 PM, Edward J. Yoon wrote: http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html -- It sounds like Pregel is a computing framework based on dynamic programming for graph operations. I guess maybe they removed the file communications/intermediate files during iterations. Anyway, what do you think?

I have a colleague (Paolo) who would be interested in adding a set of graph algorithms on top of the MR engine.
Re: Hadoop 0.20.0, xml parsing related error
Ram Kulbak wrote: Hi, the exception is a result of having Xerces in the classpath. To resolve, make sure you are using Java 6 and set the following system property: -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl This could also be resolved in the Configuration class (line 1045) by making sure it loads the DocumentBuilderFactory bundled with the JVM and not a 'random' classpath-dependent factory. Hope this helps, Ram

Lovely -I've noted this in the comments of the bug report.
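If you would rather not depend on the command line, the same property can be set programmatically -it just has to happen before the first DocumentBuilderFactory.newInstance() call:

    System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
        "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");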
Re: "Too many open files" error, which gets resolved after some time
Stas Oskin wrote: Hi. So what would be the recommended approach for the pre-0.20.x series? To ensure each file is used only by one thread, and then it is safe to close the handle in that thread? Regards.

Good question -I'm not sure. For anything you get with FileSystem.get(), it's now dangerous to call close(), so try just setting the reference to null and hoping that GC will do the finalize() when needed.
Re: "Too many open files" error, which gets resolved after some time
Raghu Angadi wrote: Is this before 0.20.0? Assuming you have closed these streams, it is mostly https://issues.apache.org/jira/browse/HADOOP-4346 It is the JDK internal implementation that depends on GC to free up its cache of selectors. HADOOP-4346 avoids this by using hadoop's own cache. yes, and it's that change that led to my stack traces :( http://jira.smartfrog.org/jira/browse/SFOS-1208
Re: "Too many open files" error, which gets resolved after some time
Scott Carey wrote: Furthermore, if for some reason it is required to dispose of any objects after others are GC'd, weak references and a weak reference queue will perform significantly better in throughput and latency - orders of magnitude better - than finalizers.

Good point. It would make sense for the FileSystem cache to be weakly referenced, so that on long-lived processes the client references get cleaned up without waiting for app termination.
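A sketch of that idea (not Hadoop's actual cache, just the weak-reference pattern applied to it):

    import java.lang.ref.WeakReference;
    import java.net.URI;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.fs.FileSystem;

    public class WeakFsCache {
        private final ConcurrentHashMap<URI, WeakReference<FileSystem>> cache =
                new ConcurrentHashMap<URI, WeakReference<FileSystem>>();

        public FileSystem lookup(URI uri) {
            WeakReference<FileSystem> ref = cache.get(uri);
            return ref == null ? null : ref.get(); // null once the GC has reclaimed it
        }

        public void store(URI uri, FileSystem fs) {
            cache.put(uri, new WeakReference<FileSystem>(fs));
        }
    }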
Re: "Too many open files" error, which gets resolved after some time
jason hadoop wrote: Yes. Otherwise the file descriptors will flow away like water. I also strongly suggest having at least 64k file descriptors as the open file limit. On Sun, Jun 21, 2009 at 12:43 PM, Stas Oskin wrote: Hi. Thanks for the advice. So you advise explicitly closing each and every file handle that I receive from HDFS? Regards.

I must disagree somewhat. If you use FileSystem.get() to get your client filesystem class, then that is shared by all threads/classes that use it. Call close() on that, and any other thread or class holding a reference is in trouble: you have to wait for the finalizers for them to get cleaned up. If you use FileSystem.newInstance() -which came in fairly recently (0.20? 0.21?)- then you can call close() safely. So: it depends on how you get your handle. See: https://issues.apache.org/jira/browse/HADOOP-5933 Also: the too-many-open-files problem can be caused in the NN -you need to set up the kernel to have lots more file handles around. Lots.
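The distinction, as a short sketch (FileSystem.newInstance() subject to the version caveat above):

    Configuration conf = new Configuration();

    FileSystem shared = FileSystem.get(conf);       // cached, shared by all callers
    // do NOT call shared.close(): other threads hold the same object

    FileSystem mine = FileSystem.newInstance(conf); // private, uncached instance
    try {
        // ... this thread's work against HDFS ...
    } finally {
        mine.close();                               // safe: closes only this instance
    }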
Re: Name Node HA (HADOOP-4539)
Andrew Wharton wrote: https://issues.apache.org/jira/browse/HADOOP-4539 I am curious about the state of this fix. It is listed as "Incompatible", but is resolved and committed (according to the comments). Is the backup name node going to make it into 0.21? Will it remove the SPOF for HDFS? And if so, what is the proposed release timeline for 0.21? The way to deal with HA -which the BackupNode doesn't promise- is to get involved in developing and testing the leading edge source tree. The 0.21 cutoff is approaching, BackupNode is in there, but it needs a lot more tests. If you want to aid the development, helping to get more automated BackupNode tests in there (indeed, tests that simulate more complex NN failures, like a corrupt EditLog) would go a long way. -steve
Re: Hadoop Eclipse Plugin
Praveen Yarlagadda wrote: Hi, I have a problem configuring the Hadoop Map/Reduce plugin with Eclipse. Setup details: I have a namenode, a jobtracker and two datanodes, all running on Ubuntu. My setup works fine with the example programs. I want to connect to this setup from Eclipse. namenode - 10.20.104.62 - 54310 (port) jobtracker - 10.20.104.53 - 54311 (port) I run Eclipse on a different Windows machine. I want to configure the Map/Reduce plugin with Eclipse, so that I can access HDFS from Windows. Map/Reduce master: with the jobtracker IP and port, it did not work. DFS master: with the namenode IP and port, it did not work. I tried other combinations too, giving namenode details for the Map/Reduce master and jobtracker details for the DFS master. It did not work either.

1. Check the ports really are open by running netstat on the namenode and job tracker:
netstat -a -p | grep 54310 on the NN
netstat -a -p | grep 54311 on the JT

2. Then, from the Windows machine, see if you can connect to them outside Eclipse:
telnet 10.20.104.62 54310
telnet 10.20.104.53 54311

If you can't connect, then firewalls are interfering. If everything works, the problem is in the Eclipse plugin (which I don't use, and cannot assist with).

-- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: Running Hadoop/Hbase in a OSGi container
Ninad Raut wrote: OSGi provides navigability to your components and creates a life cycle for each of those components, viz. install, start, stop, undeploy etc. This is the reason why we are thinking of creating components using OSGi. The problem we are facing is that our components use MapReduce and HDFS, and the OSGi container cannot detect the Hadoop mapred engine or HDFS. I have searched the net and it looks like people are working on, or have achieved success in, running Hadoop in an OSGi container. Ninad

1. I am doing work on a simple lifecycle for the services -start/stop/ping- which is not OSGi (OSGi worries a lot about classloading and versioning); check out HADOOP-3628 for this.

2. You can run it under OSGi systems, such as the OSGi branch of SmartFrog: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/branches/core-branch-osgi/, or under non-OSGi tools. Either way, these tools are left dealing with classloading and the like.

3. Any container is going to have to deal with the problem that bits of all the services call System.exit() -by running under a security manager, trapping the call, raising an exception, etc. (see the sketch below).

4. Any container is then going to have to deal with the fact that from 0.20 onwards, Hadoop does things with security policy that are incompatible with normal Java security managers: whatever security manager you have for trapping system exits can't extend the default one.

5. Any container also has to deal with every service (namenode, job tracker, etc.) making a lot of assumptions about singletons -that they have exclusive use of filesystem objects retrieved through FileSystem.get(), and the like. While OSGi can handle that with its classloading work, it's still fairly complex.

6. There are also lots of JVM memory/thread management issues; see the various Hadoop bugs.

If you look at the slides of what I've been up to, you can see that it can be done: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/hadoop/doc/dynamic_hadoop_clusters.ppt However,
* you really need to run every service in its own process, for memory and reliability alone
* it's pretty leading edge
* you will have to invest the time and effort to get it working

If you want to do the work, start with what I've been doing and bring it up under the OSGi container of your choice. You can come and play with our tooling; I'm cutting a release today of this week's Hadoop trunk merged with my branch. It is of course experimental, as even the trunk is a bit up-and-down on feature stability. -steve
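A minimal sketch of the exit-trapping security manager mentioned in point 3 -this is the generic Java pattern, not Hadoop or SmartFrog code, and point 4's caveat about Hadoop's own security policy still applies:

    public class NoExitSecurityManager extends SecurityManager {
        @Override
        public void checkExit(int status) {
            // turn the exit into something the container can catch
            throw new SecurityException("System.exit(" + status + ") trapped");
        }
        @Override
        public void checkPermission(java.security.Permission perm) {
            // permit everything else
        }
    }

Install it with System.setSecurityManager(new NoExitSecurityManager()) before starting the embedded service.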
Re: Multiple NIC Cards
John Martyniak wrote: Does Hadoop "cache" the server names anywhere? Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values.

In SVN head, we try and get Java to tell us what is going on: http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java This uses InetAddress.getLocalHost().getCanonicalHostName() to get the value, which is cached for the life of the process. I don't know of anything else, but I wouldn't be surprised -the namenode has to remember the machines where stuff was stored.
Re: Multiple NIC Cards
John Martyniak wrote: When I run either of those on either of the two machines, it is trying to resolve against the DNS servers configured for the external addresses of the box. Here is the result:

Server: xxx.xxx.xxx.69
Address: xxx.xxx.xxx.69#53

OK. In an ideal world, each NIC has a different hostname. Now, that confuses code that assumes a host has exactly one hostname, not zero or two, and I'm not sure how well Hadoop handles the 2+ situation (I know it doesn't like 0, but hey, it's a distributed application). With separate hostnames, you set Hadoop up to work on the inner addresses, and give out the inner hostnames of the jobtracker and namenode. As a result, all traffic to the master nodes should be routed on the internal network.
Re: Multiple NIC Cards
John Martyniak wrote: I am running Mac OS X. So en0 points to the external address and en1 points to the internal address on both machines. Here are the internal results from duey:

en1: flags=8963 mtu 1500
inet6 fe80::21e:52ff:fef4:65%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.102 netmask 0xff00 broadcast 192.168.1.255
ether 00:1e:52:f4:00:65
media: autoselect (1000baseT) status: active
lladdr 00:23:32:ff:fe:1a:20:66
media: autoselect status: inactive
supported media: autoselect

Here are the internal results from huey:

en1: flags=8863 mtu 1500
inet6 fe80::21e:52ff:fef3:f489%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.103 netmask 0xff00 broadcast 192.168.1.255

What do nslookup 192.168.1.103 and nslookup 192.168.1.102 say? There really ought to be different names for them.

> I have some other applications running on these machines, that communicate across the internal network and they work perfectly.

I admire their strength. Multihomed systems cause us trouble. That, and machines that don't quite know who they are:
http://jira.smartfrog.org/jira/browse/SFOS-5
https://issues.apache.org/jira/browse/HADOOP-3612
https://issues.apache.org/jira/browse/HADOOP-3426
https://issues.apache.org/jira/browse/HADOOP-3613
https://issues.apache.org/jira/browse/HADOOP-5339

One thing to consider is that some of the various services of Hadoop are bound to 0.0.0.0, which means every IPv4 address. You really want to bring up everything, including the Jetty services, on the internal adapter, by binding them to 192.168.1.102; this will cause anyone trying to talk to them over the other network to fail, which at least finds the problem sooner rather than later.
Re: Multiple NIC Cards
John Martyniak wrote: My original names were huey-direct and duey-direct, both names in the /etc/hosts file on both machines. Are nn.internal and jt.internal special names?

No, just examples: on a multihost network your external names could be something completely different. What does /sbin/ifconfig say on each of the hosts?
Re: Multiple NIC Cards
John Martyniak wrote: David, for option #1, I just changed the names to the IP addresses, and it still comes up as the external name and IP address in the log files and on the jobtracker screen. So option 1 is a no-go. When I change the "dfs.datanode.dns.interface" values, it doesn't seem to do anything. When I was searching archived mail, this seemed to be the approach to change the NIC card being used for resolution. But when I change it nothing happens; I even put in bogus values and still nothing. -John

I've been having similar but different fun with Hadoop-on-VMs; there are a lot of assumptions in the code that DNS and rDNS all work consistently. Do you have separate internal and external hostnames? In which case, can you bring up the job tracker as jt.internal, the namenode as nn.internal (so the full HDFS URL is something like hdfs://nn.internal/), etc.?
Re: Every time the mapping phase finishes I see this
Mayuran Yogarajah wrote: There are always a few 'Failed/Killed Task Attempts', and when I view the logs for these I see: - some that are empty, i.e. the stdout/stderr/syslog logs are all blank - several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Filesystem closed
at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)
at java.io.FilterInputStream.close(FilterInputStream.java:159)
at org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Any idea why this happens? I don't understand why I'd be seeing these only as the mappers get to 100%.

I've seen this when something in the same process got a FileSystem reference via FileSystem.get() and then called close() on it -that closes the client for every thread/class holding a reference to the same object. We're planning on adding more diagnostics, by tracking who closed the filesystem: https://issues.apache.org/jira/browse/HADOOP-5933
Re: Hadoop scheduling question
Aaron Kimball wrote: Finally, there's a third scheduler called the Capacity scheduler. It's similar to the fair scheduler, in that it allows guarantees of minimum availability for different pools. I don't know how it apportions additional extra resources though -- this is the one I'm least familiar with. Someone else will have to chime in here.

There's a dynamic priority scheduler in the patch queue, which I've promised to commit this week. It's the one with a notion of currency: you pay for your work/priority, and at peak times work costs more. https://issues.apache.org/jira/browse/HADOOP-4768
Re: Monitoring hadoop?
Matt Massie wrote: Anthony- The Ganglia web site is at http://ganglia.info/ with documentation in a wiki at http://ganglia.wiki.sourceforge.net/. There is also a good wiki page at IBM as well: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia . Ganglia packages are available for most distributions to help with installation, so make sure to grep for ganglia with your favorite package manager (e.g. aptitude, yum, etc). Ganglia will give you more information about your cluster than just Hadoop metrics: you'll get CPU, load, memory, disk and network monitoring as well, for free. You can see live demos of Ganglia at http://ganglia.info/?page_id=69. Good luck. -Matt

Out of modesty, Matt neglects to mention that Ganglia is one of his projects, so not only does it work well with Hadoop today, I would expect the integration to only get better over time. Anthony -don't forget to feed those stats back into your DFS for later analysis...
Re: Hadoop ReInitialization.
b wrote: But after formatting and starting DFS I need to wait some time (sleep 60) before putting data into HDFS; otherwise I receive "NotReplicatedYetException".

That means the namenode is up, but there aren't enough datanodes checked in yet.
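Rather than a fixed sleep, you can block until the namenode has left safe mode, which roughly tracks datanodes reporting in. From the shell, hadoop dfsadmin -safemode wait does this; a rough in-code equivalent (a sketch, assuming the 0.20-era HDFS client classes) is:

    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
        Thread.sleep(5000); // still in safe mode; check again shortly
    }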
Re: question about when shuffle/sort start working
Todd Lipcon wrote: Hi Jianmin, This is not (currently) supported by Hadoop (or Google's MapReduce either, afaik). What you're looking for sounds more like Microsoft's Dryad. One thing that is supported in versions of Hadoop after 0.19 is JVM reuse. If you enable this feature, task trackers will persist JVMs between jobs. You can then persist some state in static variables. I'd caution you, however, from making too much use of this fact as anything but an optimization. The reason that Hadoop is limited to MR (or M+RM* as you said) is that simplicity and reliability often go hand in hand. If you start maintaining important state in RAM on the tasktracker JVMs, and one of them goes down, you may need to restart your entire job sequence from the top. In typical MapReduce, you may need to rerun a mapper or a reducer, but the state is all on disk, ready to go. -Todd

I'd thought the question is not necessarily one of maintaining state, but of chaining the output from one job into another, where the number of iterations depends on the outcome of the previous set. Funnily enough, this is what you (apparently) end up having to do when implementing PageRank-like ranking as MR jobs: http://skillsmatter.com/podcast/cloud-grid/having-fun-with-pagerank-and-mapreduce
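A sketch of that chaining pattern against the 0.19-era API; the job class and the convergence counter here are illustrative, not from the original posts:

    boolean converged = false;
    int iteration = 0;
    while (!converged) {
        JobConf conf = new JobConf(PageRankJob.class);   // hypothetical job class
        FileInputFormat.setInputPaths(conf, new Path("rank/" + iteration));
        FileOutputFormat.setOutputPath(conf, new Path("rank/" + (iteration + 1)));
        RunningJob job = JobClient.runJob(conf);         // blocks until completion
        long changed = job.getCounters()
                .getCounter(RankCounters.CHANGED);       // hypothetical enum counter
        converged = (changed == 0);
        iteration++;
    }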
Re: org.apache.hadoop.ipc.client : trying connect to server failed
ashish pareek wrote: Yes, I am able to ping and ssh between the two virtual machines, and I have even set the IP addresses of both virtual machines in their respective /etc/hosts files... thanks for the reply... do suggest anything else I could have missed, or any remedy. Regards, Ashish Pareek.

VMs? VMware? Xen? Something else? I've encountered problems on virtual networks where the machines aren't locatable via DNS, and can't be sure who they say they are.
1. Start the machines individually, instead of via the start-all script, which needs to have SSH working too.
2. Check with netstat -a to see what ports/interfaces they are listening on.
-steve
Re: hadoop hardware configuration
Patrick Angeles wrote: Sorry for cross-posting, I realized I sent the following to the hbase list when it's really more a Hadoop question.

This is an interesting question. Obviously as an HP employee you must assume that I'm biased when I say HP DL160 servers are good value for the workers, though our blade systems are very good for a high physical density -provided you have the power to fill up the rack.

2 x Hadoop Master (and Secondary NameNode)
- 2 x 2.3GHz Quad Core (Low Power Opteron 2376 HE @ 55W)
- 16GB DDR2-800 Registered ECC Memory
- 4 x 1TB 7200rpm SATA II Drives
- Hardware RAID controller
- Redundant Power Supply
- Approx. 390W power draw (1.9 amps at 208V)
- Approx. $4000 per unit

I do not know what the advantages of that many cores are on a NN. Someone needs to do some experiments. I do know you need enough RAM to hold the index in memory, and you may want to go for a bigger block size to keep the index size down.

6 x Hadoop Task Nodes
- 1 x 2.3GHz Quad Core (Opteron 1356)
- 8GB DDR2-800 Registered ECC Memory
- 4 x 1TB 7200rpm SATA II Drives
- No RAID (JBOD)
- Non-Redundant Power Supply
- Approx. 210W power draw (1.0 amps at 208V)
- Approx. $2000 per unit

I had some specific questions regarding this configuration...

1. Is hardware RAID necessary for the master node?

You need a good story to ensure that loss of a disk on the master doesn't lose the filesystem. I like RAID there, but the alternative is to push the stuff out over the network to other storage you trust. That could be NFS-mounted RAID storage; it could be NFS-mounted JBOD. Whatever your chosen design, test that it works before you go live: run the cluster, then simulate different failures and see how well the hardware/ops team handles them. Keep an eye on where that data goes, because when the NN runs out of file storage, the consequences can be pretty dramatic (i.e. the cluster doesn't come up unless you edit the edit log by hand).

2. What is a good processor-to-storage ratio for a task node with 4TB of raw storage? (The config above has 1 core per 1TB of raw storage.)

That really depends on the work you are doing... the bytes in/out to CPU work, and the size of any memory structures that are built up over the run. With 1 core per physical disk, you get the bandwidth of a single disk per CPU; for some IO-intensive work you can make the case for two disks per CPU -one in, one out- but then you are using more power, and if/when you want to add more storage, you have to pull out the disks to stick in new ones. If you go for more CPUs, you will probably need more RAM to go with them.

3. Am I better off using dual quads for a task node, with a higher power draw? A dual quad task node with 16GB RAM and 4TB storage costs roughly $3200, but draws almost 2x as much power. The tradeoffs are: 1. I will get more CPU per dollar and per watt. 2. I will only be able to fit 1/2 as many dual quad machines into a rack. 3. I will get 1/2 the storage capacity per watt. 4. I will get less I/O throughput overall (fewer spindles per core).

First there is the algorithm itself, and whether you are IO- or CPU-bound. Most MR jobs that I've encountered are fairly IO-bound -without indexes, every lookup has to stream through all the data, so it's power-inefficient and IO-limited,
but if you are trying to do higher-level stuff than just lookup, then you will be doing more CPU work. Then there is the question of where your electricity comes from, what the limits for the room are, whether you are billed on power drawn or quoted PSU draw, what the HVAC limits are, what the maximum allowed weight per rack is, etc. I'm a fan of low-Joule work, though we don't have any benchmarks yet of the power efficiency of different clusters: the number of MJ used to do a terasort. I'm debating doing some single-CPU tests for this on my laptop, as the battery knows how much gets used up by some work.

4. In planning storage capacity, how much spare disk space should I take into account for 'scratch'? For now, I'm assuming 1x the input data size.

That you should probably be able to determine from experimental work on smaller datasets. Some maps can throw out a lot of data; most reduces do actually reduce the final amount.

-Steve

(Disclaimer: I'm not making any official recommendations for hardware here, just making my opinions known. If you do want an official recommendation from HP, talk to your reseller or account manager; someone will look at your problem in more detail and make some suggestions. If you have any code/data that could be shared for benchmarking, that would help validate those suggestions.)
Re: ssh issues
hmar...@umbc.edu wrote: Steve, Security through obscurity is always a good practice from a development standpoint, and one of the reasons why tricking you out is an easy task. :) My most recent presentation on HDFS clusters is now online; notice how it doesn't gloss over the security: http://www.slideshare.net/steve_l/hdfs-issues Please, keep hiding relevant details from people in order to keep everyone smiling.

HDFS is as secure as NFS: you are trusted to be who you say you are. Which means that you have to run it on a secured subnet -access restricted to trusted hosts and/or one or two front-end servers- or accept that your dataset is readable and writable by anyone on the network. There is user identification going in; it is currently at the level where it will stop someone accidentally deleting the entire filesystem if they lack the rights. Which has been known to happen. If the team looking after the cluster demands separate SSH keys/logins for every machine, then not only are they making their operations costs high; once you have got the HDFS cluster and MR engine live, it's moot. You can push out work to the JobTracker, which then runs it on the machines, under whatever userid the TaskTrackers are running as. Now, 0.20+ will run it under the identity of the user who claimed to be submitting the job, but without that, your MR jobs get the filesystem access rights of the user that is running the TT. And it's fairly straightforward to create a modified Hadoop client JAR that doesn't call whoami to get the userid, and instead spoofs being anyone. Which means that even if you lock down the filesystem -no out-of-datacentre access- if I can run my Java code as MR jobs in your cluster, I can have unrestricted access to the filesystem by way of the TaskTracker. But Hal, if you are running Ant for your build, I'm running my code on your machines anyway, so you had better be glad that I'm not malicious. -Steve
Re: ssh issues
Pankil Doshi wrote: Well, I made SSH keys with passphrases, as the systems I need to log in to require SSH with passphrases, and those systems have to be part of my cluster. So I need a way to specify -i /path/to/key and the passphrase to Hadoop beforehand. Pankil

Well, you are trying to manage a system whose security policy is incompatible with Hadoop's current shell scripts. If you push out the configs and manage the lifecycle using other tools, this becomes a non-issue. Don't raise the topic of HDFS security with your ops team though, as they will probably be unhappy about what is currently on offer. -steve
Re: Username in Hadoop cluster
Pankil Doshi wrote: Hello everyone, Till now I was using the same username on all my Hadoop cluster machines. But now I am building my new cluster and face a situation in which I have different usernames for different machines. What changes will I have to make in configuring Hadoop? Using the same username, ssh was easy; will having different usernames be a problem?

Are you building these machines up by hand? How many? Why the different usernames? Can't you just create a new user and group "hadoop" on all the boxes?
Re: Optimal Filesystem (and Settings) for HDFS
Bryan Duxbury wrote: We use XFS for our data drives, and we've had somewhat mixed results.

Thanks for that. I've just created a wiki page to put some of these notes up -extensions and some hard data would be welcome: http://wiki.apache.org/hadoop/DiskSetup One problem we have for hard data is that we need some different benchmarks for MR jobs. Terasort is good for measuring IO and MR framework performance, but for more CPU-intensive algorithms, or things that need to seek around a bit more, you can't be sure that terasort benchmarks are a good predictor of what's right for you in terms of hardware, filesystem, etc. Contributions in this area would be welcome. I'd like to measure the power consumed on a run too, which is actually possible as far as my laptop is concerned, because you can ask its battery what happened. -steve
Re: Suspend or scale back hadoop instance
John Clarke wrote: Hi, I am working on a project that is suited to Hadoop and so want to create a small cluster (only 5 machines!) on our servers. The servers are however used during the day and (mostly) idle at night. So, I want Hadoop to run at full throttle at night and either scale back or suspend itself during certain times.

You could add/remove new task trackers on idle systems, but
* you don't want to take away datanodes, as there's a risk that data will become unavailable
* there's nothing in the scheduler to warn that machines will go away at a certain time
If you only want to run the cluster at night, I'd just configure the entire cluster to go up and down.
Re: Is there any performance issue with Jrockit JVM for Hadoop
Tom White wrote: On Mon, May 18, 2009 at 11:44 AM, Steve Loughran wrote: Grace wrote: To follow up this question, I have also asked for help on the JRockit forum. They kindly offered some useful and detailed suggestions based on the JRA results. After updating the option list, the performance did become better to some extent, but it is still not comparable with the Sun JVM. Maybe it is due to the short-duration use case and the different implementations at the JVM layer between Sun and JRockit. I will go back to using the Sun JVM for now. Thanks all for your time and help.

What about flipping the switch that says "run tasks in the TT's own JVM"? That should handle startup costs, and reduce the memory footprint.

The property mapred.job.reuse.jvm.num.tasks allows you to set how many tasks the JVM may be reused for (within a job), but it always runs in a separate JVM from the tasktracker. (BTW, https://issues.apache.org/jira/browse/HADOOP-3675 has some discussion about running tasks in the tasktracker's JVM.) Tom

Tom, that's why you are writing a book on Hadoop and I'm not... you know the answers and I have some vague misunderstandings. -steve (returning to the svn book)
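In code, the property Tom names maps to a JobConf setter (0.19+); -1 means "reuse without limit within the job":

    JobConf conf = new JobConf();
    conf.setNumTasksToExecutePerJvm(-1); // mapred.job.reuse.jvm.num.tasks = -1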
Re: Is there any performance issue with Jrockit JVM for Hadoop
Grace wrote: To follow up this question, I have also asked for help on the JRockit forum. They kindly offered some useful and detailed suggestions based on the JRA results. After updating the option list, the performance did become better to some extent, but it is still not comparable with the Sun JVM. Maybe it is due to the short-duration use case and the different implementations at the JVM layer between Sun and JRockit. I will go back to using the Sun JVM for now. Thanks all for your time and help.

What about flipping the switch that says "run tasks in the TT's own JVM"? That should handle startup costs, and reduce the memory footprint.
Re: Beware sun's jvm version 1.6.0_05-b13 on linux
Allen Wittenauer wrote: On 5/15/09 11:38 AM, "Owen O'Malley" wrote: We have observed that the default jvm on RedHat 5 -- I'm sure some people are scratching their heads at this. The default JVM on at least RHEL5u0/1 is a GCJ-based 1.4, clearly incapable of running Hadoop. We [and, really, this is my doing... ^.^ ] replace it with the JVM from the JPackage folks. So while this isn't the default JVM that comes from RHEL, the warning should still be heeded.

Presumably it's one of those hard-to-reproduce race conditions that only surfaces under load on a big cluster, so is hard to replicate in a unit test, right?
Re: public IP for datanode on EC2
Tom White wrote: Hi Joydeep, The problem you are hitting may be because port 50001 isn't open, whereas from within the cluster any node may talk to any other node (because the security groups are set up to do this). However, I'm not sure this is a good approach. Configuring Hadoop to use public IP addresses everywhere should work, but you have to pay for all data transfer between nodes (see http://aws.amazon.com/ec2/, "Public and Elastic IP Data Transfer"). This is going to get expensive fast! So to get this to work well, we would have to make changes to Hadoop so it was aware of both public and private addresses, and used the appropriate one: clients would use the public address, while daemons would use the private address. I haven't looked at what it would take to do this or how invasive it would be.

I thought that AWS had stopped you being able to talk to things within the cluster using the public IP addresses -stopped you using DynDNS as your way of bootstrapping discovery. Here's what may work:
-bring up the EC2 cluster using the local names
-open up the ports
-have the clients talk using the public IP addresses
The problem will arise when the namenode checks the fs name used and it doesn't match its expectations -there were some recent patches in the code to handle someone talking to the namenode using the IP address instead of the hostname; they may work for this situation too. Personally, I wouldn't trust the NN in the EC2 datacentres to be secure against external callers, but that problem already exists within their datacentres anyway.
Re: How to do load control of MapReduce
Stefan Will wrote: Yes, I think the JVM uses way more memory than just its heap. Now some of it might be just reserved memory, but not actually used (not sure how to tell the difference). There are also things like thread stacks, jit compiler cache, direct nio byte buffers etc. that take up process space outside of the Java heap. But none of that should imho add up to Gigabytes... good article on this http://www.ibm.com/developerworks/linux/library/j-nativememory-linux/
Re: How to do load control of MapReduce
zsongbo wrote: Hi Stefan, Yes, 'nice' cannot resolve this problem. Now, in my cluster, there is 8GB of RAM. My Java heap configuration is:
HDFS DataNode: 1GB
HBase RegionServer: 1.5GB
MR TaskTracker: 1GB
MR child: 512MB (max child tasks is 6: 4 map tasks + 2 reduce tasks)
But the memory usage is still tight.

Does the TT need to be so big if you are running all your work in external VMs?
Re: Huge DataNode Virtual Memory Usage
Stefan Will wrote: Raghu, I don't actually have exact numbers from jmap, although I do remember that jmap -histo reported something less than 256MB for this process (before I restarted it). I just looked at another DFS process that is currently running and has a VM size of 1.5GB (~600MB resident). Here jmap reports a total object heap usage of 120MB. The memory block list reported by jmap doesn't actually seem to contain the heap at all, since the largest block in that list is 10MB in size (/usr/java/jdk1.6.0_10/jre/lib/amd64/server/libjvm.so). However, pmap reports a total usage of 1.56GB. -- Stefan

You know, if you could get the TaskTracker to include stats on real and virtual memory use, I'm sure that others would welcome those reports -knowing that the job was slower and its VM was 2x physical would give you a good hint as to the root cause.
Re: Winning a sixty second dash with a yellow elephant
Arun C Murthy wrote: ... oh, and getting it to run a marathon too! http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html Owen & Arun Lovely. I will now stick up the pic of you getting the first results in on your laptop at apachecon
Re: Re-Addressing a cluster
jason hadoop wrote: Now that I think about it, the reverse lookups in my clusters work. and you have made sure that IPv6 is turned off, right?
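One common way to force that (an assumption about their setup, not something from this thread) is to add -Djava.net.preferIPv4Stack=true to HADOOP_OPTS in hadoop-env.sh, which keeps the JVM off the IPv6 stack entirely.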
Re: datanode replication
Jeff Hammerbacher wrote: Hey Vishal, Check out the chooseTarget() method(s) of ReplicationTargetChooser.java in the org.apache.hadoop.hdfs.server.namenode package: http://svn.apache.org/viewvc/hadoop/core/trunk/src/hdfs/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java?view=markup . In words: assuming you're using the default replication level (3), the default strategy will put one replica on the local node, one on a node in a remote rack, and another on a different node in that same remote rack. Note that HADOOP-3799 (http://issues.apache.org/jira/browse/HADOOP-3799) proposes making this strategy pluggable.

Yes, there are some good reasons for having different placement algorithms for different datacentres, and I could even imagine different MR sequences providing hints about where they want data, depending on what they want to do afterwards.
Re: Re-Addressing a cluster
jason hadoop wrote: You should be able to relocate the cluster's IP space by stopping the cluster, modifying the configuration files, resetting the DNS and starting the cluster. It would be best to verify connectivity with the new IP addresses before starting the cluster. To the best of my knowledge the namenode doesn't care about the IP addresses of the datanodes, only what blocks they report as having. The namenode does care about losing contact with a connected datanode, replicating the blocks that are now under-replicated. I prefer IP addresses in my configuration files, but that is a personal preference, not a requirement.

I do deployments onto virtual clusters without fully functional reverse DNS; things do work badly in that situation. Hadoop assumes that if a machine looks up its hostname, it can pass that to peers and they can resolve it -the "well managed network infrastructure" assumption.
Re: Is there any performance issue with Jrockit JVM for Hadoop
Grace wrote: Thanks all for your replies. I have run several times with different Java options for the Map/Reduce tasks, but there is not much difference. Following is an example of my test settings:
Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages -XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPrefetch
Test B: -Xmx1024m
Test C: -Xmx1024m -XXaggressive
Is there any tricky or special setting for the JRockit VM on Hadoop? In the Hadoop Quick Start guide, it says "Java 1.6.x, preferably from Sun". Is there any known concern about JRockit performance?

The main thing is that all the big clusters are running (as far as I know) Linux (probably Red Hat) and Sun Java. This is where the performance and scale testing is done. If you are willing to spend time doing the experiments and tuning, then I'm sure we can update those guides to say "JRockit works, here are some options...". -steve
Re: Is there any performance issue with Jrockit JVM for Hadoop
Chris Collins wrote: A couple of years back we did a lot of experimentation between Sun's VM and JRockit. We had initially assumed that JRockit was going to scream, since that's what the press were saying. In short, what we discovered was that certain JDK library usage was a little bit faster with JRockit, but for core VM performance such as synchronization and primitive operations the Sun VM outperformed it. We were not taking account of startup time, just raw code execution. As I said, this was a couple of years back, so things may have changed. C

I run JRockit as it's what some of our key customers use, and we need to test things. One lovely feature: tests time out before the stack runs out on a recursive operation; clearly different stack management at work. Another: no PermGen heap space to fiddle with.
* I have to turn debug logging off in Hadoop test runs, or there are problems.
* It uses short pointers (32 bits long) for near memory on a 64-bit JVM, so your memory footprint on sub-4GB VM images is better. Java 7 promises this, and with the merger, who knows what we will see. This is unimportant on 32-bit boxes.
* Debug single-stepping doesn't work. That's OK, I use functional tests instead :)
I haven't looked at outright performance.
Re: move tasks to another machine on the fly
Tom White wrote: Hi David, The MapReduce framework will attempt to rerun failed tasks automatically. However, if a task is running out of memory on one machine, it's likely to run out of memory on another, isn't it? Have a look at the mapred.child.java.opts configuration property for the amount of memory that each task VM is given (200MB by default). You can also control the memory that each daemon gets using the HADOOP_HEAPSIZE variable in hadoop-env.sh. Or you can specify it on a per-daemon basis using the HADOOP__OPTS variables in the same file. Tom

This looks not so much like a VM out-of-memory problem as OS thread provisioning. ulimit may be useful, as is the java -Xss option: http://candrews.integralblue.com/2009/01/preventing-outofmemoryerror-native-thread/

On Wed, May 6, 2009 at 1:28 AM, David Batista wrote: I get this error when running Reduce tasks on a machine:

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2591)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:387)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117)
at org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)

Is it possible to move a reduce task to another machine in the cluster on the fly? -- ./david
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
Edward Capriolo wrote: 'Cloud computing' is a hot term. According to the definition provided by Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing), Hadoop+HBase+Lucene+Zookeeper fits some of the criteria, but not well. Hadoop is scalable; with HOD it is dynamically scalable. I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for 'utility computing', as managing the stack and getting started is quite a complex process.

Exactly. Which is why the Apache Clouds proposal emphasises:
-Lightweight front end: low-wattage, stateless nodes for the web GUI, bonded to the back end
-Instrumentation for liveness and load monitoring. Hadoop has a lot of this, I'm trying to add more, but we want it everywhere.
-Resource management: bringing up and tearing down nodes by asking the infrastructure. Some Apache projects have done this, but only for EC2 and only for their layer of the stack. You need something that keeps track of everything and acts in your interests, not those of the datacentre provider.
-Packaging for fully automated install/deploy on Linux systems (=rpm and deb)
-A development process in which the tools push the code out to a targeted infrastructure even for test runs
Hadoop and friends are part of this; they are a very interesting foundation, but they are only part of the story.

Also, this stack is best run on a LAN with high-speed interlinks. Historically the "Cloud" is composed of WAN links. An implication of cloud computing is that different services would be running in different geographical locations, which is not how Hadoop is normally deployed. I believe 'Apache Grid Stack' would be more fitting. http://en.wikipedia.org/wiki/Grid_computing Grid computing (or the use of computational grids) is the application of several computers to a single problem at the same time -usually a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.

Classic grid computing -OGSI/OGSA- is something I want to steer clear of. Historically, you end up in WS-* and computer management politics. Furthermore, OGSA never had a good use case except "rewrite your apps for the cloud and they will be better". They (let's be fair, we) also focused too much on CPU scheduling, not on storage.

Grid computing via the Wikipedia definition describes exactly what Hadoop does. Without Amazon S3 and EC2, Hadoop does not fit well into 'cloud computing' IMHO.

To be precise: without a dynamic infrastructure provider -which is more than just AWS: it could be Sun/Oracle, IBM/Google, HP/Intel/Yahoo!, it could be your ops team and Eucalyptus. The other hardware/service vendors are working on this infrastructure. Apache doesn't work at that level, but if we provide the code to run on all of them, we give the users independence from any particular infrastructure provider.
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
Bradford Stephens wrote: Hey all, I'm going to be speaking at OSCON about my company's experiences with Hadoop and friends, but I'm having a hard time coming up with a name for the entire software ecosystem. I'm thinking of calling it the "Apache CloudStack". Does this sound legit to you all? :) Is there something more 'official'?

We've been using "Apache Cloud Computing Edition" for this, to emphasise that this is the successor to Java Enterprise Edition, and that it is cross-language and being built at Apache. If you use the same term, even if you put up a different stack outline than us, it gives the idea more legitimacy. The slides that Andrew linked to are all in SVN under http://svn.apache.org/repos/asf/labs/clouds/ We have a space in the Apache labs for "Apache Clouds", where we want to do more work integrating things, and bringing the idea of deploying and testing on someone else's infrastructure mainstream across all the Apache products. We would welcome your involvement -and if you send a draft of your slides out, we will happily review them. -steve
Re: I need help
Razen Alharbi wrote: Thanks everybody, the issue was that Hadoop writes all the outputs to stderr instead of stdout, and I don't know why. I would really love to know why the usual Hadoop job progress is written to stderr.

Because there is a line in log4j.properties telling it to do just that:
log4j.appender.console.target=System.err

-- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: Can i make a node just an HDFS client to put/get data into hadoop
Usman Waheed wrote: Hi All, Is it possible to make a node just a hadoop client so that it can put/get files into HDFS but not act as a namenode or datanode? I already have a master node and 3 datanodes but need to execute puts/gets into hadoop in parallel using more than just one machine other than the master. Anything on the LAN can be a client of the filesystem, you just need appropriate hadoop configuration files to talk to the namenode and job tracker. I don't know how well the (custom) IPC works over long distances, and you have to keep the versions in sync for everything to work reliably.
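A minimal sketch of such a client (the namenode URI is illustrative; in practice you copy the cluster's hadoop-site.xml onto the client box and the address comes from there):

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode.example.com:54310");
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/data.txt"),
                         new Path("/user/me/data.txt"));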
Re: programming java ee and hadoop at the same time
Bill Habermaas wrote: George, I haven't used the Hadoop perspective in Eclipse so I can't help with that specifically, but map/reduce is a batch process (and can be long running). In my experience, I've written servlets that write to HDFS and then have a background process perform the map/reduce. They can both run in the background under Eclipse but are not tightly coupled.

I've discussed this recently -having a good binding from webapps to the back end, where the back end consists of HDFS, MapReduce queues, and things around them: https://svn.apache.org/repos/asf/labs/clouds/src/doc/apache_cloud_computing_edition_oxford.odp If people are willing to help with this, we have an Apache "lab" project, Apache Clouds, ready for your code, tests and ideas.
Re: I need help
Razen Al Harbi wrote: Hi all, I am writing an application in which I create a forked process to execute a specific Map/Reduce job. The problem is that when I try to read the output stream of the forked process I get nothing, yet when I execute the same job manually it starts printing the output I am expecting. For clarification I will go through the simple code snippet:

Process p = rt.exec("hadoop jar GraphClean args");
BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = null;
check = true;
while (check) {
    line = reader.readLine();
    if (line != null) { // I know this will not finish, it's only for testing
        System.out.println(line);
    }
}

If I run this code nothing shows up. But if I execute the command (hadoop jar GraphClean args) from the command line, it works fine. I am using Hadoop 0.19.0.

Why not just invoke the Hadoop job submission calls yourself? There is no need to exec anything. Look at org.apache.hadoop.util.RunJar to see what you need to do. Avoid calling RunJar.main() directly as:
- it calls System.exit() when it wants to exit with an error
- it adds shutdown hooks
-steve
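Two hedged sketches of ways out. First, if you do keep the fork: as the "I need help" reply above notes, Hadoop's job progress goes to stderr, so read the error stream too:

    BufferedReader err = new BufferedReader(
        new InputStreamReader(p.getErrorStream()));

Second, an outline of in-process submission with the old mapred API (GraphClean's real configuration is unknown, so the paths and job class here are illustrative):

    JobConf conf = new JobConf(GraphClean.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // blocks, reporting progress in-process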
Re: Storing data-node content to other machine
Vishal Ghawate wrote: Hi, I want to store the contents of all the client machines (datanodes) of the Hadoop cluster on a centralized machine with high storage capacity, so that the tasktracker runs on the client machine but the contents are stored on the centralized machine. Can anybody help me with this, please?

Set the datanode to point to the (mounted) filesystem with the dfs.data.dir parameter.
Re: Processing High CPU & Memory intensive tasks on Hadoop - Architecture question
Aaron Kimball wrote: I'm not aware of any documentation about this particular use case for Hadoop. I think your best bet is to look into the JNI documentation about loading native libraries, and go from there. - Aaron

You could also try:
1. Starting the main processing app as a process on the machines -and leaving it running.
2. Having your mapper (somehow) talk to that running process, passing in parameters (including local filesystem filenames) to read and write. You can use RMI or other IPC mechanisms to talk to the long-lived process; see the sketch below.
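A minimal sketch of option 2 over a plain socket (the port, protocol and variable names are all hypothetical):

    // mapper side: hand one unit of work to the long-lived daemon
    Socket s = new Socket("localhost", 9999);
    PrintWriter out = new PrintWriter(s.getOutputStream(), true);
    out.println(localInput + " " + localOutput);   // one-line request
    BufferedReader in = new BufferedReader(
            new InputStreamReader(s.getInputStream()));
    String status = in.readLine();                 // block until the daemon replies
    s.close();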
Re: No route to host prevents from storing files to HDFS
Stas Oskin wrote: Hi. 2009/4/23 Matt Massie: Just for clarity: are you using any type of virtualization (e.g. VMware, Xen) or just running the DataNode java process on the same machine? What is "fs.default.name" set to in your hadoop-site.xml? -- This machine has OpenVZ installed indeed, but all the applications run within the host node, meaning all Java processes are running within the same machine.

Maybe, but there will still be at least one virtual network adapter on the host. Try turning them off.

The fs.default.name is: hdfs://192.168.253.20:8020

What happens if you switch to hostnames instead of IP addresses?
Re: No route to host prevents from storing files to HDFS
Stas Oskin wrote: Hi again. Other tools, like the balancer, or web browsing from the namenode, don't work either, because other nodes complain about not reaching the offending node. I even tried netcat'ing the IP/port from another node -and it successfully connected. Any advice on this "No route to host" error?

"No route to host" generally means machines have routing problems: machine A doesn't know how to route packets to machine B. Reboot everything, router first, and see if it goes away. Otherwise, now is the time to learn to debug routing problems; traceroute is the best starting place.
Re: Error reading task output
Aaron Kimball wrote: Cam, This isn't Hadoop-specific, it's how Linux treats its network configuration. If you look at /etc/host.conf, you'll probably see a line that says "order hosts, bind" -- this is telling Linux's DNS resolution library to first read your /etc/hosts file, then check an external DNS server. You could probably disable local hostfile checking, but that means that every time a program on your system queries the authoritative hostname for "localhost", it'll go out to the network, and you'll probably see a big performance hit. The better solution, I think, is to get your nodes' /etc/hosts files squared away. You only need to do so once :)

I agree about squaring the files away. But no, you don't do it only once: you need to detect whenever the Linux networking stack has decided to add new entries to resolv.conf or /etc/hosts, and detect when they are inappropriate. Which is a tricky thing to do, as there are some cases where you may actually be grateful that someone in the Debian codebase decided that adding the local hostname as 127.0.0.1 is actually a feature. I ended up writing a new SmartFrog component that can be configured to fail to start if the network is a mess, which is something worth pushing out. As part of Hadoop diagnostics, this test would be one of the things to deal with, and at least warn on: "your hostname is local, you will not be visible over the network". -steve
Re: getting DiskErrorException during map
Jim Twensky wrote: Yes, here is how it looks:

hadoop.tmp.dir = /scratch/local/jim/hadoop-${user.name}

so I don't know why it still writes to /tmp. As a temporary workaround, I created a symbolic link from /tmp/hadoop-jim to /scratch/... and it works fine now, but if you think this might be considered a bug, I can report it.

I've encountered this somewhere too; it could be that something is using the Java temp-file API, which is not what you want. Try setting java.io.tmpdir to /scratch/local/tmp, just to see if that makes it go away.
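For the per-task JVMs, that property can ride along in the child options, e.g. setting mapred.child.java.opts to "-Xmx200m -Djava.io.tmpdir=/scratch/local/tmp" (the -Xmx value shown is just the default; adjust to your jobs).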
Re: Error reading task output
Cam Macdonell wrote: Well, for future googlers, I'll answer my own post. Watch out for the hostname at the end of "localhost" lines on slaves: one of my slaves was registering itself as "localhost.localdomain" with the jobtracker. Is there a way that Hadoop could be made less dependent on /etc/hosts and use more dynamic hostname resolution?

DNS is trouble in Java; there are some (outstanding) bug reports/Hadoop patches on the topic, mostly showing up on a machine of mine with a bad hosts entry. I also encountered some fun last month with Ubuntu Linux adding the local hostname to /etc/hosts on the 127.0.0.1 line, which is precisely what you don't want for a cluster of VMs with no DNS at all. This sounds like your problem too, in which case I have shared your pain: http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52
Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
Andrew Newman wrote: They are comparing an indexed system with one that isn't. Why is Hadoop faster at loading than the others? Surely no one would be surprised that it would be slower -I'm surprised at how well Hadoop does. Who wants to write a paper for next year, "grep vs reverse index"? 2009/4/15 Guilherme Germoglio: (Hadoop is used in the benchmarks) http://database.cs.brown.edu/sigmod09/

I think it is interesting, though it misses the point that the reason few datasets are >1PB today is that nobody could afford to store or process the data. With Hadoop, cost is somewhat high (learn to patch the source to fix your cluster's problems) but scales well with the number of nodes. Commodity storage costs (my own home now has >2TB of storage) and commodity software costs are compatible. Some other things to look at:
-power efficiency. I actually think the DBs could come out better
-ease of writing applications by skilled developers. Pig vs SQL
-performance under different workloads (take a set of log files growing continually, mine it in near-real time. I think the last.fm use case would be a good one)
One of the great ironies of SQL is that most developers don't go near it, as it is a detail handled by the O/R mapping engine, except when building SQL selects for web pages. If Pig makes M/R easy, would it be used -and if so, does that show that we developers prefer procedural thinking? -steve
Re: How many people is using Hadoop Streaming ?
Tim Wintle wrote: On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote: 1) I can pick the language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming if they suit the problem better). In fact, I can even choose Erlang for the map() and Prolog for the reduce(). Mixing and matching can optimize things further. Agreed (as someone who has written mappers/reducers in Python, Perl, shell script and Scheme before). sounds like a good argument for adding scripting support for in-JVM MR jobs; use the Java 6 scripting APIs and you get any of the supported languages - JavaScript out of the box, other languages (Jython, Scala) with the right JARs.
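A hedged sketch of what that could look like (hypothetical class, untested; the javax.script API ships with Java 6, and the built-in "JavaScript" engine is Rhino):

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptedMap {
    public static void main(String[] args) throws ScriptException {
        ScriptEngineManager manager = new ScriptEngineManager();
        // Jython, Groovy etc. become available once their JARs are on the classpath
        ScriptEngine engine = manager.getEngineByName("JavaScript");
        // the map logic lives in the script, not in compiled Java
        engine.put("value", "one input line");
        Object result = engine.eval("value.toUpperCase()");
        System.out.println(result);
    }
}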
Re: RPM spec file for 0.19.1
Ian Soboroff wrote: Steve Loughran writes: I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs. Um, what's wrong with that? It's really hard to do good RPM spec files. If Cloudera are willing to pay Matt to do it, not only do I welcome it, but I will see if I can help him with some of the automated test setup they'll need. One thing which would be useful would be to package up all of the hadoop functional tests that need a live cluster as their own JAR, so the test suite could be run against an RPM installation on different virtual OS/JVM combos. I've just hit this problem with my own RPMs on Java 6 (java security related), so I know that having the ability to use the entire existing test suite against an RPM installation would be beneficial (both in my case and for hadoop RPMs)
Re: Amazon Elastic MapReduce
Brian Bockelman wrote: On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote: seems like I should pay additional money, so why not configure a hadoop cluster in EC2 by myself. This has already been automated using scripts. Not everyone has a support team or an operations team or enough time to learn how to do it themselves. You're basically paying for the fact that the only things you need to know to use Hadoop are: 1) Be able to write the Java classes. 2) Press the "go" button on a webpage somewhere. You could use Hadoop with little-to-zero systems knowledge (and without institutional support), which would always make some researchers happy. Brian True, but this way nobody gets the opportunity to learn how to do it themselves, which can be a tactical error one comes to regret further down the line. By learning the pain of cluster management today, you get to keep it under control as your data grows. I am curious what bug patches AWS will supply, for they have been very silent on their hadoop work to date.
Re: Using HDFS to serve www requests
Snehal Nagmote wrote: can you please explain what exactly adding an NIO bridge means and how it can be done, and what the advantages would be in this case? NIO: Java non-blocking IO. It offers a standard API to talk to different filesystems; support has been discussed in JIRA. If the DFS APIs were accessible under an NIO front end, then applications written for the NIO APIs would work with the supported filesystems, with no need to code specifically for hadoop's not-yet-stable APIs Steve Loughran wrote: Edward Capriolo wrote: It is a little more natural to connect to HDFS from apache tomcat. This will allow you to skip the FUSE mounts and just use the HDFS-API. I have modified this code to run inside tomcat. http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample I will not testify to how well this setup will perform under internet traffic, but it does work. If someone adds an NIO bridge to hadoop filesystems then it would be easier; leaving you only with the performance issues.
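Until such a bridge exists, serving HDFS content means going through the Hadoop FileSystem API directly, along the lines of the wiki example above. A rough, untested sketch (hypothetical namenode URL and path):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000/"); // adjust to your cluster
        FileSystem fs = FileSystem.get(conf);
        InputStream in = fs.open(new Path("/user/web/index.html"));
        try {
            // copy the bytes to wherever they need to go (servlet response, file, ...)
            byte[] buf = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buf)) > 0) {
                System.out.write(buf, 0, bytesRead);
            }
        } finally {
            in.close();
        }
    }
}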
Re: RPM spec file for 0.19.1
Christophe Bisciglia wrote: Hey Ian, we are totally fine with this - the only reason we didn't contribute the SPEC file is that it is the output of our internal build system, and we don't have the bandwidth to properly maintain multiple RPMs. That said, we chatted about this a bit today, and were wondering if the community would like us to host RPMs for all releases in our "devel" repository. We can't stand behind these from a reliability angle the same way we can with our "blessed" RPMs, but it's a manageable amount of additional work to have our build system spit those out as well. I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs. At the same time, I think we do need to make it possible/easy to do RPMs *and have them consistent*. If hadoop-core makes RPMs that don't work with your settings RPMs, you get to field the support calls. -steve
Re: RPM spec file for 0.19.1
Ian Soboroff wrote: I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615) with a spec file for building a 0.19.1 RPM. I like the idea of Cloudera's RPM file very much. In particular, it has nifty /etc/init.d scripts and RPM is nice for managing updates. However, it's for an older, patched version of Hadoop. This spec file is actually just Cloudera's, with suitable edits. The spec file does not contain an explicit license... if Cloudera have strong feelings about it, let me know and I'll pull the JIRA attachment. The JIRA includes instructions on how to roll the RPMs yourself. I would have attached the SRPM but they're too big for JIRA. I can offer noarch RPMs built with this spec file if someone wants to host them. Ian
-RPM and deb packaging would be nice
-the .spec file should be driven by ant properties to get dependencies from the ivy files
-the JDK requirements are too harsh, as it should run on OpenJDK's JRE or JRockit; no need for Sun only. Too bad the only way to say that is to leave off all JDK dependencies.
-I worry about how they patch the rc.d files. I can see why, but wonder what that does to the RPM ownership
As someone whose software does get released as RPMs (and tar files containing everything needed to create your own), I can state from experience that RPMs are very hard to get right, and very hard to test. The hardest thing to get right (and to test) is live update of the RPMs while the app is running. I am happy for the cloudera team to have taken on this problem.
Re: Typical hardware configurations
Scott Carey wrote: On 3/30/09 4:41 AM, "Steve Loughran" wrote: Ryan Rawson wrote: You should also be getting 64-bit systems and running a 64 bit distro on it and a jvm that has -d64 available. For the namenode yes. For the others, you will take a fairly big memory hit (1.5X object size) due to the longer pointers. JRockit has special compressed pointers, so will JDK 7, apparently. Sun Java 6 update 14 has "Ordinary Object Pointer" compression as well: -XX:+UseCompressedOops. I've been testing out the pre-release of that with great success. Nice. Have you tried Hadoop with it yet? JRockit has virtually no 64 bit overhead up to 4GB, Sun Java 6u14 has small overhead up to 32GB with the new compression scheme. IBM's VM also has some sort of pointer compression but I don't have experience with it myself. I use the JRockit JVM as it is what our customers use and we need to test on the same JVM. It is interesting in that recursive calls don't ever seem to run out; the way it does its stack doesn't have separate memory spaces for stack, permanent generation heap space and the like. That doesn't mean apps are light: a freshly started IDE consumes more physical memory than a VMware image running XP and Outlook. But it is fairly responsive, which is good for a UI: 2295m 650m 22m S 2 10.9 0:43.80 java 855m 543m 530m S 11 9.1 4:40.40 vmware-vmx http://wikis.sun.com/display/HotSpotInternals/CompressedOops http://blog.juma.me.uk/tag/compressed-oops/ With pointer compression, there may be gains to be had with running 64 bit JVMs smaller than 4GB on x86, since the runtime then has access to native 64 bit integer operations and registers (as well as 2x the register count). It will be highly use-case dependent. that would certainly benefit atomic operations on longs; for floating point math it would be less useful, as JVMs have long made use of the SSE register set for FP work. 64 bit registers would make it easier to move stuff in and out of those registers. I will try and set up a hudson server with this update and see how well it behaves.
Re: Typical hardware configurations
Ryan Rawson wrote: You should also be getting 64-bit systems and running a 64 bit distro on it and a jvm that has -d64 available. For the namenode yes. For the others, you will take a fairly big memory hit (1.5X object size) due to the longer pointers. JRockit has special compressed pointers, so will JDK 7, apparently.
Re: Using HDFS to serve www requests
Edward Capriolo wrote: It is a little more natural to connect to HDFS from apache tomcat. This will allow you to skip the FUSE mounts and just use the HDFS-API. I have modified this code to run inside tomcat. http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample I will not testify to how well this setup will perform under internet traffic, but it does work. If someone adds an NIO bridge to hadoop filesystems then it would be easier; leaving you only with the performance issues.
Re: virtualization with hadoop
Oliver Fischer wrote: Hello Vishal, I did the same some weeks ago. The most important fact is that it works. But it is horribly slow if you do not have enough RAM and multiple disks, since all I/O operations go to the same disk. They may go to separate disks underneath, but performance is bad, as what the virtual OS thinks is a raw hard disk could be a badly fragmented bit of storage on the container OS. Memory is another point of conflict; your VMs will swap out or block other VMs.
0. Keep different VM virtual disks on different physical disks. Fast disks at that.
1. Pre-allocate your virtual disks.
2. Defragment at both the VM and host OS levels.
3. Crank back the schedulers so that the VMs aren't competing too much for CPU time. One core for the host OS, one for each VM.
4. You can keep an eye on performance by looking at the clocks of the various machines: if they pause and get jittery then they are being swapped out.
Using multiple VMs on a single host is OK for testing, but not for hard work. You can use VM images to do work, but you need to have enough physical cores and RAM to match that of the VMs. -steve
Re: JNI and calling Hadoop jar files
jason hadoop wrote: The exception reference to *org.apache.hadoop.hdfs.DistributedFileSystem* implies strongly that a hadoop-default.xml file, or at least a job.xml file, is present. Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the assumption is that the core jar is available. Given the class not found exception, the implication is that the hadoop-0.X.Y-core.jar is not available to JNI. Given the above constraints, the two likely possibilities are that the -core jar is unavailable or damaged, or that somehow the classloader being used does not have access to the -core jar. A possible reason for the jar not being available is that the application is running on a different machine, or as a different user, and the jar is not actually present or perhaps not readable in the expected location. Which way is your JNI: a Java application calling into a native shared library, or a native application calling into a JVM that it instantiates via libjvm calls? Could you dump the classpath that is in effect before your failing JNI call? System.getProperty("java.class.path"), and for that matter "java.library.path", or getenv("CLASSPATH"), and provide an ls -l of the core jar from the classpath, run as the user that owns the process, on the machine that the process is running on. Or something bad is happening with a dependent library of the filesystem that is causing the reflection-based load to fail and die, with the root cause being lost in the process. Sometimes putting an explicit reference to the class you are trying to load is a good way to force the problem to surface earlier, and fail with better error messages.
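A hedged sketch of that probe, to drop in just before the failing JNI call (hypothetical class name, untested):

public class ClasspathDump {
    public static void dump() {
        System.err.println("java.class.path   = " + System.getProperty("java.class.path"));
        System.err.println("java.library.path = " + System.getProperty("java.library.path"));
        System.err.println("CLASSPATH (env)   = " + System.getenv("CLASSPATH"));
        // an explicit reference forces the failure to surface here, with a cleaner trace
        try {
            Class.forName("org.apache.hadoop.hdfs.DistributedFileSystem");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}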
Re: Coordination between Mapper tasks
Stuart White wrote: The nodes in my cluster have 4 cores & 4 GB RAM. So, I've set mapred.tasktracker.map.tasks.maximum to 3 (leaving 1 core for "breathing room"). My process requires a large dictionary of terms (~2GB when loaded into RAM). The terms are looked up very frequently, so I want the terms memory-resident. So, the problem is, I want 3 processes (to utilize CPU), but each process requires ~2GB, and my nodes don't have enough memory for each to have its own copy of the 2GB of data. So, I need to somehow share the 2GB between the processes. What I have currently implemented is a standalone RMI service that, during startup, loads the 2GB dictionaries. My mappers are simply RMI clients that call this RMI service. This works just fine. The only problem is that my standalone RMI service is totally "outside" Hadoop. I have to ssh onto each of the nodes, start/stop/reconfigure the services manually, etc. There's nothing wrong with doing this outside Hadoop; the only problem is that manual deployment is not the way forward.
1. Some kind of javaspace system where you put facts into the t-space and let them all share it.
2. (CofI warning) Use something like SmartFrog's Anubis tuplespace to bring up one (and one only) node with the dictionary application. This may be hard to get started, but it keeps availability high: the Anubis nodes keep track of all other members of the cluster by some heartbeat/election protocol, and can handle failures of the dictionary node by automatically bringing up a new one.
3. Roll your own multicast/voting protocol, so avoiding RMI.
Something scatter/gather style is needed as part of the Apache cloud computing product portfolio, so you could try implementing it; Doug Cutting will probably provide constructive feedback. I haven't played with zookeeper enough to say whether it would work here -steve
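For reference, a hedged sketch of the RMI end of Stuart's current setup - the service interface the mappers would call (hypothetical names; clients would obtain the stub via java.rmi.Naming.lookup):

import java.rmi.Remote;
import java.rmi.RemoteException;

// one 2GB dictionary per node, shared by the three mapper JVMs over RMI
public interface DictionaryService extends Remote {
    // look up a term; returns null if the term is absent
    String lookup(String term) throws RemoteException;
}

The deployment question is then exactly the one above: something has to start, monitor and restart that service on each node, which is where a tuplespace/heartbeat layer earns its keep.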
Re: using virtual slave machines
Karthikeyan V wrote: There is no specific procedure for configuring virtual machine slaves. Make sure the following things are done. I've used these as the beginning of a page on this: http://wiki.apache.org/hadoop/VirtualCluster
Re: Extending ClusterMapReduceTestCase
jason hadoop wrote: I am having trouble reproducing this one. It happened in a very specific environment that pulled in an alternate sax parser. The bottom line is that jetty expects a parser with particular capabilities, and if it doesn't get one, odd things happen. In a day or so I will hopefully have worked out the details, but it has been half a year since I last dealt with this. Unless you are forking to run your junit tests, ant won't let you change the classpath for your unit tests - much chaos will ensue. Even if you fork, unless you set includeantruntime=false you get Ant's classpath, as the junit test listeners are in the ant-optional-junit.jar and you'd better pull them in somehow. I can see why AElfred would cause problems for jetty; it needs to handle web.xml and suchlike, and probably validate them against the schema to reduce support calls.
Re: Persistent HDFS On EC2
Kris Jirapinyo wrote: Why would you lose the locality of storage-per-machine if one EBS volume is mounted to each machine instance? When that machine goes down, you can just restart the instance and re-mount the exact same volume. I've tried this idea before successfully on a 10 node cluster on EC2 and didn't see any adverse performance effects. (I was thinking more of the S3 filesystem, which is remote-ish, with measurable write times.) And actually Amazon claims that EBS I/O should be even better than the instance stores. Assuming the transient filesystems are virtual disks (and not physical disks that get scrubbed, formatted and mounted on every VM instantiation), and also assuming that EBS disks are on a SAN in the same datacentre, this is probably true. Disk IO performance on virtual disks is currently pretty slow, as you are navigating through >1 filesystem and potentially seeking a lot, even on something that appears unfragmented at the VM level. The only concerns I see are that you need to pay for EBS storage regardless of whether you use that storage or not. So, if you have 10 EBS volumes of 1 TB each, and you're just starting out with your cluster so you're only using 50GB on each EBS volume so far for the month, you'd still have to pay for 10TB worth of EBS volumes, and that could be a hefty price each month. Also, EBS volumes currently need to be created in the same availability zone as your instances, so you need to make sure that they are created correctly, as there is no direct migration of EBS to different availability zones. View EBS as renting space in a SAN and it starts to make sense. -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: Persistent HDFS On EC2
Malcolm Matalka wrote: If this is not the correct place to ask Hadoop + EC2 questions, please let me know. I am trying to get a handle on how to use Hadoop on EC2 before committing any money to it. My question is, how do I maintain a persistent HDFS between restarts of instances? Most of the tutorials I have found involve the cluster being wiped once all the instances are shut down, but in my particular case I will be feeding the output of a previous day's run as the input of the current day's run, and this data will get large over time. I see I can use S3 as the file system; would I just create an EBS volume for each instance? What are my options? EBS would cost you more; you'd lose the locality of storage-per-machine. If you stick the output of some runs back into S3 then the next jobs have no locality and higher startup overhead to pull the data down, but you don't pay for that download (just the time it takes).
Re: Extending ClusterMapReduceTestCase
jason hadoop wrote: The other goofy thing is that the xml parser that is commonly first in the classpath validates xml in a way that is opposite to what jetty wants. What does ant -diagnostics say? It will list the XML parser at work. This line in the preamble before the ClusterMapReduceTestCase setup takes care of the xml errors: System.setProperty("javax.xml.parsers.SAXParserFactory","org.apache.xerces.jaxp.SAXParserFactoryImpl"); possibly, though when Ant starts it with the classpath set up for junit runners, I'd expect the xml parser from the ant distro to get in there first, system properties notwithstanding
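One way to see which parser actually wins (untested probe, hypothetical class name); run it under the same fork/classpath settings as the failing test:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class ParserProbe {
    public static void main(String[] args) {
        // prints the implementation classes the JAXP lookup resolves to
        System.out.println("SAX: " + SAXParserFactory.newInstance().getClass().getName());
        System.out.println("DOM: " + DocumentBuilderFactory.newInstance().getClass().getName());
    }
}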
Re: DataNode gets 'stuck', ends up with two DataNode processes
Philip Zeyliger wrote: Very naively looking at the stack traces, a common theme is that there's a call out to "df" to find the system capacity. If you see two data node processes, perhaps the fork/exec to call out to "df" is failing in some strange way. that's deep into Java code; OpenJDK gives you more of that source. One option here is to consider some kind of timeout in the exec, but it's pretty tricky to tack that on around the Java runtime APIs, because the process APIs weren't really designed to be interrupted by other threads. -steve "DataNode: [/hadoop-data/dfs/data]" daemon prio=10 tid=0x002ae2c0d400 nid=0x21cf in Object.wait() [0x42c54000..0x42c54b30] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64) - locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate) at java.lang.UNIXProcess.(UNIXProcess.java:145) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) at org.apache.hadoop.util.Shell.runCommand(Shell.java:149) at org.apache.hadoop.util.Shell.run(Shell.java:134) at org.apache.hadoop.fs.DF.getCapacity(DF.java:63) at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341) at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501) - locked <0x002a9ed97078> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697) at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671) at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105) at java.lang.Thread.run(Thread.java:619) On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury wrote: On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task trackers) I've noticed many datanodes get 'stuck'. The nodes themselves seem fine with no network/memory problems, but in every instance I see two DataNode processes running, and the NameNode logs indicate the datanode in question simply stopped responding. This state persists until I come along and kill the DataNode processes and restart the DataNode on that particular machine. I'm at a loss as to why this is happening, so here's all the relevant information I can think of sharing: hadoop version = 0.19.1-dev, r (we possibly have some custom patches running, but nothing which would affect HDFS that I'm aware of) number of nodes = ~100 HDFS size = ~230TB Java version = OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory respectively I managed to grab a stack dump via "kill -3" from two of these problem instances and threw up the logs at http://cse.unl.edu/~attebury/datanode_problem/. The .log files honestly show nothing out of the ordinary, and having very little Java developing experience the .out files mean nothing to me. It's also worth mentioning that the NameNode logs at the time when these DataNodes got stuck show nothing out of the ordinary either -- just the expected "lost heartbeat from node " message. The DataNode daemon (the original process, not the second mysterious one) continues to respond to web requests like browsing the log directory during this time.
Whenever this happens I've just manually done a "kill -9" to remove the two stuck DataNode processes (I'm not even sure why there are two of them, as under normal operation there's only one). After killing the stuck ones, I simply do a "hadoop-daemon.sh start datanode" and all is normal again. I've not seen any data loss or corruption as a result of this problem. Has anyone seen anything like this happen before? Out of our ~100 node cluster I see this problem around once a day, and it seems to strike random nodes at random times. It happens often enough that I would be happy to do additional debugging if anyone can tell me how. I'm not a developer at all, so I'm at the end of my knowledge on how to solve this problem. Thanks for any help! === Garhan Attebury Systems Administrator UNL Research Computing Facility 402-472-7761 === -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
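On the timeout idea above: a rough, untested sketch of what a timed exec looks like on Java 6, where Process has no waitFor-with-timeout, so a watchdog thread has to destroy the child. It illustrates the awkwardness rather than rescuing a fork that is wedged at the OS level (hypothetical class name):

import java.io.IOException;

public class TimedExec {
    public static int run(String[] cmd, final long timeoutMillis)
            throws IOException, InterruptedException {
        final Process p = new ProcessBuilder(cmd).start();
        Thread watchdog = new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(timeoutMillis);
                    p.destroy(); // unblocks the waitFor() below
                } catch (InterruptedException e) {
                    // the process finished in time; nothing to do
                }
            }
        });
        watchdog.setDaemon(true);
        watchdog.start();
        int exit = p.waitFor();
        watchdog.interrupt();
        return exit;
    }
}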
Re: master trying fetch data from slave using "localhost" hostname :)
pavelkolo...@gmail.com wrote: On Fri, 06 Mar 2009 14:41:57 -, jason hadoop wrote: I see that when the host name of the node is also on the localhost line in /etc/hosts I erased all records with "localhost" from all "/etc/hosts" files and all is fine now :) Thank you :) what does /etc/hosts look like now? I hit some problems with ubuntu and localhost last week; the hostname was set up in /etc/hosts not just to point to the loopback address, but to a different loopback address (127.0.1.1) from the normal value (127.0.0.1), so breaking everything. http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52
Re: Running 0.19.2 branch in production before release
Aaron Kimball wrote: I recommend 0.18.3 for production use and avoid the 19 branch entirely. If your priority is stability, then stay a full minor version behind, not just a revision. Of course, if everyone stays that far behind, they don't get to find the bugs for other people.
* If you play with the latest releases early, while they are in the beta phase, you will encounter the problems specific to your applications/datacentres, and get them fixed fast.
* If you work with stuff further back you get stability, but not only are you behind on features, you can't be sure that all the "fixes" that matter to you get pushed back.
* If you plan on making changes, or adding features, get onto SVN_HEAD.
* If you want to catch changes being made that break your site, SVN_HEAD. Better yet, have a private Hudson server checking out SVN_HEAD hadoop *then* building and testing your app against it.
Normally I work with stable releases of things I don't depend on, and SVN_HEAD of OSS stuff whose code I have any intent to change; there is a price (merge time, the odd change breaking your code) but you get to make changes that help you long term. Where Hadoop is different is that it is a filesystem, and you don't want to hit bugs that delete files that matter. I'm only bringing up transient clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All that remains is changing APIs. -Steve
Re: [ANNOUNCE] Hadoop release 0.19.1 available
(Unknown Source) at org.eclipse.core.internal.events.BuildManager$1.run(Unknown Source) at org.eclipse.core.runtime.SafeRunner.run(Unknown Source) at org.eclipse.core.internal.events.BuildManager.basicBuild(Unknown Source) at org.eclipse.core.internal.events.BuildManager.basicBuild(Unknown Source) at org.eclipse.core.internal.events.BuildManager.build(Unknown Source) at org.eclipse.core.internal.resources.Project.internalBuild(Unknown Source) at org.eclipse.core.internal.resources.Project.build(Unknown Source) at org.eclipse.ui.actions.BuildAction.invokeOperation(Unknown Source) at org.eclipse.ui.actions.WorkspaceAction.execute(Unknown Source) at org.eclipse.ui.actions.WorkspaceAction$2.runInWorkspace(Unknown Source) at org.eclipse.core.internal.resources.InternalWorkspaceJob.run(Unknown Source) at org.eclipse.core.internal.jobs.Worker.run(Unknown Source) Caused by: *org.apache.commons.logging.LogConfigurationException*: java.lang.NoClassDefFoundError: org.apache.log4j.Category (Caused by java.lang.NoClassDefFoundError: org.apache.log4j.Category) at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(Unknown Source) at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(Unknown Source) at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(Unknown Source) at org.apache.commons.logging.LogFactory.getLog(Unknown Source) at org.apache.jasper.compiler.Compiler.(Unknown Source) at java.lang.J9VMInternals.initializeImpl(*Native Method*) ... 54 more Caused by: java.lang.NoClassDefFoundError: org.apache.log4j.Category at java.lang.J9VMInternals.verifyImpl(*Native Method*) at java.lang.J9VMInternals.verify(Unknown Source) at java.lang.J9VMInternals.initialize(Unknown Source) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(*Native Method*) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) ... 60 more Caused by: *java.lang.ClassNotFoundException*: org.apache.log4j.Category at java.lang.Throwable.(Unknown Source) at java.lang.ClassNotFoundException.(Unknown Source) at org.eclipse.osgi.framework.internal.core.BundleLoader.findClassInternal(Unknown Source) at org.eclipse.osgi.framework.internal.core.BundleLoader.findClass(Unknown Source) at org.eclipse.osgi.framework.internal.core.BundleLoader.findClass(Unknown Source) at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) ... 67 more
Re: [ANNOUNCE] Hadoop release 0.19.1 available
Aviad sela wrote: Nigel, thanks, I have extracted the new project. However, I am having problems building the project. I am using Eclipse 3.4 and ant 1.7. I receive an error compiling core classes: *compile-core-classes*: BUILD FAILED *D:\Work\AviadWork\workspace\cur\WSAD\Hadoop_Core_19_1\Hadoop\build.xml:302: java.lang.ExceptionInInitializerError* it points to the webxml tag Try an ant -verbose and post the full log, we may be able to look at the problem more. Also, run an ant -diagnostics and include what it prints -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: How does NVidia GPU compare to Hadoop/MapReduce
Dan Zinngrabe wrote: On Fri, Feb 27, 2009 at 11:21 AM, Doug Cutting wrote: I think they're complementary. Hadoop's MapReduce lets you run computations on up to thousands of computers potentially processing petabytes of data. It gets data from the grid to your computation, reliably stores output back to the grid, and supports grid-global computations (e.g., sorting). CUDA can make computations on a single computer run faster by using its GPU. It does not handle co-ordination of multiple computers, e.g., the flow of data in and out of a distributed filesystem, distributed reliability, global computations, etc. So you might use CUDA within mapreduce to more efficiently run compute-intensive tasks over petabytes of data. Doug I actually did some work with this several months ago, using a consumer-level NVIDIA card. I found a couple of interesting things:
- I used JOGL and OpenGL shaders rather than CUDA, as at least at the time there was no reasonable way to talk to CUDA through Java. That made a number of things more complicated; CUDA certainly makes things simpler. For the particular problem I was working with, GLSL was fine, though CUDA would have simplified things.
- The problem set I was working with involved creating and searching large amounts of hashes - 3-4 TB of them at a time.
- Only 2 of my nodes in an 8 node cluster had accelerators, but they had a dramatic effect on performance. I do not have any of my test results handy, but for this particular problem the accelerators cut the job time in half or more.
that's interesting, as it means the power budget of the overall workload ought to be less I would agree with Doug that the two are complementary, though there are some similarities. Working with the GPU means you are limited by how much texture memory is available for storage (compared to HDFS, not much!), and the cost of getting data on and off the card can be high. Like many hadoop jobs, the overhead of getting data in and starting a task can easily be greater than the length of the task itself. For what I was doing, it was a good fit - but for many, many problems it would not be the right solution. Yes, and you will need more disk IO capacity per node if each node is capable of more computation, unless you have very CPU-intensive workloads
Re: HDFS architecture based on GFS?
kang_min82 wrote: Hello Matei, Which TaskTracker did you mean here? I don't understand that. In general we have many TaskTrackers and each of them runs on one separate Datanode. Why doesn't the JobTracker talk directly to the Namenode for a list of Datanodes and then perform the MapReduce tasks there?
1. There's no requirement for a 1:1 mapping of task-trackers to datanodes. You could bring up TTs on any machine with spare CPU cycles on your network, talking to a long lived filesystem built from a few datanodes.
2. There's no requirement for HDFS. You could have a cluster of MapReduce nodes talking to other filesystems. Locality of data helps, but is not needed.
3. Layering makes for cleaner code.
Re: the question about the common pc?
Tim Wintle wrote: On Fri, 2009-02-20 at 13:07 +, Steve Loughran wrote: I've been doing MapReduce work over small in-memory datasets using Erlang, which works very well in such a context. I've got some (mainly python) scripts (that will probably be run with hadoop streaming eventually) that I run over multiple cpus/cores on a single machine by opening the appropriate number of named pipes and using tee and awk to split the workload, something like:
mkfifo mypipe1
mkfifo mypipe2
awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1&
awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2&
./get_lots_of_data | tee mypipe1 > mypipe2
(wait until it's done... or send a signal from the "get_lots_of_data" process on completion if it's a cronjob)
sort -m map_out* | ./reducer > reduce_out
It works around the global interpreter lock in python quite nicely, and doesn't need the people who write the scripts (who may not be programmers) to understand multiple processes etc, just stdin and stdout. Dumbo provides py support under Hadoop: http://wiki.github.com/klbostee/dumbo https://issues.apache.org/jira/browse/HADOOP-4304 as well as that, given Hadoop is java1.6+, there's no reason why it couldn't support the javax.script engine, with JavaScript working without extra JAR files, and Groovy and Jython once their JARs were stuck on the classpath. Some work would probably be needed to make it easier to use these languages, and then there are the tests...
Re: How to use Hadoop API to submit job?
Wu Wei wrote: Hi, I used to submit Hadoop jobs with the utility RunJar.main() on hadoop 0.18. On hadoop 0.19, because the commandLineConfig of JobClient was null, I got a NullPointerException error when RunJar.main() calls GenericOptionsParser to get libJars (0.18 didn't do this call). I also tried the class JobShell to submit jobs, but it catches all exceptions and sends them to stderr, so that I can't handle the exceptions myself. I noticed that if I could call JobClient's setCommandLineConfig method, everything would go easily. But this method has default package accessibility, so I can't see the method outside package org.apache.hadoop.mapred. Any advice on using the Java APIs to submit a job? Wei Looking at my code, the line that does the work is JobClient jc = new JobClient(jobConf); runningJob = jc.submitJob(jobConf); My full (LGPL) code is here: http://tinyurl.com/djk6vj there's more work with validating input and output directories, pulling back the results, handling timeouts if the job doesn't complete, etc. etc., but that's feature creep
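For completeness, an untested sketch of the submit-and-poll pattern around those two lines (hypothetical class name; assumes jobConf is already populated):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class Submitter {
    public static boolean submitAndWait(JobConf jobConf) throws Exception {
        JobClient jc = new JobClient(jobConf);
        RunningJob runningJob = jc.submitJob(jobConf);
        // poll until done; real code wants a timeout and progress reporting here
        while (!runningJob.isComplete()) {
            Thread.sleep(5000);
        }
        return runningJob.isSuccessful();
    }
}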
Re: the question about the common pc?
?? wrote: Actually, there's a widespread misunderstanding of this "Common PC". Common PC doesn't mean PCs which are in daily use; it means the performance of each node can be measured by a common PC's computing power. In fact, we don't use Gb ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those "daily PCs". Hadoop is designed for high performance computing equipment, but is "claimed" to be fit for "daily PCs". Hadoop for PCs? What a joke. Hadoop is designed to build a high throughput dataprocessing infrastructure from commodity PC parts: SATA, not RAID or SAN; x86+linux, not supercomputer hardware and OS. You can bring it up on lighter weight systems, but it has a minimum overhead that is quite steep for small datasets. I've been doing MapReduce work over small in-memory datasets using Erlang, which works very well in such a context.
-you need a good network, with DNS working (fast), good backbone and switches
-the faster your disks, the better your throughput
-ECC memory makes a lot of sense
-you need a good cluster management setup unless you like SSH-ing to 20 boxes to find out which one is playing up
Re: GenericOptionsParser warning
Rasit OZDAS wrote: Hi, There is a JIRA issue about this problem, if I understand it correctly: https://issues.apache.org/jira/browse/HADOOP-3743 Strangely, I searched all the source code, but there exists only this check in 2 places: if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) { LOG.warn("Use GenericOptionsParser for parsing the arguments. " + "Applications should implement Tool for the same."); } Just an if block for logging, no extra checks. Am I missing something? If your class implements Tool, then there shouldn't be a warning. OK, for my automated submission code I'll just set that switch and I won't get told off.
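The cleaner fix is the one the warning asks for: implement Tool and submit through ToolRunner, which runs GenericOptionsParser for you. An untested sketch (hypothetical job class):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MyJob.class);
        // ... set input/output paths, mapper and reducer classes here ...
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJob(), args));
    }
}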
Re: GenericOptionsParser warning
Sandhya E wrote: Hi All, I prepare my JobConf object in a java class, by calling various set APIs on the JobConf object. When I submit the jobconf object using JobClient.runJob(conf), I'm seeing the warning: "Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same". From the hadoop sources it looks like setting "mapred.used.genericoptionsparser" will prevent this warning. But if I set this flag to true, will it have some other side effects? Thanks Sandhya I've seen this message too, and it annoys me; I've not tracked it down
Re: HADOOP-2536 supports Oracle too?
sandhiya wrote: Hi, I'm using postgresql and the driver is not getting detected. How do you run it in the first place? I just typed bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar at the terminal without the quotes. I didn't copy any files from my local drives into the Hadoop file system. I get an error like this: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.Driver and then the complete stack trace. Am I doing something wrong? I downloaded a jar file for postgresql JDBC support and included it in my Libraries folder (I'm using NetBeans). please help JDBC drivers need to be (somehow) loaded before you can resolve the relevant jdbc: URLs; somewhere your code needs to call Class.forName("jdbcdrivername"), where that string is the relevant JDBC driver classname
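A minimal sketch of that preload (hypothetical host, database and credentials; the driver JAR must still be on the task's classpath, e.g. inside the job JAR's lib/ directory):

import java.sql.Connection;
import java.sql.DriverManager;

public class DriverPreload {
    public static Connection connect() throws Exception {
        // force the driver class to load and register itself with DriverManager
        Class.forName("org.postgresql.Driver");
        // hypothetical URL and credentials - substitute your own
        return DriverManager.getConnection(
            "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
    }
}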
Re: HDFS architecture based on GFS?
Amr Awadallah wrote: I didn't understand the usage of "malicious" here, but any process using the HDFS API should first ask the NameNode where the Rasit, Matei is referring to the fact that a malicious piece of code can bypass the NameNode and connect to any datanode directly, or probe all datanodes for that matter. There is no strong authentication for RPC at this layer of HDFS, which is one of the current shortcomings that will be addressed in hadoop 1.0. -- amr This shouldn't be a problem in a locked-down datacentre. Where it is a risk is when you host on EC2 or another VM-hosting service that doesn't set up private VPNs
Re: datanode not being started
Sandy wrote: Since I last used this machine, Parallels Desktop was installed by the admin. I am currently suspecting that somehow this is interfering with the function of Hadoop (though JAVA_HOME still seems to be OK). Has anyone had any experience with this being a cause of interference? It could have added >1 virtual network adapter, and hadoop is starting on the wrong adapter. I don't think Hadoop handles this situation that well (yet), as you
-need to be able to specify the adapter for every node
-need to get rid of the notion of "I have a hostname" and move to "every network adapter has its own hostname"
I haven't done enough experiments to be sure. I do know that if you start a datanode with IP addresses for the filesystem, it works out the hostname and then complains if anyone tries to talk to it using the same IP address URL it booted with. -steve
Re: stable version
Anum Ali wrote: The parser problem is related to jar files; it can be resolved and is not a bug. Forwarding a link for its solution: http://www.jroller.com/navanee/entry/unsupportedoperationexception_this_parser_does_not this site is down; can't see it. It is a bug, because I view all operations problems as defects to be opened in the bug tracker, stack traces stuck in, the problem resolved. That's software or hardware, because that issue DB is your searchable history of what went wrong. Given that on my system I was seeing a ClassNotFoundException for loading FSConstants, there was no easy way to work out what went wrong, and it cost me a couple of days' work. Furthermore, in the OSS world, every person who can't get your app to work is either going to walk away unhappy (=lost customer, lost developer and risk they compete with you), or they are going to get on the email list and ask questions, questions which may get answered, but it will cost them time. Hence
* happyaxis.jsp: axis' diagnostics page, prints out useful stuff and warns if it knows it is unwell (and returns a 500 error code so your monitoring tools can recognise this)
* ant -diagnostics: detailed look at your ant system including xml parser experiments.
Good open source tools have to be easy for people to get started with, and that means helpful error messages. If we left the code alone, knowing that the cause of a ClassNotFoundException was the fault of the user sticking the wrong XML parser on the classpath, and yet refused to add the four lines of code needed to handle this, then we would be letting down the users
Re: stable version
Anum Ali wrote: This only occurs in Linux; in Windows it's fine. do a java -version for me, and an ant -diagnostics, and stick both on the bugrep https://issues.apache.org/jira/browse/HADOOP-5254 It may be that XInclude only went live in java1.6u5; I'm running a JRockit JVM which predates that and I'm seeing it (linux again); I will also try sticking xerces on the classpath to see what happens next
Re: stable version
Anum Ali wrote: yes On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran wrote: Anum Ali wrote: I am working on Hadoop SVN version 0.21.0-dev. Having some problems regarding running its examples/files from Eclipse. It gives an error: Exception in thread "main" java.lang.UnsupportedOperationException: This parser does not support specification "null" version "null" at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone resolve this or give some idea about it? You are using Java6, correct? I'm seeing this too, filed the bug https://issues.apache.org/jira/browse/HADOOP-5254 Any stack traces you can add on will help. Probable cause is https://issues.apache.org/jira/browse/HADOOP-4944
Re: Namenode not listening for remote connections to port 9000
Michael Lynch wrote: Hi, As far as I can tell I've followed the setup instructions for a hadoop cluster to the letter, but I find that the datanodes can't connect to the namenode on port 9000 because it is only listening for connections from localhost. In my case, the namenode is called centos1, and the datanode is called centos2. They are centos 5.1 servers with an unmodified sun java 6 runtime. fs.default.name takes a URL to the filesystem, such as hdfs://centos1:9000/. If the machine is only binding to localhost, that may mean DNS fun. Try a fully qualified name instead
Re: stable version
Anum Ali wrote: yes On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran wrote: Anum Ali wrote: I am working on Hadoop SVN version 0.21.0-dev. Having some problems regarding running its examples/files from Eclipse. It gives an error: Exception in thread "main" java.lang.UnsupportedOperationException: This parser does not support specification "null" version "null" at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone resolve this or give some idea about it? You are using Java6, correct? well, in that case something being passed down to setXIncludeAware may be picked up as invalid. More of a stack trace may help. Otherwise, now is your chance to learn your way around the hadoop codebase, and ensure that when the next version ships, your most pressing bugs have been fixed
Re: Best practices on spliltting an input line?
Stefan Podkowinski wrote: I'm currently using OpenCSV, which can be found at http://opencsv.sourceforge.net/ but I haven't done any performance tests on it yet. In my case simply splitting strings would not work anyway, since I need to handle quotes and separators within quoted values, e.g. "a","a,b","c". I've used it in the past; found it pretty reliable. Again, no perf tests, just reading in CSV files exported from spreadsheets
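For the record, a small hedged example of the quoted-separator case (untested; assumes the opencsv JAR on the classpath):

import java.io.StringReader;
import au.com.bytecode.opencsv.CSVReader;

public class CsvSplit {
    public static void main(String[] args) throws Exception {
        // "a","a,b","c" - the embedded comma is what plain String.split gets wrong
        CSVReader reader = new CSVReader(new StringReader("\"a\",\"a,b\",\"c\""));
        String[] fields = reader.readNext(); // ["a", "a,b", "c"]
        for (String field : fields) {
            System.out.println(field);
        }
        reader.close();
    }
}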
Re: stable version
Anum Ali wrote: I am working on Hadoop SVN version 0.21.0-dev. Having some problems regarding running its examples/files from Eclipse. It gives an error: Exception in thread "main" java.lang.UnsupportedOperationException: This parser does not support specification "null" version "null" at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone resolve this or give some idea about it? You are using Java6, correct?
Re: File Transfer Rates
Brian Bockelman wrote: Just to toss out some numbers (and because our users are making interesting numbers right now) Here's our external network router: http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets Here's the application-level transfer graph: http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska In a squeeze, we can move 20-50TB / day to/from other heterogeneous sites. Usually, we run out of free space before we can find the upper limit for a 24-hour period. We use a protocol called GridFTP to move data back and forth between external (non-HDFS) clusters. The other sites we transfer with use niche software you probably haven't heard of (Castor, DPM, and dCache) because, well, it's niche software. I have no available data on HDFS<->S3 systems, but I'd again claim it's mostly a function of the amount of hardware you throw at it and the size of your network pipes. There are currently 182 datanodes; 180 are "traditional" ones of <3TB and 2 are big honking RAID arrays of 40TB. Transfers are load-balanced amongst ~7 GridFTP servers which each have a 1Gbps connection. GridFTP is optimised for high bandwidth network connections, with negotiated packet sizes and multiple TCP connections, so when a dropped packet triggers TCP's congestion backoff, only a fraction of the transfer slows down. It is probably best-in-class for long haul transfers over the big university backbones where someone else pays for your traffic. You would be very hard pressed to get even close to that on any other protocol. I have no data on S3 transfers other than hearsay:
* write time to S3 can be slow, as it doesn't return until the data is persisted "somewhere". That's a better guarantee than a posix write operation.
* you have to rely on other people on your rack not wanting all the traffic for themselves. That's an EC2 API issue: you don't get to request/buy bandwidth to/from S3
One thing to remember is that if you bring up a Hadoop cluster on any virtual server farm, disk IO is going to be way below physical IO rates. Even when the data is in HDFS, it will be slower to get at than dedicated high-RPM SCSI or SATA storage.
Re: anybody knows an apache-license-compatible impl of Integer.parseInt?
Zheng Shao wrote: We need to implement a version of Integer.parseInt/atoi over byte[] instead of String, to avoid the high cost of creating a String object. I wanted to take the OpenJDK code but the license is GPL: http://www.docjar.com/html/api/java/lang/Integer.java.html Does anybody know of an implementation that I can use for Hive (Apache license)? I also need to do it for Byte, Short, Long, and Double. Just don't want to go over all the corner cases. Use the Apache Harmony code http://svn.apache.org/viewvc/harmony/enhanced/classlib/branches/java6/modules/
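For illustration, an untested sketch of the int case; it accumulates negatively so Integer.MIN_VALUE parses, but overflow detection and radix/whitespace handling are deliberately omitted - exactly the corner cases Zheng mentions:

public class Atoi {
    public static int parseInt(byte[] buf, int offset, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty input");
        }
        int i = offset;
        int end = offset + length;
        boolean negative = buf[i] == '-';
        if (negative) {
            i++;
        }
        int result = 0;
        for (; i < end; i++) {
            byte b = buf[i];
            if (b < '0' || b > '9') {
                throw new NumberFormatException("bad digit at offset " + i);
            }
            // accumulate as a negative number so Integer.MIN_VALUE does not overflow
            result = result * 10 - (b - '0');
        }
        return negative ? result : -result;
    }
}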
Re: Backing up HDFS?
Allen Wittenauer wrote: On 2/9/09 4:41 PM, "Amandeep Khurana" wrote: Why would you want to have another backup beyond HDFS? HDFS itself replicates your data, so the reliability of the system shouldn't be a concern (if at all it is)... I'm reminded of a previous job where a site administrator refused to make tape backups (despite our continual harassment and pointing out that he was in violation of the contract) because he said RAID was "good enough". Then the RAID controller failed. When we couldn't recover data "from the other mirror" he was fired. Not sure how they ever recovered, esp. considering what the data was they lost. Hopefully they had a paper trail. I hope that wasn't at SUNW, given they do their own controllers.
1. Controller failure is lethal, especially if you don't notice for a while.
2. Some products (say, databases) didn't like live updates, so a trick evolved of splitting off part of the RAID array and putting that to tape. Of course, then there's the problem of what happens there.
3. Tape is still very power efficient; good for a bulk off-site store (or a local fire-safe).
4. Over at last.fm, they had an accidental rm / on their primary dataset. Fortunately they did apparently have another copy somewhere else.
And now that HDFS has user IDs, you can prevent anyone but the admin team from accidentally deleting everyone's data.