Re: graphical tool for hadoop mapreduce
Some people at Sun have done some recent work on this -- see a blog post at http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_and_performance, and a subsequent post with more detail at http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_monitoring_scripts . Kevin On Thu, Jun 25, 2009 at 7:28 PM, Manhee Jo j...@nttdocomo.com wrote: Hi, Do you know any graphical tools to show the progress of mapreduce using the job log under logs/history/ ? The web interface (namenode:50030) gives me similar one. But what I need is more specific ones that show the number of total running map tasks and reduce tasks at some points of time, which I've seen from some papers. Any help would be appreciated. Thanks, Manhee
Re: Pregel
On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote: my guess, as good as anybody's, is that Pregel is to large graphs what Hadoop is to large datasets. I think it is much more likely a language that allows you to easily define fixed point algorithms. I would imagine a distributed version of something similar to Michal Young's GenSet. http://portal.acm.org/citation.cfm?doid=586094.586108 I've been trying to figure out how to justify working on a project like that for a couple of years, but haven't yet. (I have a background in program static analysis, so I've implemented similar stuff.) In other words, Pregel is the next natural step for massively scalable computations after Hadoop. I wonder if it uses map/reduce as a base or not. It would be easier to use map/reduce, but a direct implementation would be more performant. In either case, it is a new hammer. From what I see, it likely won't replace map/reduce, pig, or hive; but rather support a different class of applications much more directly than you can under map/reduce. -- Owen
Re: What is the best way to use the Hadoop output data
Can anybody help me with this? :) On Thu, Jun 25, 2009 at 5:02 PM, Huy Phan dac...@gmail.com wrote: Hi everybody, I'm working on a hadoop project that processes log files. In the reduce part, as usual, I store the output to HDFS, but I also want to send that output data to a message queue using an HTTP POST request. I'm wondering if there's any performance killer in this approach; I posted the question to the IRC channel and someone told me there may be a bottleneck. I then thought about running a cron task to get the output data and send it to the MQ, but I'm not sure that's the best way because it's not synchronized with the MapReduce process. Is there any way to spawn a process directly from Hadoop after all the MapReduce tasks finish?
Re: What is the best way to use the Hadoop output data
Hi Huy, On Thu, Jun 25, 2009 at 6:02 PM, Huy Phan dac...@gmail.com wrote: I'm wondering if there's any performance killer in this approach, I posted the question to IRC channel and someone told me that there may be a bottleneck. Communication errors while posting your output data could block your MapReduce job, so I think it's better to do this after the job is done. I wonder if there is any way to spawn a process directly from Hadoop after all the MapReduce tasks finish ? How do you submit your jobs? You can make job submission block by calling job.waitForCompletion(true) in your main driver class, and then post to the message queue once it returns. That keeps the two steps synchronous. -- Zhong Wang
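To make that suggestion concrete, here is a minimal sketch (0.20-style API) of a driver that blocks on the job and only then fires the HTTP POST; the class name, queue URL, payload and output path are made up for illustration, and real error handling and the payload format would depend on your message queue:

  // Sketch only: block on job completion, then notify a hypothetical message queue.
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class LogJobDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "log-processing");
      // ... set mapper, reducer, input and output paths here ...

      // Blocks until the job finishes, so nothing is posted for failed jobs.
      boolean ok = job.waitForCompletion(true);

      if (ok) {
        // Tell the queue that fresh output is sitting in HDFS.
        URL mq = new URL("http://mq.example.com/notify");       // hypothetical endpoint
        HttpURLConnection conn = (HttpURLConnection) mq.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        OutputStream out = conn.getOutputStream();
        out.write("output=/user/huy/job-output".getBytes("UTF-8"));
        out.close();
        System.out.println("MQ responded: " + conn.getResponseCode());
      }
    }
  }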
Performance hit by not splitting .bz2?
Hi! I have a case where we need to analyse logfiles. They are currently compressed using bzip2, and an example logfile is roughly 105Mb compressed, 720Mb uncompressed. I'm considering using a Hadoop version with .bz2 support - probably Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are not split. I expect that for most jobs, the number of log files will exceed the number of cores in my hadoop cluster. Is it possible to estimate if I'll get a performance hit because of the lack of splitting under these circumstances? Thanks, \EF -- Erik Forsberg forsb...@opera.com Developer, Opera Mini - http://www.opera.com/mini/
Re: Performance hit by not splitting .bz2?
Hi Erik, On Fri, Jun 26, 2009 at 4:24 PM, Erik Forsberg forsb...@opera.com wrote: I'm considering using a Hadoop version with .bz2 support - probably Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are not split. Yes. bzip2-compressed files are not splittable in current versions; splitting support may be introduced in a future version. You may be interested in this patch: https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel. I expect that for most jobs, the number of log files will exceed the number of cores in my hadoop cluster. Is it possible to estimate if I'll get a performance hit because of the lack of splitting under these circumstances? Because the bzip2 files are not split, each ~720MB file becomes a single map input split handled by one mapper. Even though the number of your log files may exceed the number of cores in your cluster, such large splits will hurt load balancing. -- Zhong Wang
Re: Pregel
According to my understanding, Pregel is in the same layer as MR, not an MR-based language processor. I think the 'Collective Communication' of BSP is the core of the problem. For example, this BFS problem (http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html) could be solved in one pass, without MR iterations. On Fri, Jun 26, 2009 at 3:17 PM, Owen O'Malley omal...@apache.org wrote: On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote: my guess, as good as anybody's, is that Pregel is to large graphs what Hadoop is to large datasets. I think it is much more likely a language that allows you to easily define fixed point algorithms. I would imagine a distributed version of something similar to Michal Young's GenSet. http://portal.acm.org/citation.cfm?doid=586094.586108 I've been trying to figure out how to justify working on a project like that for a couple of years, but haven't yet. (I have a background in program static analysis, so I've implemented similar stuff.) In other words, Pregel is the next natural step for massively scalable computations after Hadoop. I wonder if it uses map/reduce as a base or not. It would be easier to use map/reduce, but a direct implementation would be more performant. In either case, it is a new hammer. From what I see, it likely won't replace map/reduce, pig, or hive; but rather support a different class of applications much more directly than you can under map/reduce. -- Owen -- Best Regards, Edward J. Yoon @ NHN, corp. edwardy...@apache.org http://blog.udanax.org
Re: Doing MapReduce over Har files
I also need help with this. I need to know how to handle a HAR file when it is the input to a MapReduce task. How do we read the HAR file so we can work on the individual logical files? I suppose we need to create our own InputFormat and RecordReader files, but I'm not sure how to proceed. Julian Roshan James-3 wrote: When I run a map reduce task over a har file as the input, I see that the input splits refer to 64mb byte boundaries inside the part file. My mappers only know how to process the contents of each logical file inside the har file. Is there some way by which I can take the offset range specified by the input split and determine which logical files lie in that offset range? (How else would one do map reduce over a har file?) Roshan -- View this message in context: http://www.nabble.com/Doing-MapReduce-over-Har-files-tp24171216p24217500.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Permissions needed to run RandomWriter ?
Hi, I've just installed a new test cluster and I'm trying to give it a quick smoke test with RandomWriter and Sort. I can run these fine with the superuser account. When I try to run them as another user I run into problems even though I've created the output directory and given permissions to the other user to write to this directory. i.e. 1. smulc...@hadoop01:~$ hadoop fs -mkdir /foo mkdir: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=smulcahy, access=WRITE, inode=:hadoop:supergroup:rwxr-xr-x OK - we don't have permissions anyways 2. had...@hadoop01:/$ hadoop fs -mkdir /foo OK 3. hadoop fs -chown -R smulcahy /foo OK 4. smulc...@hadoop01:~$ hadoop fs -mkdir /foo/test OK 5. smulc...@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar randomwriter /foo java.io.IOException: Permission denied at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.checkAndCreate(File.java:1704) at java.io.File.createTempFile(File.java:1793) at org.apache.hadoop.util.RunJar.main(RunJar.java:115) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Any suggestions on why step 5. is failing even though I have write permissions to /foo - do I need permissions on some other directory also or ... ? Thanks, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.iehttp://webstar.deri.iehttp://sindice.com
Re: PIG and Hadoop
Hi Krishna, pig-u...@hadoop.apache.org is the right mailing list for pig questions. I assume your jar file has an embedded pig script. If your jar file includes pig.jar you don't need to specify pig.jar separately. Assuming that the class called YourClass has the main function, the command would look like - java -cp YourJar.jar:$HADOOPSITEPATH YourClass See the example in - http://hadoop.apache.org/pig/docs/r0.2.0/quickstart.html -Thejas On 6/25/09 10:13 PM, krishna prasanna svk_prasa...@yahoo.com wrote: Hi, Here is my scenario: 1. I have a cluster of 3 machines, 2. I have a jar file which includes PIG.jar. How can I run a jar (instead of a PIG script file) in Hadoop mode? For running a script file in hadoop mode: java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig Any suggestions/pointers please? Apologies if I posted to the wrong alias. Thanks, Krishna.
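For reference, a minimal sketch of what such an embedded-Pig driver class might look like, using the PigServer API; the queries, paths and class name are invented, and the exact PigServer constructor arguments can differ between Pig releases, so treat this as an outline rather than the official recipe:

  // Sketch of a driver class with an embedded Pig script (hypothetical names and paths).
  import java.io.IOException;

  import org.apache.pig.PigServer;

  public class YourClass {
    public static void main(String[] args) throws IOException {
      // "mapreduce" runs against the cluster described by the hadoop-site.xml
      // on the classpath ($HADOOPSITEPATH in the command line above).
      PigServer pig = new PigServer("mapreduce");

      // The Pig Latin that would otherwise live in script1-hadoop.pig.
      pig.registerQuery("raw = LOAD 'input/logs';");
      pig.registerQuery("grouped = GROUP raw ALL;");
      pig.registerQuery("counted = FOREACH grouped GENERATE COUNT(raw);");

      // Writes the result of the alias 'counted' to HDFS.
      pig.store("counted", "output/line-count");
    }
  }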
Re: About fuse-dfs and NFS
Hey Chris, FUSE in general does not support NFS mounts well because it has a tendency to renumber inodes upon NFS restart, which causes clients to choke. FUSE-DFS supports a limited range of write operations; it's possible that your application is trying to use write functionality that is not supported. Brian On Jun 26, 2009, at 2:57 AM, XuChris wrote: Hi, I mount hdfs on a directory of localhost with fuse-dfs, and then export that directory. When accessing the directory over NFS, I can read data from it, but I cannot write data to it. Why? I would like to know whether fuse-dfs supports write operations over NFS or not. Can anyone help me? Thank you very much. My system configuration: OS: Fedora release 8 (kernel 2.6.23.1). For NFS, the fuse module has been updated to 2.7.4. Fuse: 2.7.4, Hadoop: 0.19.1. Best regards, Chris 2009-6-26
Re: graphical tool for hadoop mapreduce
Although it may not support your specific need for log files, I just happened to run across this link today and thought it was relevant for a thread about GUI tools for Hadoop: http://www.hadoopstudio.org/ It's a plugin for working visually with Hadoop in NetBeans. The page describes it as an alpha release, and while I haven't tried it out yet, the screenshot at least looks very promising. On Thu, Jun 25, 2009 at 9:28 PM, Manhee Joj...@nttdocomo.com wrote: Do you know any graphical tools to show the progress of mapreduce using the job log under logs/history/ ? The web interface (namenode:50030) gives me similar one. But what I need is more specific ones that show the number of total running map tasks and reduce tasks at some points of time, which I've seen from some papers. Any help would be appreciated. -- Tom Wheeler http://www.tomwheeler.com/
Error while trying to run map/reduce job
Hi All, On one of the test clusters, when I try to launch a map/reduce job it fails with the following error. I am getting the following error in my jobtracker.log on the namenode:
2009-06-26 15:20:12,811 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200906261401_0005_m_01_0' to tip task_200906261401_0005_m_01, for tracker 'tracker_datanode1:localhost/127.0.0.1:33748'
2009-06-26 15:20:14,016 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200906261401_0005_m_01_0: java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
My tasktracker log on datanode1 is reporting the following for the attempt noted above:
2009-06-26 15:20:13,449 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: attempt_200906261401_0005_m_01_0
2009-06-26 15:20:13,700 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200906261401_0005_m_01_0 Child Error java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-06-26 15:20:14,656 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: attempt_200906261401_0005_m_02_0
2009-06-26 15:20:14,811 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200906261401_0005_m_02_0 Child Error java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
There seems to be some problem with the job not being able to start on the datanode(s). I ran hadoop fsck and the system is healthy. I checked the namenode.log and no errors are being reported there either. These errors happen when I submit a job to the cluster. Any clues or comments please? Thanks, Usman
Re: graphical tool for hadoop mapreduce
Tom, this is so much right on time! Bravo, Karmasphere. I installed the plugins, and nothing crashed - in fact, I get the same screens as the manual promises. It is worth reading this group - they released the plugin two days ago. Mark On Fri, Jun 26, 2009 at 10:13 AM, Tom Wheeler tomwh...@gmail.com wrote: Although it may not support your specific need for log files, I just happened to run across this link today and thought it was relevant for a thread about GUI tools for Hadoop: http://www.hadoopstudio.org/ It's a plugin for working visually with Hadoop in NetBeans. The page describes it as an alpha release, and while I haven't tried it out yet, the screenshot at least looks very promising. On Thu, Jun 25, 2009 at 9:28 PM, Manhee Joj...@nttdocomo.com wrote: Do you know any graphical tools to show the progress of mapreduce using the job log under logs/history/ ? The web interface (namenode:50030) gives me similar one. But what I need is more specific ones that show the number of total running map tasks and reduce tasks at some points of time, which I've seen from some papers. Any help would be appreciated. -- Tom Wheeler http://www.tomwheeler.com/
RE: how to read a text file in Map function until reaching specific line
I think the map function gets the line number as the key. You can ignore the other lines after the key value 500. Thanks -Original Message- From: Leiz [mailto:lzhan...@gmail.com] Sent: Friday, June 26, 2009 8:57 AM To: core-user@hadoop.apache.org Subject: how to read a text file in Map function until reaching specific line For example, I have a text file with 1000 lines. I only want to read the first 500 lines of the file. How can I do that in the Map function? Thanks -- View this message in context: http://www.nabble.com/hwo-to-read-a-text-file-in-Map-function-until-reaching -specific-line-tp24222783p24222783.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Doing MapReduce over Har files
Hi Roshan and Julian, The har file system can be used as an input filesystem. You can just provide the input to map reduce as har:///something/some.har , where some.har is your har archive. This way map reduce will use the har filesystem as its input. The only problem is that maps cannot run across logical files in the har. You can specify whatever input format these files have/had before you included them in the har archive. The point is that har:/// can be used as an input filesystem for map reduce, which will give map reduce a view of the logical files inside the har. Hope this helps. mahadev On 6/26/09 2:37 AM, jchernandez jchernan...@agnitio.es wrote: I also need help with this. I need to know how to handle a HAR file when it is the input to a MapReduce task. How do we read the HAR file so we can work on the individual logical files? I suppose we need to create our own InputFormat and RecordReader files, but I'm not sure how to proceed. Julian Roshan James-3 wrote: When I run a map reduce task over a har file as the input, I see that the input splits refer to 64mb byte boundaries inside the part file. My mappers only know how to process the contents of each logical file inside the har file. Is there some way by which I can take the offset range specified by the input split and determine which logical files lie in that offset range? (How else would one do map reduce over a har file?) Roshan
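As a small illustration of the point, a job driver can simply point its input path at the archive through the har:// scheme; the paths, archive name and class below are hypothetical and the mapper/reducer setup is omitted:

  // Sketch: using a har archive as map/reduce input (hypothetical paths).
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class HarInputExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(HarInputExample.class);

      // The archive is addressed with the har:// scheme; the path after the
      // archive name follows the logical directory layout that was archived.
      FileInputFormat.setInputPaths(conf, new Path("har:///user/roshan/logs.har/2009/06"));
      FileOutputFormat.setOutputPath(conf, new Path("/user/roshan/har-output"));

      // Whatever format the files had before archiving still applies.
      conf.setInputFormat(TextInputFormat.class);
      // ... set mapper and reducer classes here ...

      JobClient.runJob(conf);
    }
  }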
Re: how to read a text file in Map function until reaching specific line
The TextInputFormat gives the byte offset in the file as the key and the entire line as the value, so it won't work for you. You can modify NLineInputFormat to achieve what you want. NLineInputFormat gives each mapper N lines (in your case N=500). Since you are interested in only the first 500 lines of each file, the record reader for NLineInputFormat would be implemented as follows: get the input split and check its start position; if the start position is 0, read the first 500 lines; otherwise you have a split from the middle of the file, so don't bother to read anything (the mapper reading from the beginning of the file is already reading the first 500 lines) and just indicate that there is no more input. -Tarandeep On Fri, Jun 26, 2009 at 10:35 AM, Ramakishore Yelamanchilli kyela...@cisco.com wrote: I think the map function gets the line number as the key. You can ignore the other lines after the key value 500. Thanks -Original Message- From: Leiz [mailto:lzhan...@gmail.com] Sent: Friday, June 26, 2009 8:57 AM To: core-user@hadoop.apache.org Subject: how to read a text file in Map function until reaching specific line For example, I have a text file with 1000 lines. I only want to read the first 500 lines of the file. How can I do that in the Map function? Thanks -- View this message in context: http://www.nabble.com/hwo-to-read-a-text-file-in-Map-function-until-reaching -specific-line-tp24222783p24222783.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
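A mapper-side variant of the same idea, sketched under the assumption that the job uses plain FileSplits (so the framework sets the "map.input.start" property) and that the first 500 lines fit inside the first split; the class name is made up and this is not code from the thread:

  // Sketch: emit only the first 500 lines of each file by filtering in the mapper.
  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class First500LinesMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {

    private boolean firstSplitOfFile;
    private int linesSeen;

    public void configure(JobConf job) {
      // For FileSplits this property holds the byte offset of the split in its file.
      firstSplitOfFile = job.getLong("map.input.start", -1L) == 0L;
      linesSeen = 0;
    }

    public void map(LongWritable byteOffset, Text line,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
        throws IOException {
      // Splits that don't start the file are skipped entirely; the first split
      // contributes at most 500 lines.
      if (!firstSplitOfFile || linesSeen >= 500) {
        return;
      }
      linesSeen++;
      output.collect(byteOffset, line);
    }
  }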
Re: Using addCacheArchive
Hi, I've found it much easier to write the file to HDFS use the API, then pass the 'path' to the file in HDFS as a property. You'll need to remember to clean up the file after you're done with it. Example details are in this thread: http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6# Hope this helps, Chris On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 akhilan...@gmail.com wrote: Please ask any questions if I am not clear above about the problem I am facing. Thanks, Akhil akhil1988 wrote: Hi All! I want a directory to be present in the local working directory of the task for which I am using the following statements: DistributedCache.addCacheArchive(new URI(/home/akhil1988/Config.zip), conf); DistributedCache.createSymlink(conf); Here Config is a directory which I have zipped and put at the given location in HDFS I have zipped the directory because the API doc of DistributedCache (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the archive files are unzipped in the local cache directory : DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. So, from my understanding of the API docs I expect that the Config.zip file will be unzipped to Config directory and since I have SymLinked them I can access the directory in the following manner from my map function: FileInputStream fin = new FileInputStream(Config/file1.config); But I get the FileNotFoundException on the execution of this statement. Please let me know where I am going wrong. Thanks, Akhil -- View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
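A minimal sketch of the pattern Chris describes, writing the file into HDFS from the driver and handing its path to the tasks through the JobConf; the property name, paths and classes are invented for illustration:

  // Sketch: ship a config file by copying it to HDFS and passing its path as a property.
  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class ConfigShipping {

    // In the driver: copy the local file into HDFS and record where it went.
    public static void attachConfig(JobConf conf, String localConfig) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      Path target = new Path("/tmp/jobconfig/file1.config");   // hypothetical location
      fs.copyFromLocalFile(new Path(localConfig), target);
      conf.set("myjob.config.path", target.toString());        // hypothetical property name
    }

    // In the mapper: open the file back up from HDFS inside configure().
    public static class MyMapper extends MapReduceBase {
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          Path config = new Path(job.get("myjob.config.path"));
          // Parse the stream as needed; remember to delete the HDFS copy
          // once the whole job is finished.
          fs.open(config).close();
        } catch (IOException e) {
          throw new RuntimeException("Could not load job config", e);
        }
      }
    }
  }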
Error in Cluster Startup: NameNode is not formatted
Hi all, I am a student and I am trying to install the Hadoop on a cluster, I have one machine running namenode, one running jobtracker, two slaves. When I run the /bin/start-dfs.sh , there is something wrong with my namenode, it won't start. Here is the error message in the log file: ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) I think it is something stupid I did, could somebody help me out? Thanks a lot! Sincerely, Boyu Zhang
RE: Permissions needed to run RandomWriter ?
[Apologies for the top-post, sending this from a dodgy webmail client] Hi Alex, My hadoop-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop01:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data1/hadoop-tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data1/hdfs,/data2/hdfs</value>
  </property>
</configuration>
Any comments welcome, -stephen -Original Message- From: Alex Loddengaard [mailto:a...@cloudera.com] Sent: Fri 26/06/2009 18:32 To: core-user@hadoop.apache.org Subject: Re: Permissions needed to run RandomWriter ? Hey Stephen, What does your hadoop-site.xml look like? The Exception is in java.io.UnixFileSystem, which makes me think that you're actually creating and modifying directories on your local file system instead of HDFS. Make sure fs.default.name looks like hdfs://your-namenode.domain.com:PORT. Alex On Fri, Jun 26, 2009 at 4:40 AM, stephen mulcahy stephen.mulc...@deri.org wrote: Hi, I've just installed a new test cluster and I'm trying to give it a quick smoke test with RandomWriter and Sort. I can run these fine with the superuser account. When I try to run them as another user I run into problems even though I've created the output directory and given permissions to the other user to write to this directory. i.e. 1. smulc...@hadoop01:~$ hadoop fs -mkdir /foo mkdir: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=smulcahy, access=WRITE, inode=:hadoop:supergroup:rwxr-xr-x OK - we don't have permissions anyways 2. had...@hadoop01:/$ hadoop fs -mkdir /foo OK 3. hadoop fs -chown -R smulcahy /foo OK 4. smulc...@hadoop01:~$ hadoop fs -mkdir /foo/test OK 5. smulc...@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar randomwriter /foo java.io.IOException: Permission denied at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.checkAndCreate(File.java:1704) at java.io.File.createTempFile(File.java:1793) at org.apache.hadoop.util.RunJar.main(RunJar.java:115) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Any suggestions on why step 5. is failing even though I have write permissions to /foo - do I need permissions on some other directory also or ... ? Thanks, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: Error in Cluster Startup: NameNode is not formatted
Boyu- You didn't do anything stupid. I've forgotten to format a NameNode too myself. If you check the QuickStart guide at http://hadoop.apache.org/core/docs/current/quickstart.html you'll see that formatting the NameNode is the first step of the Execution section (near the bottom of the page). The command to format the NameNode is: hadoop namenode -format A warning though: you should only format your NameNode once. Just like formatting any filesystem, you can lose data if you (re)format. Good luck. -Matt On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote: Hi all, I am a student and I am trying to install the Hadoop on a cluster, I have one machine running namenode, one running jobtracker, two slaves. When I run the /bin/start-dfs.sh , there is something wrong with my namenode, it won't start. Here is the error message in the log file: ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) I think it is something stupid I did, could somebody help me out? Thanks a lot! Sincerely, Boyu Zhang
Re: Pregel
Hello, I don't have a background in CS, but does MS's Dryad ( http://research.microsoft.com/en-us/projects/Dryad/ ) fit in anywhere here? Regards Saptarshi On Fri, Jun 26, 2009 at 5:19 AM, Edward J. Yoonedwardy...@apache.org wrote: According to my understanding, I think the Pregel is in same layer with MR, not a MR based language processor. I think the 'Collective Communication' of BSP seems the core of the problem. For example, this BFS problem (http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html) can be solved at once w/o MR iterations. On Fri, Jun 26, 2009 at 3:17 PM, Owen O'Malleyomal...@apache.org wrote: On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote: my guess, as good as anybody's, is that Pregel is to large graphs is what Hadoop is to large datasets. I think it is much more likely a language that allows you to easily define fixed point algorithms. I would imagine a distributed version of something similar to Michal Young's GenSet. http://portal.acm.org/citation.cfm?doid=586094.586108 I've been trying to figure out how to justify working on a project like that for a couple of years, but haven't yet. (I have a background in program static analysis, so I've implemented similar stuff.) In other words, Pregel is the next natural step for massively scalable computations after Hadoop. I wonder if it uses map/reduce as a base or not. It would be easier to use map/reduce, but a direct implementation would be more performant. In either case, it is a new hammer. From what I see, it likely won't replace map/reduce, pig, or hive; but rather support a different class of applications much more directly than you can under map/reduce. -- Owen -- Best Regards, Edward J. Yoon @ NHN, corp. edwardy...@apache.org http://blog.udanax.org
RE: Error in Cluster Startup: NameNode is not formatted
Matt, Thanks a lot for your reply! I did format the namenode, but I got the same error again. Actually, I successfully ran the example jar file once, but after that one time I couldn't get it to run again. I clean the /tmp dir every time before I format the namenode again (I am just testing it, so I don't worry about losing data :). Still, I get the same error when I execute bin/start-dfs.sh. I checked my conf, and I can't figure out why. Here is my conf file; I would really appreciate it if you could take a look at it. Thanks a lot.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hostname1:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hostname2:9001</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/zhang/hadoop/dfs/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/zhang/hadoop/mapred/local</value>
    <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.</description>
  </property>
</configuration>
-Original Message- From: Matt Massie [mailto:m...@cloudera.com] Sent: Friday, June 26, 2009 4:31 PM To: core-user@hadoop.apache.org Subject: Re: Error in Cluster Startup: NameNode is not formatted Boyu- You didn't do anything stupid. I've forgotten to format a NameNode too myself. If you check the QuickStart guide at http://hadoop.apache.org/core/docs/current/quickstart.html you'll see that formatting the NameNode is the first step of the Execution section (near the bottom of the page). The command to format the NameNode is: hadoop namenode -format A warning though: you should only format your NameNode once. Just like formatting any filesystem, you can lose data if you (re)format. Good luck. -Matt On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote: Hi all, I am a student and I am trying to install the Hadoop on a cluster, I have one machine running namenode, one running jobtracker, two slaves. When I run the /bin/start-dfs.sh , there is something wrong with my namenode, it won't start. Here is the error message in the log file: ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) I think it is something stupid I did, could somebody help me out? Thanks a lot! Sincerely, Boyu Zhang
Re: Error in Cluster Startup: NameNode is not formatted
Sometimes the metadata gets corrupted. Its happened with me on multiple occasions during the initial stages of setting up the cluster. What I did was simply delete the entire directory where the metadata and the actual data is being stored by hdfs. Since I was playing around with the systems and didnt care much about the data, I could do so. If it doesnt spoil anything for you, go ahead and try it. It might work. Secondly, you've specified the dfs.data.dir parameter but havent specified the metadata directory. AFAIK, it will take /tmp as the default. Since /tmp gets cleaned up, you'll lose your metadata and that could be causing the system to not come up. Specify that parameter in the config file. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Jun 26, 2009 at 2:33 PM, Boyu Zhang boyuzhan...@gmail.com wrote: Matt, Thanks a lot for your reply! I did formatted the namenode. But I got the same error again. And actually I successfully run the example jar file once, but after that one time, I couldn't get it run again. I clean the /tmp dir every time before I format namenode again(I am just testing it, so I don't worry about losing data:). Still, I got the same error when I execute the bin/start-dfs.sh . I checked my conf, and I can't figure out why. Here is my conf file: I really appreciate if you could take a look at it. Thanks a lot. configuration property namefs.default.name/name valuehdfs://hostname1:9000/value /property property namemapred.job.tracker/name valuehostname2:9001/value /property property namedfs.data.dir/name value/data/zhang/hadoop/dfs/data/value descriptionDetermines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. /description /property property namemapred.local.dir/name value/data/zhang/hadoop/mapred/local/value descriptionThe local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored. /description /property /configuration -Original Message- From: Matt Massie [mailto:m...@cloudera.com] Sent: Friday, June 26, 2009 4:31 PM To: core-user@hadoop.apache.org Subject: Re: Error in Cluster Startup: NameNode is not formatted Boyu- You didn't do anything stupid. I've forgotten to format a NameNode too myself. If you check the QuickStart guide at http://hadoop.apache.org/core/docs/current/quickstart.html you'll see that formatting the NameNode is the first of the Execution section (near the bottom of the page). The command to format the NameNode is: hadoop namenode -format A warning though, you should only format your NameNode once. Just like formatting any filesystem, you can loss data if you (re)format. Good luck. -Matt On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote: Hi all, I am a student and I am trying to install the Hadoop on a cluster, I have one machine running namenode, one running jobtracker, two slaves. When I run the /bin/start-dfs.sh , there is something wrong with my namenode, it won't start. Here is the error message in the log file: ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. 
at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) I think it is something stupid i did, could somebody help me out? Thanks a lot! Sincerely, Boyu Zhang
Scaling out/up or a mix
Hi. We have a deployment of 10 hadoop servers and I now need more mapping capability (no, not just adding more mappers per instance) since I have so many jobs running. Now I am wondering what I should aim for... memory, cpu or disk... How long is a rope, perhaps you would say? A typical server is currently using about 15-20% cpu today on a quad-core 2.4GHz 8GB RAM machine with 2 RAID1 SATA 500GB disks. Some specs below.
mpstat 2 5
Linux 2.6.24-19-server (mapreduce2)   06/26/2009
11:36:13 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
11:36:15 PM  all  22.82   0.00  3.24     1.37  0.62   2.49    0.00  69.45  8572.50
11:36:17 PM  all  13.56   0.00  1.74     1.99  0.62   2.61    0.00  79.48  8075.50
11:36:19 PM  all  14.32   0.00  2.24     1.12  1.12   2.24    0.00  78.95  9219.00
11:36:21 PM  all  14.71   0.00  0.87     1.62  0.25   1.75    0.00  80.80  8489.50
11:36:23 PM  all  12.69   0.00  0.87     1.24  0.50   0.75    0.00  83.96  5495.00
Average:     all  15.62   0.00  1.79     1.47  0.62   1.97    0.00  78.53  7970.30
What I am thinking is... is it wiser to go for many of these cheap boxes with 8GB of RAM, or should I for instance focus on machines which can give more I/O throughput? I know these things are hard to answer, but perhaps someone has already drawn some conclusions the pragmatic way. Kindly //Marcus -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/
Re: Error in Cluster Startup: NameNode is not formatted
The property dfs.name.dir allows you to control where Hadoop writes NameNode metadata. You should have a property like property namedfs.name.dir/name value/data/zhang/hadoop/name/data/value /property to make sure the NameNode data isn't being deleted when you delete the files in /tmp. -Matt On Jun 26, 2009, at 2:33 PM, Boyu Zhang wrote: Matt, Thanks a lot for your reply! I did formatted the namenode. But I got the same error again. And actually I successfully run the example jar file once, but after that one time, I couldn't get it run again. I clean the / tmp dir every time before I format namenode again(I am just testing it, so I don't worry about losing data:). Still, I got the same error when I execute the bin/start-dfs.sh . I checked my conf, and I can't figure out why. Here is my conf file: I really appreciate if you could take a look at it. Thanks a lot. configuration property namefs.default.name/name valuehdfs://hostname1:9000/value /property property namemapred.job.tracker/name valuehostname2:9001/value /property property namedfs.data.dir/name value/data/zhang/hadoop/dfs/data/value descriptionDetermines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. /description /property property namemapred.local.dir/name value/data/zhang/hadoop/mapred/local/value descriptionThe local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored. /description /property /configuration -Original Message- From: Matt Massie [mailto:m...@cloudera.com] Sent: Friday, June 26, 2009 4:31 PM To: core-user@hadoop.apache.org Subject: Re: Error in Cluster Startup: NameNode is not formatted Boyu- You didn't do anything stupid. I've forgotten to format a NameNode too myself. If you check the QuickStart guide at http://hadoop.apache.org/core/docs/current/quickstart.html you'll see that formatting the NameNode is the first of the Execution section (near the bottom of the page). The command to format the NameNode is: hadoop namenode -format A warning though, you should only format your NameNode once. Just like formatting any filesystem, you can loss data if you (re)format. Good luck. -Matt On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote: Hi all, I am a student and I am trying to install the Hadoop on a cluster, I have one machine running namenode, one running jobtracker, two slaves. When I run the /bin/start-dfs.sh , there is something wrong with my namenode, it won't start. Here is the error message in the log file: ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) I think it is something stupid i did, could somebody help me out? 
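Spelled out, the property Matt describes would sit in hadoop-site.xml like this (the path is just the one suggested above; use whatever local directory suits your machines):

  <property>
    <name>dfs.name.dir</name>
    <value>/data/zhang/hadoop/name/data</value>
  </property>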
Thanks a lot! Sincerely, Boyu Zhang
difference between 'hadoop.tmp.dir' and 'mapred.temp.dir'
Hi, Can somebody kindly explain the difference between 'hadoop.tmp.dir' and 'mapred.temp.dir'? I am trying to figure out where the intermediate temporary files are stored for a mapreduce job. Thanks, --umer
Re: Permissions needed to run RandomWriter ?
Have you tried to run the example job as the superuser? It seems like this might be an issue where hadoop.tmp.dir doesn't have the correctly permissions. hadoop.tmp.dir and dfs.data.dir should be owned by the unix user running your Hadoop daemons and owner-writtable and readable. Can you confirm this is the case? Thanks, Alex On Fri, Jun 26, 2009 at 1:29 PM, Mulcahy, Stephen stephen.mulc...@deri.orgwrote: [Apologies for the top-post, sending this from a dodgy webmail client] Hi Alex, My hadoop-site.xml is as follows, ?xml version=1.0? ?xml-stylesheet type=text/xsl href=configuration.xsl? !-- Put site-specific property overrides in this file. -- configuration property namemapred.job.tracker/name valuehadoop01:9001/value /property property namefs.default.name/name valuehdfs://hadoop01:9000/value /property property namehadoop.tmp.dir/name value/data1/hadoop-tmp//value /property property namedfs.data.dir/name value/data1/hdfs,/data2/hdfs/value /property /configuration Any comments welcome, -stephen -Original Message- From: Alex Loddengaard [mailto:a...@cloudera.com] Sent: Fri 26/06/2009 18:32 To: core-user@hadoop.apache.org Subject: Re: Permissions needed to run RandomWriter ? Hey Stephen, What does your hadoop-site.xml look like? The Exception is in java.io.UnixFileSystem, which makes me think that you're actually creating and modifying directories on your local file system instead of HDFS. Make sure fs.default.name looks like hdfs://your-namenode.domain.com:PORT. Alex On Fri, Jun 26, 2009 at 4:40 AM, stephen mulcahy stephen.mulc...@deri.orgwrote: Hi, I've just installed a new test cluster and I'm trying to give it a quick smoke test with RandomWriter and Sort. I can run these fine with the superuser account. When I try to run them as another user I run into problems even though I've created the output directory and given permissions to the other user to write to this directory. i.e. 1. smulc...@hadoop01:~$ hadoop fs -mkdir /foo mkdir: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=smulcahy, access=WRITE, inode=:hadoop:supergroup:rwxr-xr-x OK - we don't have permissions anyways 2. had...@hadoop01:/$ hadoop fs -mkdir /foo OK 3. hadoop fs -chown -R smulcahy /foo OK 4. smulc...@hadoop01:~$ hadoop fs -mkdir /foo/test OK 5. smulc...@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar randomwriter /foo java.io.IOException: Permission denied at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.checkAndCreate(File.java:1704) at java.io.File.createTempFile(File.java:1793) at org.apache.hadoop.util.RunJar.main(RunJar.java:115) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Any suggestions on why step 5. is failing even though I have write permissions to /foo - do I need permissions on some other directory also or ... ? Thanks, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.iehttp://webstar.deri.iehttp://sindice.com
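To check that quickly on a node, something like the following might help (the paths come from the hadoop-site.xml earlier in the thread, with 'hadoop' standing in for whatever unix user actually runs the daemons):

  # Inspect current ownership and permissions of the Hadoop directories
  ls -ld /data1/hadoop-tmp /data1/hdfs /data2/hdfs

  # If they belong to the wrong user, hand them back to the daemon user
  sudo chown -R hadoop:hadoop /data1/hadoop-tmp /data1/hdfs /data2/hdfs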
FileStatus.getLen(): bug in documentation or bug in implementation?
Hi all, I am trying to get the length of a file in hadoop (RawLocalFileSystem or hdfs). The javadoc for org.apache.hadoop.fs.FileStatus.getLen() says that this method returns "the length of this file, in blocks", but the method returns the size in bytes. Is this a bug in the documentation or in the implementation? I use hadoop-0.18.3. Dmitry Rzhevskiy.
Can I post pig questions on this forum?
-- View this message in context: http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Can I post pig questions on this forum?
pig-u...@hadoop.apache.org On Fri, Jun 26, 2009 at 4:34 PM, pmgparmod.me...@gmail.com wrote: -- View this message in context: http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com. -- get hadoop: cloudera.com/hadoop online training: cloudera.com/hadoop-training blog: cloudera.com/blog twitter: twitter.com/cloudera
Hadoop0.20 - Class Not Found exception
I'm getting the following error while starting a MR job: Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297) ... 21 more Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClassInternal(Unknown Source) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Unknown Source) at org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123) at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292) ... 21 more Interestingly, the relevant jar is bundled into the MR job jar and its also there in the $HADOOP_HOME/lib directory. Exactly same thing worked with 0.19.. Not sure what could have changed or I broke to cause this error... Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: FileStatus.getLen(): bug in documentation or bug in implememtation?
Documentation is wrong. Implementation wins. Could you please file a bug. Thanks, --Konstantin Dima Rzhevskiy wrote: Hi all I try get length of file hadoop(RawFilesysten or hdfs) . In javadoc method org.apache.hadoop.fs.FileStatus.getLen() writtend that this method return the length of this file, in blocks But method return size in bytes. Is this bug in documentation or implememtation? I use hadoop-0.18.3. Dmitry Rzhevskiy.
Re: Using addCacheArchive
Thanks Chris for your reply! Well, I could not understand much of what has been discussed on that forum. I am unaware of Cascading. My problem is simple - I want a directory to present in the local working directory of tasks so that I can access it from my map task in the following manner : FileInputStream fin = new FileInputStream(Config/file1.config); where, Config is a directory which contains many files/directories, one of which is file1.config It would be helpful to me if you can tell me what statements to use to distribute a directory to the tasktrackers. The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says that archives are unzipped on the tasktrackers but I want an example of how to use this in case of a dreictory. Thanks, Akhil Chris Curtin-2 wrote: Hi, I've found it much easier to write the file to HDFS use the API, then pass the 'path' to the file in HDFS as a property. You'll need to remember to clean up the file after you're done with it. Example details are in this thread: http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6# Hope this helps, Chris On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 akhilan...@gmail.com wrote: Please ask any questions if I am not clear above about the problem I am facing. Thanks, Akhil akhil1988 wrote: Hi All! I want a directory to be present in the local working directory of the task for which I am using the following statements: DistributedCache.addCacheArchive(new URI(/home/akhil1988/Config.zip), conf); DistributedCache.createSymlink(conf); Here Config is a directory which I have zipped and put at the given location in HDFS I have zipped the directory because the API doc of DistributedCache (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the archive files are unzipped in the local cache directory : DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. So, from my understanding of the API docs I expect that the Config.zip file will be unzipped to Config directory and since I have SymLinked them I can access the directory in the following manner from my map function: FileInputStream fin = new FileInputStream(Config/file1.config); But I get the FileNotFoundException on the execution of this statement. Please let me know where I am going wrong. Thanks, Akhil -- View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
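Not an authoritative answer, but the usual pattern for shipping an archive looks roughly like the sketch below: the zip is first copied into HDFS, and the URI fragment after '#' names the symlink that appears in the task's working directory. The paths are made up, and whether file1.config then shows up as Config/file1.config or one level deeper depends on how the zip was built:

  // Sketch: distributing a zipped directory with the DistributedCache (hypothetical paths).
  import java.net.URI;

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  public class CacheArchiveExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(CacheArchiveExample.class);

      // Copy the archive into HDFS first; addCacheArchive takes a path in HDFS.
      FileSystem fs = FileSystem.get(conf);
      fs.copyFromLocalFile(new Path("/home/akhil1988/Config.zip"),
                           new Path("/user/akhil1988/Config.zip"));

      // '#Config' asks for a symlink named "Config" in the task's working
      // directory, pointing at the unpacked archive.
      DistributedCache.addCacheArchive(new URI("/user/akhil1988/Config.zip#Config"), conf);
      DistributedCache.createSymlink(conf);

      // ... set input/output/mapper and submit the job as usual ...
    }
  }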
Re: Can I post pig questions on this forum?
pig-u...@hadoop.apache.org is the right place for pig questions. Alan. On Jun 26, 2009, at 4:34 PM, pmg wrote: -- View this message in context: http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Hadoop0.20 - Class Not Found exception
I ran into the same problem, and resolved it by passing a class to the JobConf constructor. If you create a new JobConf, you should pass a class to it. 2009/6/27 Amandeep Khurana ama...@gmail.com I'm getting the following error while starting a MR job: Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297) ... 21 more Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClassInternal(Unknown Source) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Unknown Source) at org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123) at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292) ... 21 more Interestingly, the relevant jar is bundled into the MR job jar and its also there in the $HADOOP_HOME/lib directory. Exactly same thing worked with 0.19.. Not sure what could have changed or I broke to cause this error... Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
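A sketch of what that might look like for this job; DBConfiguration.configureDB is how the JDBC driver class and connection string get handed to DBInputFormat, while passing a class to the JobConf constructor tells Hadoop which jar to ship. The connection details are placeholders, and the Oracle JDBC jar still has to be visible to the tasks (for example inside the job jar's lib/ directory):

  // Sketch: give JobConf a class so Hadoop can locate the job jar, and register
  // the JDBC driver with DBInputFormat (placeholder connection info).
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.db.DBConfiguration;
  import org.apache.hadoop.mapred.lib.db.DBInputFormat;

  public class OracleImportJob {
    public static JobConf createConf() {
      JobConf conf = new JobConf(OracleImportJob.class);
      conf.setInputFormat(DBInputFormat.class);

      DBConfiguration.configureDB(conf,
          "oracle.jdbc.driver.OracleDriver",
          "jdbc:oracle:thin:@dbhost:1521:ORCL",    // placeholder URL
          "scott", "tiger");                       // placeholder credentials

      // ... set mapper, output path, etc. and submit with JobClient.runJob(conf) ...
      return conf;
    }
  }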
Re: Scaling out/up or a mix
Hey Marcus, Are you recording the data rates coming out of HDFS? Since you have such a low CPU utilizations, I'd look at boxes utterly packed with big hard drives (also, why are you using RAID1 for Hadoop??). You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays. Based on the data rates you see, make the call. On the other hand, what's the argument against running 3x more mappers per box? It seems that your boxes still have more overhead to use -- there's no I/O wait. Brian On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote: Hi. We have a deployment of 10 hadoop servers and I now need more mapping capability (no not just add more mappers per instance) since I have so many jobs running. Now I am wondering what I should aim on... Memory, cpu or disk... How long is a rope perhaps you would say ? A typical server is currently using about 15-20% cpu today on a quad- core 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks. Some specs below. mpstat 2 5 Linux 2.6.24-19-server (mapreduce2) 06/26/2009 11:36:13 PM CPU %user %nice%sys %iowait%irq %soft %steal %idleintr/s 11:36:15 PM all 22.820.003.241.370.622.49 0.00 69.45 8572.50 11:36:17 PM all 13.560.001.741.990.622.61 0.00 79.48 8075.50 11:36:19 PM all 14.320.002.241.121.122.24 0.00 78.95 9219.00 11:36:21 PM all 14.710.000.871.620.251.75 0.00 80.80 8489.50 11:36:23 PM all 12.690.000.871.240.500.75 0.00 83.96 5495.00 Average: all 15.620.001.791.470.621.97 0.00 78.53 7970.30 What I am thinking is... Is it wiser to go for many of these cheap boxes with 8GB of RAM or should I for instance focus on machines which can give more I|O throughput ? I know that these things are hard but perhaps someone have draw some conclusions before the pragmatic way. Kindly //Marcus -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/
Map/Reduce Errors
Hi All, I had posted a question earlier regarding some not so intuitive error messages that I was getting on one of the clusters when trying to map/reduce. After many hours of googling :) i found a post that solved my problem. http://www.mail-archive.com/core-user@hadoop.apache.org/msg07202.html. One of our engineers ran way too many jobs that created enormous subdirs in $HADOOP_HOME/logs/userlogs. Deleting these subdirs under $HADOOP_HOME/logs/userlogs/ on the datanodes solved the problem. You can also set the cleanup in the hadoop-default.xml file by setting the cleanup time to x hours instead of 24. The specific param is userlogs.retain. Just wanted to share this with you all. Thanks, Usman -- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
HDFS Random Access
All the documentation for HDFS says that it's for large streaming jobs, but I couldn't find an explicit answer to this, so I'll try asking here. How is HDFS's random seek performance within an FSDataInputStream? I use lucene with a lot of indices (potentially thousands), so I was thinking of putting them into HDFS and reimplementing my search as a Hadoop map-reduce. I've noticed that lucene tends to do a bit of random seeking when searching though; I don't believe that it guarantees that all seeks be to increasing file positions either. Would HDFS be a bad fit for an access pattern that involves seeks to random positions within a stream? Also, is getFileStatus the typical way of getting the length of a file in HDFS, or is there some method on FSDataInputStream that I'm not seeing? Please cc: me on any reply; I'm not on the hadoop list. Thanks!
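Not an answer to the performance question, but on the API part: getFileStatus() is the usual way to get the length, and FSDataInputStream allows seeking to arbitrary (including earlier) positions. A small sketch, with a made-up path:

  // Sketch: length lookup plus a couple of non-sequential reads against HDFS.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RandomReadExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path index = new Path("/indices/shard-0042/segments.gen");   // hypothetical file

      long length = fs.getFileStatus(index).getLen();   // length in bytes

      FSDataInputStream in = fs.open(index);
      byte[] buf = new byte[1024];
      // Seeks may go forward or backward, so a Lucene-style access pattern works,
      // but non-sequential reads are typically much more expensive than streaming.
      in.seek(length / 2);
      in.read(buf);
      in.seek(0);
      in.read(buf);
      in.close();
    }
  }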