Re: How to debug a MapReduce application
I am terribly sorry. I made a mistake. This is the output I get:

09/01/19 07:59:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/01/19 07:59:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/01/19 07:59:45 INFO mapred.JobClient: Running job: job_local_0001
09/01/19 07:59:45 INFO mapred.MapTask: numReduceTasks: 1
09/01/19 07:59:45 INFO mapred.MapTask: io.sort.mb = 100
09/01/19 07:59:46 INFO mapred.MapTask: data buffer = 79691776/99614720
09/01/19 07:59:46 INFO mapred.MapTask: record buffer = 262144/327680
09/01/19 07:59:46 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
    at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:504)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:295)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
09/01/19 07:59:46 ERROR memo.MemoAnnotationMerging: Se ha producido un error
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:160)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:160)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

On Mon, Jan 19, 2009 at 8:47 AM, Pedro Vivancos pedro.vivan...@vocali.net wrote:

Thank you very much, but actually I would like to run my application as a standalone one. Anyway, I tried to execute it in pseudo-distributed mode with that setup and this is what I got:

09/01/19 07:45:24 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
09/01/19 07:45:25 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
09/01/19 07:45:26 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 2 time(s).
09/01/19 07:45:27 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 3 time(s).
09/01/19 07:45:28 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 4 time(s).
09/01/19 07:45:29 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 5 time(s).
09/01/19 07:45:30 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 6 time(s).
09/01/19 07:45:31 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 7 time(s).
09/01/19 07:45:32 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 8 time(s).
09/01/19 07:45:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 9 time(s).
java.lang.RuntimeException: java.io.IOException: Call to localhost/127.0.0.1:9000 failed on local exception: Connection refused
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:323)
    at org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(FileOutputFormat.java:118)
    at es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at
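For what it's worth, an NPE at SerializationFactory.getSerializer typically means that no configured serialization accepted the map output key or value class: for example, the class does not implement Writable, or io.serializations lost its default because the job was configured without the standard Hadoop config resources on the classpath. Here is a minimal driver sketch that sets the output classes explicitly; class and path names are placeholders, not taken from this thread.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MergeJobSketch {
  public static void main(String[] args) throws Exception {
    // Loads hadoop-default.xml / hadoop-site.xml, so io.serializations keeps
    // its default (WritableSerialization).
    JobConf conf = new JobConf(MergeJobSketch.class);
    conf.setJobName("annotation-merge-sketch");

    // SerializationFactory throws an NPE when no serialization accepts the
    // key/value class; keeping these as Writable types and setting them
    // explicitly avoids that.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}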
Re: Hadoop 0.17.1 = EOFException reading FSEdits file, what causes this? how to prevent?
I would prefer catching the EOFException in my own code, assuming you are happy with the output before the exception occurs. Hope this helps, Rasit

2009/1/16 Konstantin Shvachko s...@yahoo-inc.com

Joe, it looks like your edits file is corrupted or truncated. Most probably the last modification was not written to it when the name-node was turned off. This may happen if the node crashes, depending on the underlying local file system, I guess. Here are some options for you to consider:
- try an alternative replica of the image directory if you had one.
- try to edit the edits file if you know the internal format.
- try to modify your local copy of the name-node code so that it catches EOFException and ignores it.
- use a checkpointed image if you can afford to lose the latest modifications to the fs.
- formatting is of course the last resort, since you lose everything.
Thanks, --Konstantin

Joe Montanez wrote:

Hi: I'm using Hadoop 0.17.1 and I'm encountering an EOFException reading the FSEdits file. I don't have a clear understanding of what is causing this and how to prevent it. Has anyone seen this and can advise? Thanks in advance, Joe

2009-01-12 22:51:45,573 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:599)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
2009-01-12 22:51:45,574 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:

-- M. Raşit ÖZDAŞ
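The "catch EOFException and ignore it" idea is roughly the following pattern, shown here as a generic, self-contained sketch rather than the actual FSEditLog code (the real loader at FSEditLog.java:599 decodes edit opcodes, not UTF strings):

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class TruncatedLogReader {
  // Reads length-prefixed UTF strings from a log file and stops cleanly at a
  // truncated tail instead of failing, mirroring the "ignore EOFException" idea.
  public static void main(String[] args) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
    int records = 0;
    try {
      while (true) {
        String record = in.readUTF();  // throws EOFException on a partial record or clean end
        records++;
        System.out.println(record);
      }
    } catch (EOFException eof) {
      // A half-written last record lands here; treat it as end-of-log.
      System.err.println("Stopped at end of log after " + records + " complete records");
    } finally {
      in.close();
    }
  }
}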
Re: Calling a mapreduce job from inside another
You can also play with the priority of the jobs to have the innermost job finish first. -Sagar

Devaraj Das wrote:

You can chain job submissions at the client. Also, you can run more than one job in parallel (if you have enough task slots). An example of chaining jobs is in src/examples/org/apache/hadoop/examples/Grep.java, where the grep-search and grep-sort jobs are chained.

On 1/18/09 9:58 AM, Aditya Desai aditya3...@gmail.com wrote:

Is it possible to call a mapreduce job from inside another, and if so, how? Also, is it possible to disable the reducer completely, i.e. end the job as soon as the map phase finishes? I have tried -reducer NONE. I am using the streaming API to code in Python. Regards, Aditya Desai.
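The chaining Devaraj describes is just sequential runJob() calls in the driver, with the output path of one job used as the input path of the next; setting the reducer count to zero gives a map-only job (the Java-side counterpart of what -reducer NONE aims at in streaming). A sketch along those lines, with placeholder paths and the default identity map/reduce classes, assuming nothing about the original jobs:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobsSketch {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // Job 1: map-only (zero reducers), so map output goes straight to the file system.
    JobConf first = new JobConf(ChainedJobsSketch.class);
    first.setJobName("first-pass");
    first.setNumReduceTasks(0);
    first.setOutputKeyClass(LongWritable.class);   // default TextInputFormat keys
    first.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    JobClient.runJob(first);                       // blocks until job 1 finishes

    // Job 2: consumes job 1's output; submitted only after job 1 completes.
    JobConf second = new JobConf(ChainedJobsSketch.class);
    second.setJobName("second-pass");
    second.setOutputKeyClass(LongWritable.class);
    second.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    JobClient.runJob(second);
  }
}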
Hadoop Error Message
Hi friends, could somebody tell me what the following quoted message means?

3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15major+6092226minor)pagefaults 0swaps

The first part is about system usage, but what is the rest? Is it because of the heap size of the program? I am running the hadoop task in standalone mode on almost 250GB of compressed data. This message appears after the task finishes. Thanks in advance, -- - Deepak Diwakar,
Re: Hadoop Error Message
That is a timing / space report. Miles

2009/1/19 Deepak Diwakar ddeepa...@gmail.com:

Hi friends, could somebody tell me what the following quoted message means? 3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15major+6092226minor)pagefaults 0swaps The first part is about system usage, but what is the rest? Is it because of the heap size of the program? I am running the hadoop task in standalone mode on almost 250GB of compressed data. This message appears after the task finishes. Thanks in advance, -- - Deepak Diwakar,

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Windows Support
I recognize that Windows support is, um, limited :-) But any ideas what exactly would need to be changed to support Windows (without cygwin) if someone such as myself were so motivated? The most immediate thing I ran into was the UserGroupInformation, which would need a Windows implementation. I see there is an issue to switch to JAAS too, which may be the proper fix? Are there lots of other things that would need to be changed? I think it may be worth opening a JIRA for Windows support and creating some subtasks for the various issues, even if no one tackles them quite yet. Thanks, Dan -- Dan Diephouse http://netzooid.com/blog
Re: Hadoop Error Message
Thanks, friend.

2009/1/19 Miles Osborne mi...@inf.ed.ac.uk

That is a timing / space report. Miles

2009/1/19 Deepak Diwakar ddeepa...@gmail.com:

Hi friends, could somebody tell me what the following quoted message means? 3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15major+6092226minor)pagefaults 0swaps The first part is about system usage, but what is the rest? Is it because of the heap size of the program? I am running the hadoop task in standalone mode on almost 250GB of compressed data. This message appears after the task finishes. Thanks in advance, -- - Deepak Diwakar,

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

-- - Deepak Diwakar,
Re: Windows Support
On Mon, Jan 19, 2009 at 11:35 AM, Steve Loughran ste...@apache.org wrote:

Dan Diephouse wrote:

I recognize that Windows support is, um, limited :-) But any ideas what exactly would need to be changed to support Windows (without cygwin) if someone such as myself were so motivated? The most immediate thing I ran into was the UserGroupInformation, which would need a Windows implementation. I see there is an issue to switch to JAAS too, which may be the proper fix? Are there lots of other things that would need to be changed? I think it may be worth opening a JIRA for Windows support and creating some subtasks for the various issues, even if no one tackles them quite yet. Thanks, Dan

I think a key one you need to address is motivation. Is cygwin that bad for a piece of server-side code?

No. I guess I was trying to get an idea of how much work it would be. It seems easy enough to supply a WindowsUserGroupInformation class (or a platform-agnostic one). I wondered how many other things like this there were before I put together a patch. It seems like bad Java practice to depend on shell utilities :-). Not very platform agnostic...

Dan -- Dan Diephouse http://netzooid.com/blog
Re: Windows Support
Hey Dan, there is a discussion/issue on this here: https://issues.apache.org/jira/browse/HADOOP-4998 ckw

On Jan 19, 2009, at 8:55 AM, Dan Diephouse wrote:

On Mon, Jan 19, 2009 at 11:35 AM, Steve Loughran ste...@apache.org wrote:

Dan Diephouse wrote:

I recognize that Windows support is, um, limited :-) But any ideas what exactly would need to be changed to support Windows (without cygwin) if someone such as myself were so motivated? The most immediate thing I ran into was the UserGroupInformation, which would need a Windows implementation. I see there is an issue to switch to JAAS too, which may be the proper fix? Are there lots of other things that would need to be changed? I think it may be worth opening a JIRA for Windows support and creating some subtasks for the various issues, even if no one tackles them quite yet. Thanks, Dan

I think a key one you need to address is motivation. Is cygwin that bad for a piece of server-side code?

No. I guess I was trying to get an idea of how much work it would be. It seems easy enough to supply a WindowsUserGroupInformation class (or a platform-agnostic one). I wondered how many other things like this there were before I put together a patch. It seems like bad Java practice to depend on shell utilities :-). Not very platform agnostic...

Dan -- Dan Diephouse http://netzooid.com/blog

-- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
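The "platform agnostic" replacement Dan mentions could start from the JVM's own user property instead of shelling out to whoami/groups. A rough, standalone sketch of that idea only; this is not the actual UserGroupInformation API, whose abstract methods differ across Hadoop versions:

import java.util.Collections;
import java.util.List;

public class PortableUserInfo {
  private final String userName;
  private final List<String> groupNames;

  public PortableUserInfo(String userName, List<String> groupNames) {
    this.userName = userName;
    this.groupNames = groupNames;
  }

  // Works on Windows and Unix alike: no fork of whoami or bash needed.
  public static PortableUserInfo login() {
    String user = System.getProperty("user.name", "unknown");
    // Group membership has no portable JVM equivalent; a real implementation
    // would fall back to JAAS or an OS-specific lookup here.
    return new PortableUserInfo(user, Collections.<String>emptyList());
  }

  public String getUserName() { return userName; }
  public List<String> getGroupNames() { return groupNames; }

  public static void main(String[] args) {
    System.out.println("user = " + login().getUserName());
  }
}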
Java RMI and Hadoop RecordIO
Hi, I've been testing some different serialization techniques to go along with a research project. I know the motivation behind the Hadoop serialization mechanism (e.g. Writable), and the enhancement of this feature through Record I/O, is not only performance but also control of the input/output. Still, I've been running some simple tests and I've found that plain RMI beats Hadoop RecordIO almost every time (14-16% faster). In my test I have a simple Java class that has 14 int fields and 1 long field, and I'm serializing around 35000 instances. Am I doing anything wrong? Are there ways to improve performance in RecordIO? Have I got the use case wrong? Regards, David Alves
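For comparison, a hand-rolled Writable roughly equivalent to the record described (14 ints, 1 long) and a micro-benchmark that loops write()/readFields() over a reused buffer looks like the sketch below; reusing the record and buffer keeps object allocation out of the measurement, which often dominates at this record size. Field names and the timing loop are illustrative, not David's actual test:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Writable;

public class RecordBench {
  // Stand-in for the test record: 14 ints and one long.
  static class SampleRecord implements Writable {
    int[] fields = new int[14];
    long stamp;

    public void write(DataOutput out) throws IOException {
      for (int f : fields) out.writeInt(f);
      out.writeLong(stamp);
    }

    public void readFields(DataInput in) throws IOException {
      for (int i = 0; i < fields.length; i++) fields[i] = in.readInt();
      stamp = in.readLong();
    }
  }

  public static void main(String[] args) throws IOException {
    final int n = 35000;
    SampleRecord rec = new SampleRecord();
    DataOutputBuffer out = new DataOutputBuffer();

    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      rec.stamp = i;
      rec.write(out);                     // serialize; the buffer is reused and grows in place
    }
    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    for (int i = 0; i < n; i++) {
      rec.readFields(in);                 // deserialize from the same bytes
    }
    long elapsedMs = (System.nanoTime() - start) / 1000000L;
    System.out.println(n + " records in " + elapsedMs + " ms, " + out.getLength() + " bytes");
  }
}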
Re: Performance testing
Hi, I am in the process of following your guidelines. I would like to know:
1. How can block size impact the performance of a mapred job?
2. Does the performance improve if I set up the NameNode and JobTracker on different machines? At present, I am running the NameNode and JobTracker on the same machine as Master, interconnected to 2 slave machines running the Datanode and TaskTracker.
3. What should be the replication factor for a 3 node cluster?
4. How does io.sort.mb impact the performance of the cluster?
Thanks, Sandeep

Brian Bockelman wrote:

Hey Sandeep, I'd do a couple of things:
1) Run your test. Do something which will be similar to your actual workflow.
2) Save the resulting Ganglia plots. This will give you a hint as to where things are bottlenecking (memory, CPU, wait I/O).
3) Watch iostat and find out the I/O rates during the test. Compare this to the I/O rates of a known I/O benchmark (i.e., Bonnie++).
4) Finally, watch the logfiles closely. If you start to overload things, you'll usually get a pretty good indication from Hadoop where things go wrong.
Once something does go wrong, *then* look through the parameters to see what can be done. There's about a hundred things which can go wrong between the kernel, the OS, Java, and the application code. It's difficult to make an educated guess beforehand without some hint from the data. Brian

On Dec 31, 2008, at 1:30 AM, Sandeep Dhawan wrote:

Hi Brian, That's what my issue is, i.e. how do I ascertain the bottleneck? In other words, if the results obtained after doing the performance testing are not up to the mark, then how do I find the bottleneck? How can we confidently say that the OS and hardware are the culprits? I understand that using the latest OS and hardware can improve the performance irrespective of the application, but my real worry is "What next?". How can I further increase the performance? What should I look for which can suggest or point to the areas which can be potential problems or hotspots? Thanks for your comments. ~Sandeep~

Brian Bockelman wrote:

Hey Sandeep, I would warn against premature optimization: first, run your test, then see how far from your target you are. Of course, I'd wager you'd find that the hardware you are using is woefully underpowered and that your OS is 5 years old. Brian

On Dec 30, 2008, at 5:57 AM, Sandeep Dhawan wrote:

Hi, I am trying to create a hadoop cluster which can handle 2000 write requests per second. In each write request I would be writing a line of size 1KB to a file. I would be using machines with the following configuration: Platform: Red Hat Linux 9.0, CPU: 2.07 GHz, RAM: 1GB. Can anyone help in giving me some pointers/guidelines as to how to go about setting up such a cluster? What are the configuration parameters in hadoop which we can tweak to enhance the performance of the hadoop cluster? Thanks, Sandeep

-- View this message in context: http://www.nabble.com/Performance-testing-tp21216266p21216266.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
-- View this message in context: http://www.nabble.com/Performance-testing-tp21216266p21228264.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
-- View this message in context: http://www.nabble.com/Performance-testing-tp21216266p21548160.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Hadoop Exceptions
Here are a few hadoop exceptions that I am getting while running a mapred job on 700MB of data on a 3 node cluster on the Windows platform (using cygwin):

1. 2009-01-08 17:54:10,597 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_-4309088198093040326_1001 received exception java.io.IOException: Block blk_-4309088198093040326_1001 is valid, and cannot be written to.
2009-01-08 17:54:10,597 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.120.12.91:50010, storageID=DS-70805886-10.120.12.91-50010-1231381442699, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Block blk_-4309088198093040326_1001 is valid, and cannot be written to.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:921)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:2364)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1218)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1076)
    at java.lang.Thread.run(Thread.java:619)

2. This particular job succeeded. Is it possible that this task was a speculative execution and was killed before it could be started?
Exception in thread main java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2195)

3. 2009-01-15 21:27:13,547 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200901152118_0001_r_00_0 Merge of the inmemory files threw an exception: java.io.IOException: Expecting a line not the end of stream
    at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2105)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)

4. 2009-01-15 21:27:13,547 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 47 files left.
2009-01-15 21:27:13,579 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: attempt_200901152118_0001_r_00_0 The reduce copier failed
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

5. Caused by: java.io.IOException: An established connection was aborted by the software in your host machine ... 12 more

Can anyone help me in giving some pointers to what could be the issue? Thanks, Sandeep

-- View this message in context: http://www.nabble.com/Hadoop-Exceptions-tp21548261p21548261.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Upgrading and patching
Thanks Brian, I have just one more question: when building my own release, where do I enter the version and "compiled by" information? Thanks, Phil

On Fri, Jan 16, 2009 at 6:23 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

Hey Philip, I've found it easier to download the release, apply the patches, and then re-build the release. It's really pleasant to build the release. I suppose it's equivalent to checking it out from SVN. Brian

On Jan 16, 2009, at 1:46 PM, Philip wrote:

Hello All, I'm currently trying to upgrade a hadoop 0.18.0 cluster to 0.19. The wrinkle is that I would like to include https://issues.apache.org/jira/browse/HADOOP-4906 in the build as well. Would it be easier if I downloaded trunk and applied the patch, or is there a branch that I can download with the patch already integrated and install onto my system? Thanks, Philip
Re: Maven repo for Hadoop
On Jan 17, 2009, at 5:53 PM, Chanwit Kaewkasi wrote: I would like to integrate Hadoop to my project using Ivy. Is there any maven repository containing Hadoop jars that I can point my configuration to? Not yet, but soon. We recently introduced ivy into Hadoop, so I believe we'll upload the pom and jar for 0.20.0 when it is released. -- Owen
Re: Performance testing
Hi, see answers inline below. HTH, Jothi

I would like to know:
1. How can block size impact the performance of a mapred job?

From the M/R side, the filesystem block size of the input files is treated as an upper bound for input splits. Since each input split translates into one map, this can affect the actual number of maps for the job.

2. Does the performance improve if I set up the NameNode and JobTracker on different machines? At present, I am running the NameNode and JobTracker on the same machine as Master, interconnected to 2 slave machines running the Datanode and TaskTracker.

Intuitively, it should help. The Namenode is really memory intensive, and the job tracker could also be heavily loaded depending on the number of concurrent jobs running and the number of maps and reducers of these jobs (for scheduling).

3. What should be the replication factor for a 3 node cluster?

I think having a higher replication factor might not increase performance for a 3 node cluster; it might degrade it, if at all, because of the extra replication. If replication is only for performance and not for availability/fault tolerance, you could try setting the replication factor to a smaller number (1?).

4. How does io.sort.mb impact the performance of the cluster?

Look here: http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html

Thanks, Sandeep

Brian Bockelman wrote:

Hey Sandeep, I'd do a couple of things:
1) Run your test. Do something which will be similar to your actual workflow.
2) Save the resulting Ganglia plots. This will give you a hint as to where things are bottlenecking (memory, CPU, wait I/O).
3) Watch iostat and find out the I/O rates during the test. Compare this to the I/O rates of a known I/O benchmark (i.e., Bonnie++).
4) Finally, watch the logfiles closely. If you start to overload things, you'll usually get a pretty good indication from Hadoop where things go wrong.
Once something does go wrong, *then* look through the parameters to see what can be done. There's about a hundred things which can go wrong between the kernel, the OS, Java, and the application code. It's difficult to make an educated guess beforehand without some hint from the data. Brian

On Dec 31, 2008, at 1:30 AM, Sandeep Dhawan wrote:

Hi Brian, That's what my issue is, i.e. how do I ascertain the bottleneck? In other words, if the results obtained after doing the performance testing are not up to the mark, then how do I find the bottleneck? How can we confidently say that the OS and hardware are the culprits? I understand that using the latest OS and hardware can improve the performance irrespective of the application, but my real worry is "What next?". How can I further increase the performance? What should I look for which can suggest or point to the areas which can be potential problems or hotspots? Thanks for your comments. ~Sandeep~

Brian Bockelman wrote:

Hey Sandeep, I would warn against premature optimization: first, run your test, then see how far from your target you are. Of course, I'd wager you'd find that the hardware you are using is woefully underpowered and that your OS is 5 years old. Brian

On Dec 30, 2008, at 5:57 AM, Sandeep Dhawan wrote:

Hi, I am trying to create a hadoop cluster which can handle 2000 write requests per second. In each write request I would be writing a line of size 1KB to a file. I would be using machines with the following configuration: Platform: Red Hat Linux 9.0, CPU: 2.07 GHz, RAM: 1GB. Can anyone help in giving me some pointers/guidelines as to how to go about setting up such a cluster? What are the configuration parameters in hadoop which we can tweak to enhance the performance of the hadoop cluster? Thanks, Sandeep

-- View this message in context: http://www.nabble.com/Performance-testing-tp21216266p21216266.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
-- View this message in context: http://www.nabble.com/Performance-testing-tp21216266p21228264.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
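Several of those knobs can also be exercised per job from the driver rather than cluster-wide, which makes it easy to compare settings. A sketch with purely illustrative values; io.sort.mb, dfs.replication and the split size are the parameters discussed above, and nothing here comes from Sandeep's actual configuration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TuningSketch.class);
    conf.setJobName("tuning-sketch");

    // Map-side sort buffer: larger values mean fewer spills, at the cost of heap.
    conf.setInt("io.sort.mb", 200);

    // Replication for files this job writes; on a 3-node cluster a smaller
    // value trades fault tolerance for less replication traffic.
    conf.setInt("dfs.replication", 2);

    // The HDFS block size of the input caps the split size, and therefore the
    // number of maps; mapred.min.split.size can push the split size back up.
    conf.setLong("mapred.min.split.size", 128 * 1024 * 1024L);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}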
Distributed Key-Value Databases
Hey y'all, there've been a few questions about distributed database solutions (a partial list: HBase, Voldemort, Memcached, ThruDB, CouchDB, Ringo, Scalaris, Kai, Dynomite, Cassandra, Hypertable, as well as the closed Dynamo, BigTable, SimpleDB). For someone using Hadoop at scale, what problem aspects would recommend one of those over another? And in your subjective judgement, do any of these seem especially likely to succeed?

Richard Jones of Last.fm just posted an overview with a great deal of engineering insight: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ His focus is a production web server farm, and so in some ways orthogonal to the crowd here -- but still highly recommended. Swaroop CH of Yahoo wrote a broad introduction to distributed DBs I also found useful: http://www.swaroopch.com/notes/Distributed_Storage_Systems

Both give HBase short shrift, though my impression is that it is the leader among open projects for massive unordered dataset problems. The answer also, though, doesn't seem to be a simple "If you're using Hadoop you should be using HBase, dummy." I don't have the expertise to write this kind of overview from the hadoop / big data perspective, but would eagerly read such an article from someone who does, or to summarize the insights of the list.

===

In lieu of such a summary, pointers to a few relevant threads:
* http://www.nabble.com/Why-is-scaling-HBase-much-simpler-then-scaling-a-relational-db--tt18869660.html#a19093685 (especially Jonathan Gray's breakdown)
* HBase Performance: http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02540.html (and the paper by Stonebraker and friends: http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf)
* http://www.nabble.com/Serving-contents-of-large-MapFiles-SequenceFiles-from-memory-across-many-machines-tt19546012.html#a19574917
* On specific problem domains:
http://www.nabble.com/Indexed-Hashtables-tt21470024.html#a21470848
http://www.nabble.com/Why-can%27t-Hadoop-be-used-for-online-applications---tt19461962.html#a19471894
http://www.nabble.com/Architecture-question.-tt21100766.html#a21100766

flip

(noted in passing: a huge proportion of the development seems to be coming out of commercial enterprises and not the academic/HPC community. I worry my ivory tower is hung up on big iron and the top500.org list, at the expense of solving the many interesting problems these unlock.)

-- http://www.infochimps.org Connected Open Free Data
hadoop balancing data
Why do we not use the Remaining % in place of the Used % when we are selecting a datanode for new data and when running the balancer? From what I can tell, we are using the Used % and we do not factor in non-DFS Used at all. I see a datanode with only a 60GB hard drive fill up completely to 100% before the other servers that have 130+GB hard drives get half full. It seems like trying to keep the same % free on the drives in the cluster would be more optimal in production. I know this still may not be perfect, but it would be nice if we tried. Billy
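A worked example of the difference, using made-up numbers in the spirit of the 60GB-vs-130GB case (none of these figures come from a real cluster): once non-DFS usage is significant, ranking nodes by DFS Used % and ranking them by Remaining % can pick different targets.

public class PlacementMetricExample {
  // capacity, dfsUsed and nonDfsUsed are in GB; purely illustrative numbers.
  static void report(String name, double capacity, double dfsUsed, double nonDfsUsed) {
    double usedPct = 100.0 * dfsUsed / capacity;                              // the metric the thread says is used today
    double remainingPct = 100.0 * (capacity - dfsUsed - nonDfsUsed) / capacity; // the metric proposed above
    System.out.printf("%-8s used%%=%5.1f  remaining%%=%5.1f%n", name, usedPct, remainingPct);
  }

  public static void main(String[] args) {
    // The small node looks emptier by Used % but has far less real room left.
    report("small", 60, 20, 25);   // 33.3% used, 25.0% remaining
    report("large", 130, 50, 5);   // 38.5% used, 57.7% remaining
  }
}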