Re: How to deal with too many fetch failures?

2009-08-20 Thread Ted Dunning
I think that the problem that I am remembering was due to poor recovery from this problem. The underlying fault is likely due to poor connectivity between your machines. Test that all members of your cluster can access all others on all ports used by hadoop. See here for hints:

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
On Thu, Aug 20, 2009 at 6:49 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Thu, Aug 20, 2009 at 7:25 AM, roman kolcun roman.w...@gmail.com wrote: Hello everyone, could anyone please tell me in which class and which method does Hadoop download the file chunk from HDFS and

Re: File Chunk to Map Thread Association

2009-08-20 Thread Harish Mallipeddi
On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun roman.w...@gmail.com wrote: Hello Harish, I know that TaskTracker creates separate threads (up to mapred.tasktracker.map.tasks.maximum) which execute the map() function. However, I haven't found the piece of code which associate FileSplit with

Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

2009-08-20 Thread stephen mulcahy
Hi folks, Sorry to cut across this discussion but I'm experiencing some similar confusion about where to change some parameters. In particular, I'm not entirely clear on how the following should be used - clarification welcome (I'm happy to pull some of this together on a blog once I get

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun roman.w...@gmail.com wrote: Hello Harish, I know that TaskTracker creates separate threads (up to mapred.tasktracker.map.tasks.maximum) which execute

Re: File Chunk to Map Thread Association

2009-08-20 Thread Tom White
Hi Roman, Have a look at CombineFileInputFormat - it might be related to what you are trying to do. Cheers, Tom On Thu, Aug 20, 2009 at 10:59 AM, roman kolcun roman.w...@gmail.com wrote: On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Thu, Aug 20,

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
Thanks Tom, I will have a look at it. Cheers, Roman On Thu, Aug 20, 2009 at 3:02 PM, Tom White t...@cloudera.com wrote: Hi Roman, Have a look at CombineFileInputFormat - it might be related to what you are trying to do. Cheers, Tom On Thu, Aug 20, 2009 at 10:59 AM, roman

Re: syslog-ng and hadoop

2009-08-20 Thread Edward Capriolo
On Wed, Aug 19, 2009 at 11:50 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Mike, Yup.  We find the stock log4j needs two things: 1) Set the rootLogger manually.  The way 0.19.x has the root logger set up breaks when adding new appenders.  I.e., do:

Re: syslog-ng and hadoop

2009-08-20 Thread mike anderson
Yeah, that is interesting Edward. I don't need syslog-ng for any particular reason, other than that I'm familiar with it. If there were another way to get all my logs collated into one log file that would be great. mike On Thu, Aug 20, 2009 at 10:44 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

Re: syslog-ng and hadoop

2009-08-20 Thread Edward Capriolo
On Thu, Aug 20, 2009 at 10:49 AM, mike anderson saidthero...@gmail.com wrote: Yeah, that is interesting Edward. I don't need syslog-ng for any particular reason, other than that I'm familiar with it. If there were another way to get all my logs collated into one log file that would be great.

MR job scheduler

2009-08-20 Thread bharath vissapragada
Hi all, Can anyone tell me how the MR scheduler schedules the MR jobs? How does it decide where to create MAP tasks and how many to create? Once the MAP tasks are over, how does it decide how to move the keys to the reducers efficiently (minimizing the data movement across the network)? Is there any doc

Invalid argument for option USER_DATA_FILE

2009-08-20 Thread Harshit Kumar
Hi When I try to execute *hadoop-ec2 launch-cluster test-cluster 2*, it executes, but keeps waiting at Waiting for instance to start. Find below the exact display as it shows on my screen $ bin/hadoop-ec2 launch-cluster test-cluster 2 Testing for existing master in group: test-cluster Creating

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey
If it always takes a very long time to start transferring data, get a few stack dumps (jstack or kill -e) during this period to see what it is doing during this time. Most likely, the client is doing nothing but waiting on the remote side. On 8/20/09 8:02 AM, Ananth T. Sarathy

Re: submitting multiple small jobs simultaneously

2009-08-20 Thread George Jahad
On Wednesday, August 19, 2009 11:21 Jakob Homan wrote: George- You can certainly submit jobs asynchronously via the JobClient.submitJob() method (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobClient.html). This will return a handle (a
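To illustrate Jakob's point, a minimal sketch of asynchronous submission with the old mapred API (job name and polling interval are illustrative; this needs a configured cluster and the Hadoop jars on the classpath):

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class AsyncSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(AsyncSubmit.class);
        conf.setJobName("small-job-1"); // hypothetical job name

        JobClient client = new JobClient(conf);
        // Unlike JobClient.runJob(), submitJob() returns immediately
        // with a RunningJob handle; more jobs can be submitted meanwhile.
        RunningJob handle = client.submitJob(conf);

        // Poll the handle until the job finishes.
        while (!handle.isComplete()) {
            Thread.sleep(1000);
        }
        System.out.println("success: " + handle.isSuccessful());
    }
}
```

The same pattern repeated in a loop is how several small jobs can be in flight at once.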

Re: How to deal with too many fetch failures?

2009-08-20 Thread Koji Noguchi
Probably unrelated to your problem, but one extreme case I've seen, a user's job with large gzip inputs (non-splittable), 20 mappers 800 reducers. Each map outputted like 20G. Too many reducers were hitting a single node as soon as a mapper finished. I think we tried something like

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey
On 8/20/09 9:48 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote: ok... it seems that's the case. That seems kind of self-defeating though. Ananth T Sarathy Then something is wrong with S3. It may be misconfigured, or just poor performance. I have no experience with S3, but 20 seconds

Re: Using Hadoop with executables and binary data

2009-08-20 Thread Jaliya Ekanayake
Hi Stefan, I am sorry, for the late reply. Somehow the response email has slipped my eyes. Could you explain a bit on how to use Hadoop streaming with binary data formats. I can see, explanations on using it with text data formats, but not for binary files. Thank you, Jaliya Stefan

Re: File Chunk to Map Thread Association

2009-08-20 Thread Ted Dunning
Uhh hadoop already goes to considerable lengths to make sure that computation is local. In my experience it is common for 90% of the map invocations to be working from local data. Hadoop doesn't know about record boundaries so a little bit of slop into a non-local block is possible to finish

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Raghu Angadi
Suresh had made a spreadsheet for memory consumption... will check. A large portion of NN memory is taken by references. I would expect memory savings to be very substantial (same as going from 64-bit to 32-bit), could be on the order of 40%. The last I heard from Sun was that compressed

Re: Location of the source code for the fair scheduler

2009-08-20 Thread Mithila Nagendra
If you go to http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/AllocationConfigurationException.java?view=log it shows many revisions for the source file AllocationConfigurationException.java, so I was wondering which can be used to

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
Hello Ted, I know that Hadoop tries to exploit data locality and it is pretty high. However, the data locality cannot be exploited in the case when 'mapred.min.split.size' is set much higher than the DFS block size - because consecutive blocks are not stored on a single machine. I have found out that
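For reference, the knob Roman mentions is an ordinary job/site property; setting it above the DFS block size makes each split span several blocks, of which at most one can be local to the map task. An illustrative fragment (the value is an example, not a recommendation):

```xml
<property>
  <name>mapred.min.split.size</name>
  <!-- 128 MB: with a 64 MB DFS block size, each split covers two blocks -->
  <value>134217728</value>
</property>
```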

Re: MR job scheduler

2009-08-20 Thread Arun C Murthy
On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote: Hi all, Can anyone tell me how the MR scheduler schedules the MR jobs? How does it decide where to create MAP tasks and how many to create. Once the MAP tasks are over how does it decide to move the keys to the reducer

Re: Location of the source code for the fair scheduler

2009-08-20 Thread Ravi Phulari
Mithila, It depends on which version of Hadoop you want to work on. If you want to work on Hadoop 0.20 then you should check out the Hadoop 0.20 source code. If you want to work on trunk then check out the Hadoop mapreduce source. svn checkout

Re: syslog-ng and hadoop

2009-08-20 Thread mike anderson
I got it working! fantastic. One thing that hung me up for a while was how picky the log4j.properties files are about syntax. For future reference to others, I used this in log4j.properties: # Define the root logger to the system property hadoop.root.logger. log4j.rootLogger=${hadoop.root.logger},
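For others attempting the same, a hedged sketch of a log4j.properties syslog setup (the host and facility below are placeholders, and this is not necessarily Mike's exact config, which is truncated above):

```properties
# Keep the default logger named by hadoop.root.logger and add a syslog appender.
log4j.rootLogger=${hadoop.root.logger},SYSLOG

log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
# Placeholder host: point this at the machine running syslog-ng
log4j.appender.SYSLOG.syslogHost=loghost.example.com
log4j.appender.SYSLOG.facility=LOCAL0
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c: %m%n
```

As the thread notes, log4j is picky about this syntax: the appender name after the comma must match the `log4j.appender.SYSLOG.*` prefix exactly.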

Re: Using Hadoop with executables and binary data

2009-08-20 Thread Aaron Kimball
Look into typed bytes: http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/ On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake jnekanay...@gmail.com wrote: Hi Stefan, I am sorry, for the late reply. Somehow the response email has slipped my eyes. Could you explain a bit on how to
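Once a build with the HADOOP-1722 typed-bytes support is in place, binary streaming is enabled per job with the `-io` option, roughly along these lines (paths, jar location, and mapper/reducer names are placeholders):

```shell
# Streaming job over binary data using typed bytes for the I/O format.
# Requires a Hadoop build that includes the HADOOP-1722 patch.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -io typedbytes \
  -input /data/binary-input \
  -output /data/binary-output \
  -mapper ./my_mapper \
  -reducer ./my_reducer
```

The mapper and reducer then read and write length-prefixed typed-bytes records on stdin/stdout instead of newline-delimited text.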

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Aaron Kimball
Compressed OOPs are available now in 1.6.0u14: https://jdk6.dev.java.net/6uNea.html - Aaron On Thu, Aug 20, 2009 at 10:51 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Suresh had made a spreadsheet for memory consumption.. will check. A large portion of NN memory is taken by references. I
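To try this on a NameNode, the JVM flag can be passed via hadoop-env.sh once the cluster is on 6u14 or later (flag name per Sun's release notes; the actual memory saving on a given NN is the open question in this thread):

```shell
# hadoop-env.sh: enable compressed ordinary object pointers on a 64-bit JVM
export HADOOP_NAMENODE_OPTS="-XX:+UseCompressedOops $HADOOP_NAMENODE_OPTS"
```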

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Scott Carey
On 8/20/09 3:40 AM, Steve Loughran ste...@apache.org wrote: does anyone have any up to date data on the memory consumption per block/file on the NN on a 64-bit JVM with compressed pointers? The best documentation on consumption is http://issues.apache.org/jira/browse/HADOOP-1687 -I'm

Cluster Disk Usage

2009-08-20 Thread Arvind Sharma
Is there a way to find out how much disk space - overall or per Datanode basis - is available before creating a file? I am trying to address an issue where the disk got full (config error) and the client was not able to create a file on the HDFS. I want to be able to check if there is space

Re: Cluster Disk Usage

2009-08-20 Thread Arvind Sharma
Using hadoop-0.19.2 From: Arvind Sharma arvind...@yahoo.com To: common-user@hadoop.apache.org Sent: Thursday, August 20, 2009 3:56:53 PM Subject: Cluster Disk Usage Is there a way to find out how much disk space - overall or per Datanode basis - is available

Re: Cluster Disk Usage

2009-08-20 Thread Arvind Sharma
Sorry, I also sent a direct e-mail to one response there. I asked one question - what is the cost of these APIs? Are they expensive calls? Is the API only going to the NN, which stores this data? Thanks! Arvind From: Arvind Sharma

Writing to a db with DBOutputFormat spits out IOException Error

2009-08-20 Thread ishwar ramani
Hi, I am trying to run a simple map reduce that writes the result from the reducer to a mysql db. I keep getting 09/08/20 15:44:59 INFO mapred.JobClient: Task Id : attempt_200908201210_0013_r_00_0, Status : FAILED java.io.IOException: com.mysql.jdbc.Driver at
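An IOException whose message is just the driver class name usually means the MySQL connector jar is not on the task classpath, so `Class.forName("com.mysql.jdbc.Driver")` fails inside the task JVM. One common fix (jar names here are placeholders) is to ship the driver inside the job jar's lib/ directory:

```shell
# Repack the job jar with the JDBC driver under lib/ so that
# the task JVMs can load com.mysql.jdbc.Driver at runtime.
# File names below are placeholders for your actual jars.
mkdir -p lib
cp /path/to/mysql-connector-java.jar lib/
jar uf myjob.jar lib/mysql-connector-java.jar
```

Copying the connector jar into $HADOOP_HOME/lib on every node is an alternative, at the cost of a cluster-wide change.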

RE: Cluster Disk Usage

2009-08-20 Thread zjffdu
You can use the jobtracker Web UI to view the disk usage. -Original Message- From: Arvind Sharma [mailto:arvind...@yahoo.com] Sent: August 20, 2009 15:57 To: common-user@hadoop.apache.org Subject: Cluster Disk Usage Is there a way to find out how much disk space - overall or per Datanode

RE: MR job scheduler

2009-08-20 Thread zjffdu
Add some details: 1. #map is determined by the block size and the InputFormat (whether you want to split or not). 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and Capacity Scheduler are two other options as far as I know. The JobTracker has the scheduler. 3. Once the map task
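On point 2, switching away from the default FIFO scheduler is a JobTracker configuration change; for example, the fair scheduler contrib is enabled with something like (jar must be on the JobTracker classpath first):

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```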

RE: Cluster Disk Usage

2009-08-20 Thread zjffdu
Arvind, You can use this API to get the amount of file system space used: FileSystem.getUsed(). But I have not found an API for calculating the remaining space. You can write some code to derive it: remaining disk space = total disk space - operating system space - FileSystem.getUsed()
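Alongside the API, the same numbers are available from the command line without any arithmetic: `hadoop dfsadmin -report` prints capacity, DFS used, and DFS remaining, both cluster-wide and per datanode (these commands need a running cluster):

```shell
# Cluster-wide and per-datanode capacity / used / remaining
hadoop dfsadmin -report

# Space consumed under a given HDFS path
hadoop fs -du /
```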

Re: MR job scheduler

2009-08-20 Thread bharath vissapragada
OK, I'll be a bit more specific. Suppose the map outputs 100 different keys. Consider a key K whose corresponding values may be on N diff datanodes. Consider a datanode D which has the maximum number of values. So instead of moving the values on D to other systems, it is useful to bring in the values

RE: Using Hadoop with executables and binary data

2009-08-20 Thread Jaliya Ekanayake
Thanks for the quick reply. I looked at it, but still could not figure out how to use HDFS to store input data (binary) and call an executable. Please note that I cannot modify the executable. May be I am asking some dumb question, but could you please explain a bit of how to handle the scenario

Exception when starting namenode

2009-08-20 Thread Zheng Lv
Hello, I got these exceptions when I started the cluster, any suggestions? I used hadoop 0.15.2. 2009-08-21 12:12:53,463 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at

RE: passing job arguments as an xml file

2009-08-20 Thread Amogh Vasekar
Hi, GenericOptionsParser is customized only for Hadoop specific params : * codeGenericOptionsParser/code recognizes several standarad command * line arguments, enabling applications to easily specify a namenode, a * jobtracker, additional configuration resources etc. Ideally, all params
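As Amogh notes, GenericOptionsParser consumes only the standard options (-D, -fs, -jt, -conf, -files, ...); everything else comes back from getRemainingArgs() for the application to handle. A minimal sketch (class name is hypothetical; needs the Hadoop jars on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParseArgs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard options such as -D mapred.reduce.tasks=4 are applied
        // to conf; unrecognized arguments are left for the application.
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        String[] appArgs = parser.getRemainingArgs();
        for (String a : appArgs) {
            System.out.println("app arg: " + a);
        }
    }
}
```

So `hadoop jar my.jar ParseArgs -D mapred.reduce.tasks=4 in out` would leave only `in` and `out` in appArgs.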