reducer outofmemoryerror
Hi, I have a 4-node Hadoop 0.15.3 cluster. I am using the default config files. I am running a map/reduce job to process 40 GB of log data. Some reduce tasks are failing with the following errors:

1) stderr:

Exception in thread "org.apache.hadoop.io.ObjectWritable Connection Culler"
Exception in thread "[EMAIL PROTECTED]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "IPC Client connection to /127.0.0.1:34691" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

2) stderr:

Exception in thread "org.apache.hadoop.io.ObjectWritable Connection Culler" java.lang.OutOfMemoryError: Java heap space

syslog:

2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask: task_200804212359_0007_r_04_0 Merge of the 19 files in InMemoryFileSystem complete. Local file is /data/hadoop-im2/mapred/local/task_200804212359_0007_r_04_0/map_22600.out
2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:162)
    at java.io.FilterInputStream.read(FilterInputStream.java:111)
    at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
    at java.io.DataInputStream.readInt(DataInputStream.java:353)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
2008-04-22 20:34:16,032 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.lang.OutOfMemoryError: Java heap space
2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.lang.OutOfMemoryError: Java heap space

Has anyone experienced a similar problem? Is there any configuration change that can help resolve this issue?

Regards,
aj
RE: Can Hadoop process pictures and videos?
Thanks. Do you have any pointers to how you would normally approach this? Is Hadoop Streaming the solution I am looking for? Many thanks again.

Roland

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: 22. april 2008 22:54
To: core-user@hadoop.apache.org
Subject: Re: Can Hadoop process pictures and videos?

Yes, you can. One issue is typically that Linux-based video codecs are not as numerous as Windows-based codecs, so you may be a bit limited as to what kinds of video you can process. Also, most video processing and transcoding is embarrassingly parallel at the file level, with little need for map-reduce. That may make Hadoop less useful than it might otherwise be. On the other hand, Hadoop does expose URLs from which you can read file data, so you might be just as happy using that.

On 4/22/08 1:48 PM, Roland Rabben [EMAIL PROTECTED] wrote:

Hi. Sorry for my ignorance, but I am trying to understand if I can use Hadoop and Map/Reduce to process video files and images. Encoding and transcoding videos is an example of what I would like to do. Thank you for your patience.

Regards, Roland
Re: Can Hadoop process pictures and videos?
On Apr 22, 2008, at 11:44 PM, Roland Rabben wrote:

Thanks. Do you have any pointers to how you would normally approach this? Is Hadoop Streaming the solution I am looking for?

Probably not, given that Hadoop Streaming doesn't work well with binary data. It is probably easiest to use the C++ interface. Look at the examples in src/examples/pipes.

-- Owen
User accounts in Master and Slaves
After trying out Hadoop on a single machine, I decided to run a MapReduce job across multiple machines. This is the approach I followed:

1 Master
1 Slave

(A doubt here: can my master also be used to execute the map/reduce functions?)

To do this, I set up the masters and slaves files in the conf directory. Following the instructions in this page - http://hadoop.apache.org/core/docs/current/cluster_setup.html - I had set up sshd on both machines and was able to ssh from one to the other. I tried to run bin/start-dfs.sh. Unfortunately, this asked for a password for [EMAIL PROTECTED], while on the slave there was only user2, and on the master user1 was the logged-on user. How do I resolve this? Should the same user accounts be present on all the machines? Or can I specify this somewhere?
Re: reducer outofmemoryerror
Memory settings are in conf/hadoop-default.xml. You can override them in conf/hadoop-site.xml. Specifically, I think you would want to change mapred.child.java.opts.

On Wed, Apr 23, 2008 at 2:40 PM, Apurva Jadhav [EMAIL PROTECTED] wrote:

Hi, I have a 4-node Hadoop 0.15.3 cluster. I am using the default config files. I am running a map/reduce job to process 40 GB of log data. Some reduce tasks are failing with the following errors: [snip: stderr and syslog excerpts quoted in full above] Has anyone experienced a similar problem? Is there any configuration change that can help resolve this issue? Regards, aj

--
Harish Mallipeddi
circos.com : poundbang.in/blog/
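A minimal sketch of that override from job-submission code, as an alternative to editing conf/hadoop-site.xml (the driver class name and the -Xmx value here are arbitrary examples, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    public class ChildHeapExample {
      public static JobConf configure() {
        JobConf conf = new JobConf(ChildHeapExample.class);
        // Equivalent to overriding mapred.child.java.opts in hadoop-site.xml;
        // the default of this era is -Xmx200m, so 512m is just an example.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
      }
    }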
Re: User accounts in Master and Slaves
On Wed, Apr 23, 2008 at 3:03 PM, Sridhar Raman [EMAIL PROTECTED] wrote:

After trying out Hadoop on a single machine, I decided to run a MapReduce job across multiple machines. This is the approach I followed: 1 Master, 1 Slave. (A doubt here: can my master also be used to execute the map/reduce functions?)

If you add the master node to the list of slaves (conf/slaves), then the master node will also run a TaskTracker.

To do this, I set up the masters and slaves files in the conf directory. [snip: rest of the original message quoted in full above]

--
Harish Mallipeddi
circos.com : poundbang.in/blog/
Re: reducer outofmemoryerror
Apurva Jadhav wrote:

Hi, I have a 4-node Hadoop 0.15.3 cluster. I am using the default config files. I am running a map/reduce job to process 40 GB of log data.

How many maps and reducers are there? Make sure that there is a sufficient number of reducers. Look at conf/hadoop-default.xml (see the mapred.child.java.opts parameter) to change the heap settings.

Amar

Some reduce tasks are failing with the following errors: [snip: stderr and syslog excerpts quoted in full above] Has anyone experienced a similar problem? Is there any configuration change that can help resolve this issue? Regards, aj
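Along the same lines, a sketch of raising the reducer count programmatically (JobConf.setNumReduceTasks is the standard call; 8 is an arbitrary example for a small cluster):

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountExample {
      public static JobConf configure() {
        JobConf conf = new JobConf(ReducerCountExample.class);
        // More reducers means each one buffers and merges less map output,
        // which can relieve heap pressure during the shuffle.
        conf.setNumReduceTasks(8);
        return conf;
      }
    }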
Re: User accounts in Master and Slaves
OK, what about the issue regarding the users? Do all the machines need to be under the same user?

On Wed, Apr 23, 2008 at 12:43 PM, Harish Mallipeddi [EMAIL PROTECTED] wrote:

If you add the master node to the list of slaves (conf/slaves), then the master node will also run a TaskTracker. [snip: earlier messages in this thread quoted in full above]

--
Harish Mallipeddi
circos.com : poundbang.in/blog/
Re: submitting map-reduce jobs without creating jar file ?
On Apr 23, 2008, at 00:31, Ted Dunning wrote:

Grool might help you.

Got a link? Google is not very helpful on the "Grool + Groovy" search.

cheers
-- Torsten
Simple SetWritable class
I tried to extend TreeSet to be Writable. Here is what I did:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.TreeSet;

    import org.apache.hadoop.io.WritableComparable;

    public class SetWritable extends TreeSet<Integer> implements WritableComparable {

      public void readFields(DataInput in) throws IOException {
        clear();
        int sz = in.readInt();
        for (int i = 0; i < sz; i++) {
          add(in.readInt());
        }
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(size());
        Iterator<Integer> iter = this.iterator();
        while (iter.hasNext()) {
          out.writeInt(iter.next());
        }
      }

      // Needed to compile: WritableComparable extends Comparable.
      // A minimal ordering by size, for illustration only.
      public int compareTo(Object o) {
        return size() - ((SetWritable) o).size();
      }
    }

If I remove clear() from readFields(), I get wrong output (some old data is written along with the new data!). With clear() it is OK, as my simple tests show. Is this implementation safe to use?

Regards,
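A small round-trip test of the question (my own sketch, using Hadoop's DataOutputBuffer/DataInputBuffer) that shows why clear() is needed: the framework reuses Writable instances, so readFields() must overwrite all previous state:

    import org.apache.hadoop.io.DataInputBuffer;
    import org.apache.hadoop.io.DataOutputBuffer;

    public class SetWritableTest {
      public static void main(String[] args) throws Exception {
        SetWritable a = new SetWritable();
        a.add(1); a.add(2); a.add(3);

        SetWritable b = new SetWritable();
        b.add(99); // stale state, as when an instance is reused

        DataOutputBuffer out = new DataOutputBuffer();
        a.write(out);

        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        b.readFields(in); // without clear(), 99 would survive

        System.out.println(b); // expect [1, 2, 3]
      }
    }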
Re: Not able to back up to S3
Part of the problem here is that the error message is confusing. It looks like there's a problem with the AWS credentials, when in fact the host name is malformed (but URI isn't telling us). I've created a patch to make the error message more helpful: https://issues.apache.org/jira/browse/HADOOP-3301

Tom

On Fri, Apr 18, 2008 at 11:20 AM, Steve Loughran [EMAIL PROTECTED] wrote:

Chris K Wensel wrote:

you cannot have underscores in a bucket name. it freaks out java.net.URI.

freaks out DNS, too, which is why the java.net classes whine. minus signs should work

--
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/
Re: Appending to Input Path after mapping has begun
Though I have only spent a couple of days reviewing the code, it seems the crux of the problem is in the InputFormat interface, in that getSplits is only called at the initiation of the map/reduce job; it would seem that if this method were more iterable in implementation, like a getNextSplits(), you would have a way to add more files to the pipeline while in process.

== mathos

On Wed, Apr 23, 2008 at 1:32 AM, Owen O'Malley [EMAIL PROTECTED] wrote:

On Apr 22, 2008, at 11:01 AM, Thomas Cramer wrote:

Is it possible, or how may one add to the input path after mapping has begun? More specifically, say my map process creates more files that need to be mapped, and you don't want to keep re-initiating map/reduce jobs. I tried simply creating files in the input path directory. I have also pulled the JobConf object into my map process and issued an addInputPath, but apparently it doesn't affect the process after it is running. Any thoughts or options?

No, it isn't currently possible. I can imagine an extension to the framework that lets you add new input splits to a job after it has started, but it would be a lot of work to get it right. The primary advantage of such a system would be that you could increase the efficiency of a pipeline of map/reduce jobs.

-- Owen
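For reference, the method under discussion on the old org.apache.hadoop.mapred.InputFormat interface (signature reproduced from memory of the 0.15-era API, so verify against your source tree); because it returns a fixed array once at submission time, there is no hook for feeding new files to a running job:

    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    // Excerpt only: the one-shot method that fixes a job's inputs.
    public interface InputFormatExcerpt {
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    }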
Re: Hadoop summit video capture?
Certainly... Stay tuned.

Jeremy

On 4/22/08, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi Jeremy, Any chance that these videos could be made available in a downloadable format rather than through Y!'s player? For example, I'm traveling right now and would love to watch the rest of the presentations, but for the next few hours I won't have an internet connection. So my request won't help me, but it may help folks in similar situations. Just a thought, thanks!

Cheers, Chris

On 4/22/08 1:27 PM, Jeremy Zawodny [EMAIL PROTECTED] wrote:

Okay, things appear to be fixed now.

Jeremy

On 4/20/08, Jeremy Zawodny [EMAIL PROTECTED] wrote:

Not yet... there seem to be a lot of cooks in the kitchen on this one, but we'll get it fixed.

Jeremy

On 4/19/08, Cole Flournoy [EMAIL PROTECTED] wrote:

Any news on when the videos are going to work? I am dying to watch them!

Cole

On Fri, Apr 18, 2008 at 8:10 PM, Jeremy Zawodny [EMAIL PROTECTED] wrote:

Almost... The videos and slides are up (as of yesterday), but there appears to be an ACL problem with the videos. http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html

Jeremy

On 4/17/08, wuqi [EMAIL PROTECTED] wrote:

Are the videos and slides available now?

- Original Message -
From: Jeremy Zawodny [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 27, 2008 11:01 AM
Subject: Re: Hadoop summit video capture?

Slides and video go up next week. It just takes a few days to assemble. We're glad everyone enjoyed it and was okay with a last-minute venue change. Thanks also to Amazon.com and the NSF (not NFS, as I typo'd on the printed agenda!)

Jeremy

On 3/26/08, Cam Bazz [EMAIL PROTECTED] wrote:

Yes, are there any materials for those who could not come to the summit? I am really curious about this summit. Is the material posted on the hadoop page?

Best Regards, -C.A.

On Wed, Mar 26, 2008 at 8:48 AM, Isabel Drost [EMAIL PROTECTED] wrote:

On Wednesday 26 March 2008, Jeff Eastman wrote:

I personally got a lot of positive feedback and interest in Mahout, so expect your inbox to explode in the next couple of days.

Sounds great. I was already happy we received quite some traffic after we published that we would take part in the GSoC.

Isabel
--
kernel, n.: A part of an operating system that preserves the medieval traditions of sorcery and black art.
Web: http://www.isabel-drost.de
IM: xmpp://[EMAIL PROTECTED]

__
Chris Mattmann, Ph.D. [EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B Mailstop: 171-246
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Hadoop summit video capture?
Thanks, Jeremy. Appreciate it.

Cheers, Chris

On 4/23/08 8:25 AM, Jeremy Zawodny [EMAIL PROTECTED] wrote:

Certainly... Stay tuned.

Jeremy

On 4/22/08, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi Jeremy, Any chance that these videos could be made available in a downloadable format rather than through Y!'s player? [snip: remainder of the thread quoted in full in the previous message]

__
Chris Mattmann, Ph.D. [EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B Mailstop: 171-246
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: User accounts in Master and Slaves
Yes, this is the suggested configuration. Hadoop relies on password-less SSH to be able to start daemons on the slave machines. You can find instructions on creating/transferring the SSH keys here: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29

On Wed, Apr 23, 2008 at 4:39 AM, Sridhar Raman [EMAIL PROTECTED] wrote:

OK, what about the issue regarding the users? Do all the machines need to be under the same user? [snip: earlier messages in this thread quoted in full above]

--
Harish Mallipeddi
circos.com : poundbang.in/blog/
Best practices for handling many small files
Hello all, Hadoop newbie here, asking: what's the preferred way to handle large (~1 million) collections of small files (10 to 100 KB) in which each file is a single record?

1. Ignore it, and let Hadoop create a million map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a BLOCK-compressed SequenceFile, with the file names as keys. Will that work?

Thanks,
-Stuart, altlaw.org
Re: Can Hadoop process pictures and videos?
It's also easy to launch processes from Java. If you have unsplittable input files, Java can read the entire file and pass it to the child process.

On 4/22/08 11:49 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

On Apr 22, 2008, at 11:44 PM, Roland Rabben wrote:

Thanks. Do you have any pointers to how you would normally approach this? Is Hadoop Streaming the solution I am looking for?

Probably not, given that Hadoop Streaming doesn't work well with binary data. It is probably easiest to use the C++ interface. Look at the examples in src/examples/pipes.

-- Owen
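A sketch of that pattern (my own illustration; the ffmpeg command line and output name are placeholders): read the whole unsplittable file from HDFS and stream its bytes to a child process on the task node:

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PipeToChildProcess {
      public static void transcode(Configuration conf, Path input) throws Exception {
        // Placeholder command; any transcoder installed on the task nodes works.
        Process child = new ProcessBuilder("ffmpeg", "-i", "-", "out.avi").start();
        InputStream in = FileSystem.get(conf).open(input);
        OutputStream toChild = child.getOutputStream();
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) > 0) {
          toChild.write(buf, 0, n); // stream the file to the child's stdin
        }
        toChild.close();
        in.close();
        child.waitFor();
      }
    }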
Re: submitting map-reduce jobs without creating jar file ?
I haven't distributed it formally yet. If you would like a tarball, I would be happy to send it.

On 4/23/08 1:43 AM, Torsten Curdt [EMAIL PROTECTED] wrote:

On Apr 23, 2008, at 00:31, Ted Dunning wrote:

Grool might help you.

Got a link? Google is not very helpful on the "Grool + Groovy" search.

cheers
-- Torsten
Re: Best practices for handling many small files
Yes. That (#2) should work well.

On 4/23/08 8:55 AM, Stuart Sierra [EMAIL PROTECTED] wrote:

Hello all, Hadoop newbie here, asking: what's the preferred way to handle large (~1 million) collections of small files (10 to 100 KB) in which each file is a single record? 1. Ignore it, and let Hadoop create a million map processes; 2. Pack all the files into a single SequenceFile; or 3. Something else? I started writing code to do #2, transforming a big tar.bz2 into a BLOCK-compressed SequenceFile, with the file names as keys. Will that work? Thanks, -Stuart, altlaw.org
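For anyone following along, a minimal sketch of option #2 (my own code, not Stuart's): pack a local directory into a BLOCK-compressed SequenceFile with file names as keys and raw bytes as values:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      // Usage: PackSmallFiles <local input dir> <output sequence file>
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        for (File f : new File(args[0]).listFiles()) {
          byte[] buf = new byte[(int) f.length()]; // files are small (10-100 KB)
          DataInputStream in = new DataInputStream(new FileInputStream(f));
          in.readFully(buf);
          in.close();
          writer.append(new Text(f.getName()), new BytesWritable(buf));
        }
        writer.close();
      }
    }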
RE: Best practices for handling many small files
A million map processes are horrible. Aside from the overhead, don't do it if you share the cluster with other jobs (all other jobs will get killed whenever the million-map job is finished; see https://issues.apache.org/jira/browse/HADOOP-2393).

Even for #2, it begs the question of how the packing itself will be parallelized.

There's a MultiFileInputFormat that can be extended; it allows processing of multiple files in a single map task, but it needs improvement. For one, it's an abstract class, and a concrete implementation for (at least) text files would help. Also, the splitting logic is not very smart (from what I last saw). Ideally, it should take the million files and form them into N groups (say N is the size of your cluster), where each group has files local to the Nth machine, and then process each group on that machine. Currently it doesn't do this (the groups are arbitrary). But it's still the way to go; a rough skeleton follows the quoted message below.

-Original Message-
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to handle large (~1 million) collections of small files (10 to 100 KB) in which each file is a single record? [snip: full message quoted above]
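To make "can be extended" concrete, the skeleton of such a subclass (a sketch only; I have not compiled this against 0.15/0.16, and the record reader, which is the real work, is left as the comment describes):

    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MultiFileInputFormat;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MultiFileTextInputFormat extends MultiFileInputFormat {
      public RecordReader getRecordReader(InputSplit split, JobConf job,
                                          Reporter reporter) throws IOException {
        // A concrete implementation would return a RecordReader that walks
        // the paths packed into the MultiFileSplit and emits one record per
        // line (or per file); that is exactly the "concrete implementation
        // for text files" that is missing today.
        throw new UnsupportedOperationException("left as an exercise");
      }
    }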
Re: submitting map-reduce jobs without creating jar file ?
I need some advice/help on how it should be structured as a contrib module.

On 4/23/08 9:31 AM, Doug Cutting [EMAIL PROTECTED] wrote:

Ted Dunning wrote:

I haven't distributed it formally yet. If you would like a tarball, I would be happy to send it.

Can you attach it to a Jira issue? Then we can target it for a contrib module or somesuch.

Doug
Re: submitting map-reduce jobs without creating jar file ?
Just added it to HADOOP-2781. See here: https://issues.apache.org/jira/browse/HADOOP-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

On 4/23/08 9:35 AM, Torsten Curdt [EMAIL PROTECTED] wrote:

Ah, OK. Well, bring it on :-)

cheers
-- Torsten

On Apr 23, 2008, at 18:06, Ted Dunning wrote:

I haven't distributed it formally yet. If you would like a tarball, I would be happy to send it. [snip: earlier messages in this thread quoted in full above]
RE: How to instruct Job Tracker to use certain hosts only
Thanks, Owen, for the suggestion. I wonder if there would be side effects from failing tasks on those nodes consistently. Would the JobTracker blacklist the nodes for other jobs as well?

Htin

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 21, 2008 10:53 PM
To: core-user@hadoop.apache.org
Subject: Re: How to instruct Job Tracker to use certain hosts only

On Apr 18, 2008, at 1:52 PM, Htin Hlaing wrote:

I would like the first job to run on all the compute hosts in the cluster (which is the default), and then I would like to run the second job on only a subset of the hosts (due to some licensing issue).

One option would be to set mapred.map.max.attempts and mapred.reduce.max.attempts to larger numbers and have the map or reduce fail if it is run on a bad node. When the task re-runs, it will run on a different node. Eventually it will find a valid node.

-- Owen
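To illustrate Owen's suggestion, a sketch of the fail-on-unlicensed-host trick (the host names and attempt count are made-up placeholders; mapred.map.max.attempts is the property Owen names):

    import java.net.InetAddress;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public abstract class LicensedHostMapperBase extends MapReduceBase {
      // Placeholder: the hosts that hold the license.
      private static final Set<String> LICENSED =
          new HashSet<String>(Arrays.asList("node1", "node2"));

      public void configure(JobConf job) {
        try {
          String host = InetAddress.getLocalHost().getHostName();
          if (!LICENSED.contains(host)) {
            // Fail fast; with mapred.map.max.attempts raised (e.g.
            // job.setInt("mapred.map.max.attempts", 10) in the driver),
            // the attempt is retried until it lands on a licensed node.
            throw new RuntimeException("not licensed on " + host);
          }
        } catch (java.net.UnknownHostException e) {
          throw new RuntimeException(e);
        }
      }
    }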
Re: Best practices for handling many small files
Are the files to be stored on HDFS long term, or do they need to be fetched from an external authoritative source?

Depending on how things are set up in your datacenter, etc., you could aggregate them into a fat sequence file (or a few). Keep in mind how long it would take to fetch the files and aggregate them (this is a serial process), and whether the corpus changes often (i.e., how often you will need to rebuild these sequence files).

Another option is to make a manifest (a list of docs to fetch), feed that to your mapper, and have it fetch each file individually (see the sketch below). This would be useful if the corpus is reasonably arbitrary between runs and could eliminate much of the load time, but painful if the data is external to your datacenter and the cost to refetch is high.

There really is no simple answer.

ckw

On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:

A million map processes are horrible. [snip: full message quoted above]

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
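A sketch of the manifest approach (my own illustration, using the non-generic 0.15-era Mapper interface): the job input is a text file listing one document path per line, and each map call fetches one document from HDFS; fetching from an external source would swap in an HTTP client here:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FetchMapper extends MapReduceBase implements Mapper {
      private FileSystem fs;

      public void configure(JobConf job) {
        try {
          fs = FileSystem.get(job);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      // With TextInputFormat, key is the byte offset and value is one
      // manifest line: the path of a document to fetch.
      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter) throws IOException {
        Path doc = new Path(value.toString().trim());
        FSDataInputStream in = fs.open(doc);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) > 0) {
          bytes.write(buf, 0, n);
        }
        in.close();
        output.collect(new Text(doc.getName()), new BytesWritable(bytes.toByteArray()));
      }
    }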
Re: Error in start up
I changed my hostname to R61neptun as you suggested, but I am still getting that error:

localhost: starting datanode, logging to /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-datanode-R61neptun.out
localhost: starting secondarynamenode, logging to /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-secondarynamenode-R61neptun.out
localhost: Exception in thread "main" java.lang.IllegalArgumentException: port out of range:-1
localhost:     at java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost:     at org.apache.hadoop.dfs.DataNode.createSocketAddr(DataNode.java:104)
localhost:     at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost:     at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:481)
starting jobtracker, logging to /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-jobtracker-R61neptun.out
localhost: starting tasktracker, logging to /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-tasktracker-R61neptun.out

Could anyone tell me about this error? I am just trying to run Hadoop in pseudo-distributed mode.

Thanks,

On Tue, Apr 22, 2008 at 11:57 PM, Sujee Maniyam [EMAIL PROTECTED] wrote:

logs/hadoop-root-datanode-R61-neptun.out

Maybe this will help you: I am guessing, from the log file name above, that your hostname has underscores/dashes (e.g. R61-neptune). Could you try to use the hostname without underscores/dashes? (e.g. R61neptune, or even simply 'hadoop'). I had the same problem with Hadoop v0.16.3. My hostnames were 'hadoop_master / hadoop_slave', and I was getting the 'port out of range -1' exception. Once I eliminated the underscores (e.g. master / slave), it started working.

thanks
Benchmarking and Statistics for Hadoop Distributed File System
Is anybody aware of any benchmarking of the Hadoop Distributed File System? Some numbers I am interested in are:

- How long does it take for the master to recover if there are, say, 1 million blocks in the system?
- How does the recovery time change as the number of blocks in the system changes? Is it linear? Is it exponential?
- What is the file read/write throughput of the Hadoop File System with different configurations and loads?

--
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info
Please Help: Namenode Safemode
I have a Hadoop distributed file system with 3 datanodes. I only have 150 blocks in each datanode, yet it takes a little more than a minute for the namenode to start and pass the safe mode phase. The steps for namenode start, as much as I understand, are:

1) The datanode sends a heartbeat to the namenode. The namenode tells the datanode to send a block report as a piggyback on the heartbeat.
2) The datanode computes the block report.
3) The datanode sends it to the namenode.
4) The namenode processes the block report.
5) The namenode's safe mode monitor thread checks whether to exit, and the namenode exits safe mode if the threshold is reached and the extension time has passed.

Here are my numbers:

Step 1) Datanodes send heartbeats every 3 seconds.
Step 2) This takes about 20 milliseconds, as shown in the datanodes' logs.
Step 3) No idea? (Depends on the size of the block report. I suspect this should not be more than a couple of seconds.)
Step 4) No idea? Shouldn't be more than a couple of seconds.
Step 5) The thread checks every second. The extension value in my configuration is 0, so there is no wait once the threshold is reached.

Given these numbers, can anybody explain where the one minute comes from? Shouldn't this take 10-20 seconds? Please help. I am very confused.

--
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info
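For reference, the two safe-mode knobs involved in step 5, read programmatically (a sketch; the property names are the ones I believe hadoop-default.xml of this era uses, dfs.safemode.threshold.pct and dfs.safemode.extension, so verify against your copy; the fallback values here are arbitrary):

    import org.apache.hadoop.conf.Configuration;

    public class SafeModeKnobs {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fraction of blocks that must report before safe mode can end.
        System.out.println(conf.getFloat("dfs.safemode.threshold.pct", 0.999f));
        // Extra milliseconds to remain in safe mode after the threshold
        // is reached; the poster has this set to 0.
        System.out.println(conf.getLong("dfs.safemode.extension", 0L));
      }
    }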
Mounting HDFS in Linux using FUSE
Hi! I have recently downloaded the latest version of Fuse-DFS. I would like to ask if anyone can help me in using it / mounting HDFS on Linux. Need help ASAP. Thanks.

Currently I have on my machine:

OS: Ubuntu 7.10
Hadoop: 0.16.3
Java: 1.6.3
Fuse: 2.7.3