How to configure SWIM
Hi all, Can anybody help me to configure SWIM -- Statistical Workload Injector for MapReduce on my hadoop cluster
Re: Browse the filesystem weblink broken after upgrade to 1.0.0: HTTP 404 Problem accessing /browseDirectory.jsp
On Wed, Feb 29, 2012 at 11:34 PM, W.P. McNeill bill...@gmail.com wrote: I can perform HDFS operations from the command line like hadoop fs -ls /. Doesn't that mean that the datanode is up? No. That is just a metadata lookup, which comes from the Namenode. Try to cat some file, like hadoop fs -cat . If you are able to get data, then the datanode should be up. Also make sure that HDFS is not in safemode. To turn off safemode use the hdfs command hadoop dfsadmin -safemode leave and then restart the jobtracker and tasktracker. -- Join me at http://hadoopworkshop.eventbrite.com/
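To make that check concrete, something along these lines should work from the shell (the file path is just a placeholder for any file you know exists in HDFS):

  hadoop fs -cat /some/existing/file | head   # forces a real block read from a datanode
  hadoop dfsadmin -report                     # lists live and dead datanodes
  hadoop dfsadmin -safemode get               # shows whether the namenode is in safemode
  hadoop dfsadmin -safemode leave             # only if it is stuck in safemode

If the cat hangs or fails while -report shows no live datanodes, the datanode side is the problem rather than the namenode.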
Distributed Indexing on MapReduce
Hi all, I am looking into reusing some existing code for distributed indexing to test a Mahout tool I am working on (https://issues.apache.org/jira/browse/MAHOUT-944). What I want is to index the Apache Public Mail Archives dataset (200G) via MapReduce on Hadoop. I have been going through the Nutch and contrib/index code and from my understanding I have to: * Create an InputFormat / RecordReader / InputSplit class for splitting the e-mails across mappers * Create a Mapper which emits the e-mails as key/value pairs * Create a Reducer which indexes the e-mails on the local filesystem (or straight to HDFS?) * Copy these indexes from the local filesystem to HDFS. In the same Reducer? I am unsure about the final steps. How do I get to the end result, a bunch of index shards on HDFS? It seems that each Reducer needs to be aware of the directory it eventually writes to on HDFS. I don't see how to get each reducer to copy its shard to HDFS. How do I set this up? Cheers, Frank
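One rough way to handle those final steps, assuming each reducer builds its shard in a local temp directory and copies it up when it finishes. The class name, the /user/frank/index path and the addToLocalIndex() helper are made up for illustration; only the Hadoop API calls are real:

  import java.io.File;
  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class IndexShardReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    private File localShardDir;   // local working directory for this reducer's shard
    private int shard;

    @Override
    protected void setup(Context context) {
      shard = context.getTaskAttemptID().getTaskID().getId();   // unique per reducer
      localShardDir = new File("/tmp/index-shard-" + shard);
      localShardDir.mkdirs();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> mails, Context context) throws IOException {
      for (Text mail : mails) {
        // Stand-in for the real Lucene (or other) index-writing code against localShardDir.
        addToLocalIndex(localShardDir, key.toString(), mail.toString());
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      // Each reducer copies its finished shard to its own directory under a common HDFS root.
      FileSystem fs = FileSystem.get(context.getConfiguration());
      fs.copyFromLocalFile(new Path(localShardDir.getAbsolutePath()),
                           new Path("/user/frank/index/shard-" + shard));
    }

    private void addToLocalIndex(File dir, String id, String body) {
      // placeholder for the actual indexing logic
    }
  }

The local-disk-then-copy pattern is common because most index writers expect a local filesystem; the per-reducer shard number is what keeps the output directories from colliding.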
Re: Streaming Hadoop using C
How was your experience of starfish? C On Mar 1, 2012, at 12:35 AM, Mark question wrote: Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
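To make the streaming side concrete: the mapper is a plain native executable that reads lines on stdin and writes tab-separated key/value lines to stdout, so malloc(), free() and sizeof behave exactly as in any standalone C program; only the surrounding plumbing is Java. A map-only streaming job with a C binary is typically launched roughly like this (jar path and binary name depend on the install):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
      -input /data/in \
      -output /data/out \
      -mapper ./my_mapper \
      -reducer NONE \
      -file my_mapper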
Re: Should splittable Gzip be a core hadoop feature?
I do agree that a GitHub project is the way to go unless you could convince Cloudera, Hortonworks or MapR to pick it up and support it. They have enough committers. Is this potentially worthwhile? Maybe, it depends on how the cluster is integrated into the overall environment. Companies that have standardized on using gzip would find it useful. Sent from a remote device. Please excuse any typos... Mike Segel On Feb 29, 2012, at 3:17 PM, Niels Basjes ni...@basjes.nl wrote: Hi, On Wed, Feb 29, 2012 at 19:13, Robert Evans ev...@yahoo-inc.com wrote: What I really want to know is how well does this new CompressionCodec perform in comparison to the regular gzip codec in various different conditions and what type of impact does it have on network traffic and datanode load. My gut feeling is that the speedup is going to be relatively small except when there is a lot of computation happening in the mapper I agree, I made the same assessment. In the javadoc I wrote under When is this useful? *Assume you have a heavy map phase for which the input is a 1GiB Apache httpd logfile. Now assume this map takes 60 minutes of CPU time to run.* and the added load and network traffic outweighs the speedup in most cases, No, the trick to solve that one is to upload the gzipped files with an HDFS blocksize equal to (or 1 byte larger than) the filesize. This setting will help in speeding up gzipped input files in any situation (no more network overhead). From there, the HDFS replication factor of the file dictates the optimal number of splits for this codec. but like all performance on a complex system gut feelings are almost worthless and hard numbers are what is needed to make a judgment call. Yes Niels, I assume you have tested this on your cluster(s). Can you share with us some of the numbers? No, I haven't tested it beyond a multi-core system. The simple reason for that is that when this was under review last summer the whole Yarn thing happened and I was unable to run it at all for a long time. I only got it running again last December when the restructuring of the source tree was mostly done. At this moment I'm building an experimentation setup at work that can be used for various things. Given the current state of Hadoop 2.0 I think it's time to produce some actual results. -- Best regards / Met vriendelijke groeten, Niels Basjes
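The upload trick Niels describes can be done per file by overriding the block size at put time. A rough example for a log file a bit over 1 GiB (the size and paths are placeholders; the value must be a multiple of the checksum chunk size, 512 bytes by default):

  hadoop fs -Ddfs.block.size=1342177280 -put access.log.gz /logs/access.log.gz

With the whole .gz inside one block, every map reading it can be scheduled on a node that holds a replica, so no map has to pull gzip data across the network.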
Re: Hadoop fair scheduler doubt: allocate jobs to pool
From the fairscheduler docs I assume the following should work: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>pool.name</value> </property> <property> <name>pool.name</name> <value>${mapreduce.job.group.name}</value> </property> which means that the default pool will be the group of the user that has submitted the job. In your case I think that allocations.xml is correct. If you want to explicitly assign a job to a specific pool from your allocations.xml file you can define it as follows: Configuration conf3 = conf; conf3.set("pool.name", "pool3"); // conf.set("property.name", "value") Let me know if it works.. On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote: How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>group.name</value> </property> But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this: <allocations> <pool name="hadoop"> <minMaps>1</minMaps> <minReduces>1</minReduces> <maxMaps>3</maxMaps> <maxReduces>3</maxReduces> </pool> <userMaxJobsDefault>5</userMaxJobsDefault> </allocations> Any help is greatly appreciated. Thanks, Austin
RE: Hadoop fair scheduler doubt: allocate jobs to pool
I've just started playing with the Fair Scheduler. To specify the pool at job submission time you set the mapred.fairscheduler.pool property on the Job Conf to the name of the pool you want the job to use. Dave -Original Message- From: Merto Mertek [mailto:masmer...@gmail.com] Sent: Thursday, March 01, 2012 9:33 AM To: common-user@hadoop.apache.org Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool From the fairscheduler docs I assume the following should work: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>pool.name</value> </property> <property> <name>pool.name</name> <value>${mapreduce.job.group.name}</value> </property> which means that the default pool will be the group of the user that has submitted the job. In your case I think that allocations.xml is correct. If you want to explicitly assign a job to a specific pool from your allocations.xml file you can define it as follows: Configuration conf3 = conf; conf3.set("pool.name", "pool3"); // conf.set("property.name", "value") Let me know if it works.. On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote: How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>group.name</value> </property> But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this: <allocations> <pool name="hadoop"> <minMaps>1</minMaps> <minReduces>1</minReduces> <maxMaps>3</maxMaps> <maxReduces>3</maxReduces> </pool> <userMaxJobsDefault>5</userMaxJobsDefault> </allocations> Any help is greatly appreciated. Thanks, Austin The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.
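In code that might look roughly like this (MyJob and the pool name "hadoop" are placeholders; the pool should match one defined in your fair-scheduler.xml):

  JobConf conf = new JobConf(MyJob.class);          // org.apache.hadoop.mapred.JobConf
  conf.set("mapred.fairscheduler.pool", "hadoop");  // pool to submit this job to
  JobClient.runJob(conf);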
Re: Hadoop fair scheduler doubt: allocate jobs to pool
Thanks, I will be trying the suggestions and will get back to you soon. On Thu, Mar 1, 2012 at 8:09 PM, Dave Shine dave.sh...@channelintelligence.com wrote: I've just started playing with the Fair Scheduler. To specify the pool at job submission time you set the mapred.fairscheduler.pool property on the Job Conf to the name of the pool you want the job to use. Dave -Original Message- From: Merto Mertek [mailto:masmer...@gmail.com] Sent: Thursday, March 01, 2012 9:33 AM To: common-user@hadoop.apache.org Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool From the fairscheduler docs I assume the following should work: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>pool.name</value> </property> <property> <name>pool.name</name> <value>${mapreduce.job.group.name}</value> </property> which means that the default pool will be the group of the user that has submitted the job. In your case I think that allocations.xml is correct. If you want to explicitly assign a job to a specific pool from your allocations.xml file you can define it as follows: Configuration conf3 = conf; conf3.set("pool.name", "pool3"); // conf.set("property.name", "value") Let me know if it works.. On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote: How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>group.name</value> </property> But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this: <allocations> <pool name="hadoop"> <minMaps>1</minMaps> <minReduces>1</minReduces> <maxMaps>3</maxMaps> <maxReduces>3</maxReduces> </pool> <userMaxJobsDefault>5</userMaxJobsDefault> </allocations> Any help is greatly appreciated. Thanks, Austin The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.
Re: Hadoop fair scheduler doubt: allocate jobs to pool
Hi, I tried what you had said. I added the following to mapred-site.xml: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>pool.name</value> </property> <property> <name>pool.name</name> <value>${mapreduce.job.group.name}</value> </property> Funny enough, it created a pool with the literal name ${mapreduce.job.group.name}, so I tried ${mapred.job.group.name} and ${group.name}, all to the same effect. But when I did ${user.name} it worked and created a pool with the user name! On Thu, Mar 1, 2012 at 8:03 PM, Merto Mertek masmer...@gmail.com wrote: From the fairscheduler docs I assume the following should work: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>pool.name</value> </property> <property> <name>pool.name</name> <value>${mapreduce.job.group.name}</value> </property> which means that the default pool will be the group of the user that has submitted the job. In your case I think that allocations.xml is correct. If you want to explicitly assign a job to a specific pool from your allocations.xml file you can define it as follows: Configuration conf3 = conf; conf3.set("pool.name", "pool3"); // conf.set("property.name", "value") Let me know if it works.. On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote: How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name? I have set up the fair scheduler and I have two users: A and B (belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml: <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>group.name</value> </property> But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this: <allocations> <pool name="hadoop"> <minMaps>1</minMaps> <minReduces>1</minReduces> <maxMaps>3</maxMaps> <maxReduces>3</maxReduces> </pool> <userMaxJobsDefault>5</userMaxJobsDefault> </allocations> Any help is greatly appreciated. Thanks, Austin
kill -QUIT
When I try kill -QUIT on a task it doesn't send the stack trace to the log files. Does anyone know why, or if I am doing something wrong? I find the task process using ps -ef | grep attempt. I then go to logs/userlogs/<jobid>/<attemptid>/
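One thing worth checking, assuming the standard log layout: a JVM that receives SIGQUIT writes its thread dump to its own stdout, not to the .log/syslog files, so for a task it should land in the attempt's stdout file. Roughly:

  ps -ef | grep attempt_                      # find the child JVM for the task attempt
  kill -QUIT <pid>                            # <pid> is the child java process id
  cat logs/userlogs/<jobid>/<attemptid>/stdout

If the dump is not there either, make sure the signal went to the child java process rather than to the TaskTracker.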
High quality hadoop logo?
Is there a high quality version of the hadoop logo anywhere? Even the graphic presented on the Apache page itself suffers from dreadful jpeg artifacting. A google image search didn't inspire much hope on this issue (they all have the same low-quality jpeg appearance). I'm looking for good graphics for slides, presentations, publications, etc. Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
Re: High quality hadoop logo?
Sorry, false alarm. I was looking at the popup thumbnails in google image search. If I click all the way through, there are some high quality versions available. Why is the version on the Apache site (and the Wikipedia page) so poor? On Mar 1, 2012, at 14:09 , Keith Wiley wrote: Is there a high quality version of the hadoop logo anywhere? Even the graphic presented on the Apache page itself suffers from dreadful jpeg artifacting. A google image search didn't inspire much hope on this issue (they all have the same low-quality jpeg appearance). I'm looking for good graphics for slides, presentations, publications, etc. Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
Re: High quality hadoop logo?
On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley kwi...@keithwiley.com wrote: Sorry, false alarm. I was looking at the popup thumbnails in google image search. If I click all the way through, there are some high quality versions available. Why is the version on the Apache site (and the Wikipedia page) so poor? The high resolution images are in subversion: http://svn.apache.org/repos/asf/hadoop/logos/ -- Owen
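For anyone who wants the originals locally, a plain checkout should do it (the target directory name is arbitrary):

  svn checkout http://svn.apache.org/repos/asf/hadoop/logos/ hadoop-logos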
Re: Streaming Hadoop using C
Starfish worked great for wordcount .. I didn't run it on my application because I have only map tasks. Mark On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl charles.ce...@gmail.comwrote: How was your experience of starfish? C On Mar 1, 2012, at 12:35 AM, Mark question wrote: Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com wrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: High quality hadoop logo?
Excellent! Thank you. Sent from my phone, please excuse my brevity. Keith Wiley, kwi...@keithwiley.com, http://keithwiley.com Owen O'Malley omal...@apache.org wrote: On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley kwi...@keithwiley.com wrote: Sorry, false alarm. I was looking at the popup thumbnails in google image search. If I click all the way through, there are some high quality versions available. Why is the version on the Apache site (and the Wikipedia page) so poor? The high resolution images are in subversion: http://svn.apache.org/repos/asf/hadoop/logos/ -- Owen
Adding nodes
Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
You only have to refresh nodes if you're making use of an allows file. Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
Not quite. Datanodes get the namenode host from fs.default.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
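Concretely, a new slave only needs the addresses of the master daemons in its own config; the hostnames and ports below are placeholders:

  core-site.xml:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:8020</value>
    </property>

  mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:8021</value>
    </property>

With those in place, starting the datanode and tasktracker on the slave is what registers it with the cluster.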
Re: Adding nodes
The master and slave files, if I remember correctly are used to start the correct daemons on the correct nodes from the master node. Raj From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thursday, March 1, 2012 4:57 PM Subject: Re: Adding nodes Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote: Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
Whatever Joey said is correct for Cloudera's distribution. For same, I am not confident about other distribution as i haven't tried them. Thanks, Anil On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan rajv...@yahoo.com wrote: The master and slave files, if I remember correctly are used to start the correct daemons on the correct nodes from the master node. Raj From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thursday, March 1, 2012 4:57 PM Subject: Re: Adding nodes Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes? -- Thanks Regards, Anil Gupta
Re: Adding nodes
WHat Joey said is correct for both apache and cloudera distros. The DN/TT daemons will connect to the NN/JT using the config files. The master and slave files are used for starting the correct daemons. From: anil gupta anilg...@buffalo.edu To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com Sent: Thursday, March 1, 2012 5:42 PM Subject: Re: Adding nodes Whatever Joey said is correct for Cloudera's distribution. For same, I am not confident about other distribution as i haven't tried them. Thanks, Anil On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan rajv...@yahoo.com wrote: The master and slave files, if I remember correctly are used to start the correct daemons on the correct nodes from the master node. Raj From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thursday, March 1, 2012 4:57 PM Subject: Re: Adding nodes Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes? -- Thanks Regards, Anil Gupta
Re: Adding nodes
It is initiated by the slave. If you have defined files to state which slaves can talk to the namenode (using config dfs.hosts) and which hosts cannot (using property dfs.hosts.exclude) then you would need to edit these files and issue the refresh command. On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote: On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote: Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes? -- Arpit Hortonworks, Inc. email: ar...@hortonworks.com
Re: Adding nodes
Thanks all for the answers!! On Thu, Mar 1, 2012 at 5:52 PM, Arpit Gupta ar...@hortonworks.com wrote: It is initiated by the slave. If you have defined files to state which slaves can talk to the namenode (using config dfs.hosts) and which hosts cannot (using property dfs.hosts.exclude) then you would need to edit these files and issue the refresh command. On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote: On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote: Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes? -- Arpit Hortonworks, Inc. email: ar...@hortonworks.com http://www.hadoopsummit.org/ http://www.hadoopsummit.org/ http://www.hadoopsummit.org/
Re: Adding nodes
Mohit, New datanodes will connect to the namenode, so that's how the namenode knows. Just make sure the datanodes have the correct {fs.default.name} in their core-site.xml and then start them. The namenode can, however, choose to reject the datanode if you are using the {dfs.hosts} and {dfs.hosts.exclude} settings in the namenode's hdfs-site.xml. The namenode doesn't actually care about the slaves file. It's only used by the start/stop scripts. On 2012/03/02 10:35, Mohit Anchlia wrote: I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started?
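If an include file is used, the namenode-side configuration and the refresh step look roughly like this (the file paths below are placeholders):

  <property>
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/dfs.hosts.include</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.hosts.exclude</value>
  </property>

After adding the new hostname to the include file, run hadoop dfsadmin -refreshNodes on the namenode. Without dfs.hosts configured, simply starting the new datanode/tasktracker is enough, as described above.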
Re: LZO exception decompressing (returned -8)
Tried 0.4.15 but I am still getting the error. Really lost with this. My hadoop release is 0.20.2, from more than a year ago. Could this be related to the problem? -- View this message in context: http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792484.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: LZO exception decompressing (returned -8)
Marc, Was the lzo libs on your server upgraded to a higher version recently? Also, when you deployed a built copy of 0.4.15, did you ensure you replaced the older native libs for hadoop-lzo as well? On Fri, Mar 2, 2012 at 9:05 AM, Marc Sturlese marc.sturl...@gmail.com wrote: Tried but still getting the error 0.4.15. Really lost with this. My hadoop release is 0.20.2 from more than a year ago. Could this be related to the problem? -- View this message in context: http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792484.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com. -- Harsh J
Re: LZO exception decompressing (returned -8)
Yes, the steps I followed were: 1- Install lzo 2.06 on a machine with the same kernel as my nodes. 2- Compile hadoop-lzo 0.4.15 there (in /lib I replaced the cdh3u3 jar with my hadoop 0.20.2 release). 3- Replace hadoop-lzo-0.4.9.jar with the newly compiled hadoop-lzo-0.4.15.jar in the hadoop lib directory of all my nodes and the master. 4- Put the generated native files in the native lib directory of all the nodes and the master. 5- In my job jar, replace the library hadoop-lzo-0.4.9.jar with hadoop-lzo-0.4.15.jar. And sometimes when a job is running I get (4 times, so the job gets killed): ...org.apache.hadoop.mapred.ReduceTask: Shuffling 3188320 bytes (1025174 raw bytes) into RAM from attempt_201202291221_1501_m_000480_0 2012-03-02 02:32:55,496 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201202291221_1501_r_000105_0: Failed fetch #1 from attempt_201202291221_1501_m_46_0 2012-03-02 02:32:55,496 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201202291221_1501_r_000105_0 adding host hadoop-01.backend to penalty box, next contact in 4 seconds 2012-03-02 02:32:55,496 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201202291221_1501_r_000105_0: Got 1 map-outputs from previous failures 2012-03-02 02:32:55,497 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201202291221_1501_r_000105_0 : Map output copy failure : java.lang.InternalError: lzo1x_decompress returned: -8 at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method) at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305) at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216) -- View this message in context: http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792505.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: LZO exception decompressing (returned -8)
I use to have 2.05 but now as I said I installed 2.06 -- View this message in context: http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792511.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: LZO exception decompressing (returned -8)
I know this doesn't fix lzo, but have you considered Snappy for the intermediate output compression? It gets similar compression ratios and compress/decompress speed, but arguably has better Hadoop integration. -Joey On Thu, Mar 1, 2012 at 10:01 PM, Marc Sturlese marc.sturl...@gmail.com wrote: I use to have 2.05 but now as I said I installed 2.06 -- View this message in context: http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792511.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com. -- Joseph Echeverria Cloudera, Inc. 443.305.9434
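For reference, switching intermediate (map output) compression to Snappy is a mapred-site.xml change along these lines, assuming your distribution ships the Snappy codec (stock Apache 0.20.2 may not include it):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>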
Re: LZO exception decompressing (returned -8)
Absolutely. In case I don't find the root of the problem soon I'll definitely try it. -- View this message in context: http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792531.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Comparison of Apache Pig Vs. Hadoop Streaming M/R
Hello Folks, Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs? Also, there was a claim in our company that Pig performs better than Map Reduce jobs. Is this true? Are there any such benchmarks available? Thanks, Subir
Re: Where Is DataJoinMapperBase?
Hi, Please look inside the $HADOOP_HOME/contrib/datajoin folder of the 0.20.2 version. You will find the jar. On Sat, Feb 11, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote: Hi, all, I am starting to learn advanced Map/Reduce. However, I cannot find the class DataJoinMapperBase in my downloaded Hadoop 1.0.0 and 0.20.2. So I searched on the Web and got the following link. http://www.java2s.com/Code/Jar/h/Downloadhadoop0201datajoinjar.htm From the link I got the package hadoop-0.20.1-datajoin.jar. My question is why the package is not included in Hadoop 1.0.0 and 0.20.2? Is this the correct way to get it? Thanks so much! Best regards, Bing -- Join me at http://hadoopworkshop.eventbrite.com/
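To make the contrib classes available at runtime, the jar usually has to be added to the job classpath, for example via -libjars. The jar name, job class and paths below are illustrative, and -libjars requires the driver to go through ToolRunner/GenericOptionsParser:

  hadoop jar myjob.jar MyJoinJob \
      -libjars $HADOOP_HOME/contrib/datajoin/hadoop-0.20.2-datajoin.jar \
      /input/left /input/right /output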
Re: DFSIO
Hi, Only HDFS should be enough. On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do than...@cs.wisc.edu wrote: hi all, in order to run DFSIO in my cluster, do i need to run JobTracker, and TaskTracker, or just running HDFS is enough? Many thanks, Thanh -- Join me at http://hadoopworkshop.eventbrite.com/
Re: DFSIO
Madhu, That is incorrect. TestDFSIO is a MapReduce job and you need HDFS+MR setup to use it. On Fri, Mar 2, 2012 at 11:07 AM, madhu phatak phatak@gmail.com wrote: Hi, Only HDFS should be enough. On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do than...@cs.wisc.edu wrote: hi all, in order to run DFSIO in my cluster, do i need to run JobTracker, and TaskTracker, or just running HDFS is enough? Many thanks, Thanh -- Join me at http://hadoopworkshop.eventbrite.com/ -- Harsh J
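For reference, TestDFSIO is typically launched from the test jar along these lines (the jar name varies by release; -fileSize is in MB):

  hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
  hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Since it runs as a MapReduce job, the JobTracker and TaskTrackers must be up in addition to HDFS.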
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Considering Pig essentially translates scripts into Map Reduce jobs, one can always write Map Reduce jobs at least as good as the ones Pig generates. You can refer to the Pig experience paper to see the overhead Pig introduces, but it's being improved all the time. Btw, if you really care about performance, how you configure Hadoop and Pig can also play an important role. Thanks, Jie -- Starfish is an intelligent performance tuning tool for Hadoop. Homepage: www.cs.duke.edu/starfish/ Mailing list: http://groups.google.com/group/hadoop-starfish On Thu, Mar 1, 2012 at 11:48 PM, Subir S subir.sasiku...@gmail.com wrote: Hello Folks, Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs? Also there was a claim in our company that Pig performs better than Map Reduce jobs? Is this true? Are there any such benchmarks available Thanks, Subir
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote: Hello Folks, Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs? I do not see why you seek to compare these two. Pig offers a language that lets you write data-flow operations and runs these statements as a series of MR jobs for you automatically (Making it a great tool to use to get data processing done really quick, without bothering with code), while streaming is something you use to write non-Java, simple MR jobs. Both have their own purposes. Also there was a claim in our company that Pig performs better than Map Reduce jobs? Is this true? Are there any such benchmarks available Pig _runs_ MR jobs. It does do job design (and some data) optimizations based on your queries, which is what may give it an edge over designing elaborate flows of plain MR jobs with tools like Oozie/JobControl (Which takes more time to do). But regardless, Pig only makes it easy doing the same thing with Pig Latin statements for you. -- Harsh J
Re: DFSIO
Hi Harsh, Sorry, I read DFSIO as DFS Input/Output, which I thought meant reading and writing using the HDFS API :) On Fri, Mar 2, 2012 at 12:32 PM, Harsh J ha...@cloudera.com wrote: Madhu, That is incorrect. TestDFSIO is a MapReduce job and you need HDFS+MR setup to use it. On Fri, Mar 2, 2012 at 11:07 AM, madhu phatak phatak@gmail.com wrote: Hi, Only HDFS should be enough. On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do than...@cs.wisc.edu wrote: hi all, in order to run DFSIO in my cluster, do i need to run JobTracker, and TaskTracker, or just running HDFS is enough? Many thanks, Thanh -- Join me at http://hadoopworkshop.eventbrite.com/ -- Harsh J -- Join me at http://hadoopworkshop.eventbrite.com/