Re: Hadoop streaming cacheArchive
Norbert Burger wrote:
> I'm trying to use the cacheArchive command-line option with
> hadoop-0.15.3-streaming.jar. I'm using the option as follows:
>
>     -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
>
> Unfortunately, my Perl scripts fail with an error consistent with not
> being able to find the 'lib' directory (which, as I understand it, should
> point back to an extracted version of lib.jar).

Here, lib is created as a symlink in the task's working directory. It will contain the jar file and the extracted contents of the jar. Where are your Perl scripts searching for lib? Is '.' included in your classpath? Otherwise you can use the mapred.job.classpath.archives config item, which adds the files to the classpath as well as to the distributed cache:

    -jobconf mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib

> I know that the original JAR exists in HDFS, but I don't see any evidence
> of lib.jar or a link called 'lib' inside my job.jar.

The link 'lib' will not be part of job.jar. The archive is distributed to all the nodes at task launch, and the task's current working directory will have the link 'lib' pointing to the jar in the cache.

> How can I troubleshoot cacheArchive further? Should the files/dirs
> specified via cacheArchive be contained inside the job.jar? If not, where
> should they be in HDFS?

They can be anywhere on HDFS. You need to give the complete path to add them to the cache.

> Thanks for any help. Norbert
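For anyone driving jobs from Java rather than the streaming jar, this is a rough equivalent of -cacheArchive using the DistributedCache API of that era (a hedged sketch; check the method names against your release):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheArchiveExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // Materialize the "#lib" fragment as a symlink in each task's
            // working directory.
            DistributedCache.createSymlink(conf);
            DistributedCache.addCacheArchive(
                new URI("hdfs://host:50001/user/root/lib.jar#lib"), conf);
            // Each task then sees ./lib, holding the extracted contents
            // of lib.jar.
        }
    }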
Re: Trash option in hadoop-site.xml configuration.
Thank you for the clarification. Here is another question: if two different clients move files to trash with different intervals (e.g. client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval = 120), what happens? Does the namenode keep track of all this information?

/Taeho

On 3/20/08, dhruba Borthakur [EMAIL PROTECTED] wrote:
> The trash feature is a client-side option and depends on the client
> configuration file. If the client's configuration specifies that Trash is
> enabled, then the HDFS client invokes a rename to Trash instead of a
> delete. Now, if Trash is enabled on the Namenode, then the Namenode
> periodically removes contents from the Trash directory.
>
> This design might be confusing to some users, but it provides the
> flexibility that different clients in the cluster can have Trash either
> enabled or disabled.
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Taeho Kang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 19, 2008 3:13 AM
> To: [EMAIL PROTECTED]; core-user@hadoop.apache.org; [EMAIL PROTECTED]
> Subject: Trash option in hadoop-site.xml configuration.
>
> Hello,
>
> I have two machines that act as clients to HDFS. Node #1 has the Trash
> option enabled (fs.trash.interval set to 60) and Node #2 has it off
> (fs.trash.interval set to 0). When I order a file deletion from Node #2,
> the file gets deleted right away, while the file gets moved to trash when
> I do the same from Node #1.
>
> This is a bit of a surprise to me, because I thought the Trash option set
> in the master node's config file applied to everyone who connects to and
> uses the HDFS. Was there any reason why the Trash option was implemented
> this way?
>
> Thank you in advance,
> /Taeho
MapFile and MapFileOutputFormat
Hi,

I have two questions regarding MapFile in hadoop/hdfs.

First, when using MapFileOutputFormat as a reducer's output, is there any way to change the index interval (i.e., the equivalent of calling setIndexInterval() on the output MapFile)?

Second, is it possible to tell what the position in the data file is for a given key, assuming the index interval is 1 and the number of keys is small?

Thanks,
Rong-En Fan
RE: Trash option in hadoop-site.xml configuration.
Actually, the fs.trash.interval value has no significance on the client beyond being zero or non-zero: if it is non-zero, the client does a rename instead of a delete. The interval itself is used only by the namenode to periodically remove files from Trash; the periodicity is the value specified by fs.trash.interval on the namenode.

Hope this helps,
dhruba

-----Original Message-----
From: Taeho Kang [mailto:[EMAIL PROTECTED]]
Sent: Thu 3/20/2008 1:53 AM
To: core-user@hadoop.apache.org
Subject: Re: Trash option in hadoop-site.xml configuration.

> Thank you for the clarification. Here is another question: if two
> different clients move files to trash with different intervals (e.g.
> client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval
> = 120), what happens? Does the namenode keep track of all this
> information?
>
> /Taeho
>
> On 3/20/08, dhruba Borthakur [EMAIL PROTECTED] wrote:
>> The trash feature is a client-side option and depends on the client
>> configuration file. If the client's configuration specifies that Trash
>> is enabled, then the HDFS client invokes a rename to Trash instead of a
>> delete. Now, if Trash is enabled on the Namenode, then the Namenode
>> periodically removes contents from the Trash directory.
>>
>> This design might be confusing to some users, but it provides the
>> flexibility that different clients in the cluster can have Trash either
>> enabled or disabled.
>>
>> Thanks,
>> dhruba
>>
>> -----Original Message-----
>> From: Taeho Kang [mailto:[EMAIL PROTECTED]]
>> Sent: Wednesday, March 19, 2008 3:13 AM
>> To: [EMAIL PROTECTED]; core-user@hadoop.apache.org; [EMAIL PROTECTED]
>> Subject: Trash option in hadoop-site.xml configuration.
>>
>> Hello,
>>
>> I have two machines that act as clients to HDFS. Node #1 has the Trash
>> option enabled (fs.trash.interval set to 60) and Node #2 has it off
>> (fs.trash.interval set to 0). When I order a file deletion from Node
>> #2, the file gets deleted right away, while the file gets moved to
>> trash when I do the same from Node #1.
>>
>> This is a bit of a surprise to me, because I thought the Trash option
>> set in the master node's config file applied to everyone who connects
>> to and uses the HDFS. Was there any reason why the Trash option was
>> implemented this way?
>>
>> Thank you in advance,
>> /Taeho
Re: MapFile and MapFileOutputFormat
Rong-en Fan wrote:
> I have two questions regarding MapFile in hadoop/hdfs. First, when using
> MapFileOutputFormat as a reducer's output, is there any way to change the
> index interval (i.e., the equivalent of calling setIndexInterval() on the
> output MapFile)?

Not at present. It would probably be good to change MapFile to get this value from the Configuration. A static method could be added, MapFile#setIndexInterval(Configuration conf, int interval), that sets io.mapfile.index.interval, and the MapFile constructor could read this property from the Configuration. One could then use the static method to set this on jobs. If you need this, please file an issue in Jira. If possible, include a patch too. http://wiki.apache.org/hadoop/HowToContribute

> Second, is it possible to tell what the position in the data file is for
> a given key, assuming the index interval is 1 and the number of keys is
> small?

One could read the index file explicitly. It's just a SequenceFile, listing keys and positions in the data file. But why would you set the index interval to 1? And why do you need to know the position?

Doug
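To make the proposal concrete, here is a minimal sketch of the pair of methods Doug describes (this is the suggested change, not the shipped 0.16 API; 128 is the long-standing MapFile default):

    import org.apache.hadoop.conf.Configuration;

    public class MapFileIndexIntervalSketch {
        // The proposed static setter: record the interval in the job's
        // Configuration under io.mapfile.index.interval.
        public static void setIndexInterval(Configuration conf, int interval) {
            conf.setInt("io.mapfile.index.interval", interval);
        }

        // ...and what MapFile.Writer's constructor would read back,
        // falling back to the current default.
        static int getIndexInterval(Configuration conf) {
            return conf.getInt("io.mapfile.index.interval", 128);
        }
    }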
Re: Hadoop on EC2 for large cluster
Hi,

Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It already contains this part:

    echo "Copying private key to slaves"
    for slave in `cat slaves`; do
      scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
      ssh $SSH_OPTS [EMAIL PROTECTED] "chmod 600 /root/.ssh/id_rsa"
      sleep 1
    done

Anyway, did you try the hadoop-ec2 script? It works well for the task you described.

Prasan Ary wrote:
> Hi All,
> I have been trying to configure Hadoop on EC2 for a large cluster (100
> plus nodes). It seems that I have to copy the EC2 private key to all the
> machines in the cluster so that they can have SSH connections. For now it
> seems I have to run a script to copy the key file to each of the EC2
> instances. I wanted to know if there is a better way to accomplish this.
> Thanks, PA

---
Andrey Pankov
Re: Hadoop on EC2 for large cluster
Yes, this isn't ideal for larger clusters. There's a Jira to address this: https://issues.apache.org/jira/browse/HADOOP-2410

Tom

On 20/03/2008, Prasan Ary [EMAIL PROTECTED] wrote:
> Hi All,
> I have been trying to configure Hadoop on EC2 for a large cluster (100
> plus nodes). It seems that I have to copy the EC2 private key to all the
> machines in the cluster so that they can have SSH connections. For now it
> seems I have to run a script to copy the key file to each of the EC2
> instances. I wanted to know if there is a better way to accomplish this.
> Thanks, PA

--
Blog: http://www.lexemetech.com/
Re: Hadoop on EC2 for large cluster
Actually, I personally use the following two-part copy technique to copy files to a cluster of boxes:

    tar cf - myfile | dsh -f host-list-file -i -c -M tar xCfv /tmp -

The first tar packages myfile into a tar stream. dsh then runs a tar on each host that unpacks the stream (in the above case, all boxes listed in host-list-file end up with /tmp/myfile after the command).

Relevant tar options: C (chdir) and v (verbose, can be given twice), so you see what got copied.

Relevant dsh options:
  -i  copy stdin to all ssh processes; requires -c
  -c  do the ssh calls concurrently
  -M  prefix the output from each ssh with the hostname

While this is not rsync, it has the benefit of running concurrently, and it is quite flexible.

Andreas

On Thursday, 20.03.2008 at 19:57 +0200, Andrey Pankov wrote:
> Hi,
>
> Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It
> already contains this part:
>
>     echo "Copying private key to slaves"
>     for slave in `cat slaves`; do
>       scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
>       ssh $SSH_OPTS [EMAIL PROTECTED] "chmod 600 /root/.ssh/id_rsa"
>       sleep 1
>     done
>
> Anyway, did you try the hadoop-ec2 script? It works well for the task you
> described.
>
> Prasan Ary wrote:
>> Hi All,
>> I have been trying to configure Hadoop on EC2 for a large cluster (100
>> plus nodes). It seems that I have to copy the EC2 private key to all the
>> machines in the cluster so that they can have SSH connections. For now
>> it seems I have to run a script to copy the key file to each of the EC2
>> instances. I wanted to know if there is a better way to accomplish this.
>> Thanks, PA
>
> ---
> Andrey Pankov
Re: Hadoop on EC2 for large cluster
You can't do this with the contrib/ec2 scripts/AMI, but passing the master's private DNS name to the slaves on boot as 'user-data' works fine. When a slave starts, it contacts the master and joins the cluster. There isn't any need for a slave to rsync from the master, which removes the dependency on the slaves having the private key. And by not using the start|stop-all scripts, you don't need to maintain the slaves file, and can thus lazily boot your cluster.

To do this, you will need to create your own AMI that works this way. Not hard, just time consuming.

On Mar 20, 2008, at 11:56 AM, Prasan Ary wrote:
> Chris,
> What do you mean when you say "boot the slaves with the master private
> name"?
>
> Chris K Wensel [EMAIL PROTECTED] wrote:
>> I found it much better to start the master first, then boot the slaves
>> with the master's private name. I do not use the start|stop-all scripts,
>> so I do not need to maintain the slaves file, and thus don't need to
>> push private keys around to support those scripts. This lets me start 20
>> nodes, then add 20 more later. Or kill some.
>>
>> By the way, get Ganglia installed; life will be better knowing what's
>> going on. Also, setting up FoxyProxy on Firefox lets you browse your
>> whole cluster if you set up an ssh tunnel (SOCKS).
>>
>> On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote:
>>> Hi All,
>>> I have been trying to configure Hadoop on EC2 for a large cluster (100
>>> plus nodes). It seems that I have to copy the EC2 private key to all
>>> the machines in the cluster so that they can have SSH connections. For
>>> now it seems I have to run a script to copy the key file to each of
>>> the EC2 instances. I wanted to know if there is a better way to
>>> accomplish this.
>>> Thanks, PA

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
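On the slave side, Chris's scheme might look roughly like this: a hedged sketch, assuming EC2's standard instance-metadata endpoint and that the user-data is simply the master's private DNS name:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class MasterFromUserData {
        public static void main(String[] args) throws Exception {
            // EC2 serves the user-data supplied at launch from a
            // well-known link-local address.
            URL url = new URL("http://169.254.169.254/latest/user-data");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(url.openStream()));
            String master = in.readLine();  // the master's private DNS name
            in.close();
            System.out.println("joining master: " + master);
            // A boot script would write this value into hadoop-site.xml
            // (fs.default.name and mapred.job.tracker) before starting
            // the datanode and tasktracker daemons.
        }
    }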
using a set of MapFiles - getting the right partition
Hi all--

I would like to have a reducer generate a MapFile so that in later processes I can look up the values associated with a few keys without processing an entire sequence file. However, if I have N reducers, I will generate N different map files, so to pick the right map file I will need to use the same partitioner as was used when partitioning the keys to reducers (the reducer I have running emits one value for each key it receives and no others). Should this be done manually, i.e. something like readers[partitioner.getPartition(...)], or is there another recommended method?

Eventually I'm going to migrate to HBase to store the key/value pairs (since I'd like to take advantage of HBase's ability to cache common pairs in memory for faster retrieval), but I'm interested in seeing what the performance is like just using MapFiles.

Thanks, Chris
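The manual route sketched in the question could look like the following (hedged: written against the 0.16-era classes, whose exact signatures vary by release; the "output" path and Text types are placeholders). If your release also includes MapFileOutputFormat.getEntry(readers, partitioner, key, value), it wraps this same lookup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class MapFileLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One MapFile.Reader per reduce partition (part-00000, ...).
            MapFile.Reader[] readers =
                MapFileOutputFormat.getReaders(fs, new Path("output"), conf);

            // Route the key through the same partitioner the job used,
            // then consult only the matching map file.
            HashPartitioner partitioner = new HashPartitioner();
            Text key = new Text("some-key");
            Text value = new Text();
            int part = partitioner.getPartition(key, null, readers.length);
            readers[part].get(key, value);
            System.out.println(value);
        }
    }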
Re: Hadoop streaming cacheArchive
Amareshwari, thanks for your help. This turned out to be user error (when packaging my JAR, I inadvertently included a lib directory, so the libraries actually existed in HDFS as ./lib/lib/perl..., when I was only expecting ./lib/perl...).

Thanks again,
Norbert

On Thu, Mar 20, 2008 at 3:03 AM, Amareshwari Sriramadasu [EMAIL PROTECTED] wrote:
> Norbert Burger wrote:
>> I'm trying to use the cacheArchive command-line option with
>> hadoop-0.15.3-streaming.jar. I'm using the option as follows:
>>
>>     -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
>>
>> Unfortunately, my Perl scripts fail with an error consistent with not
>> being able to find the 'lib' directory (which, as I understand it,
>> should point back to an extracted version of lib.jar).
>
> Here, lib is created as a symlink in the task's working directory. It
> will contain the jar file and the extracted contents of the jar. Where
> are your Perl scripts searching for lib? Is '.' included in your
> classpath? Otherwise you can use the mapred.job.classpath.archives config
> item, which adds the files to the classpath as well as to the distributed
> cache:
>
>     -jobconf mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib
>
>> I know that the original JAR exists in HDFS, but I don't see any
>> evidence of lib.jar or a link called 'lib' inside my job.jar.
>
> The link 'lib' will not be part of job.jar. The archive is distributed to
> all the nodes at task launch, and the task's current working directory
> will have the link 'lib' pointing to the jar in the cache.
>
>> How can I troubleshoot cacheArchive further? Should the files/dirs
>> specified via cacheArchive be contained inside the job.jar? If not,
>> where should they be in HDFS?
>
> They can be anywhere on HDFS. You need to give the complete path to add
> them to the cache.
Re: Limiting Total # of TaskTracker threads
On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning [EMAIL PROTECTED] wrote:
> I think the original request was to limit the sum of maps and reduces
> rather than limiting the two parameters independently.

Ted, yes, this is exactly what I'm looking for. I just found an issue which seems to state that the old deprecated property is still there, but not documented: https://issues.apache.org/jira/browse/HADOOP-2300

I tried using the max-tasks property in combination with setting the new values, but that didn't seem to work. =( My machine labelled LIMITED MACHINE below had 2 maps and 1 reduce running at the same time. The scenario I have is that I want to run multiple concurrent jobs through my cluster and have the CPU usage for that node be bounded. Should I file a new issue? This was all with Hadoop 0.16.0.

LIMITED MACHINE:

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>2</value>
      <description>The maximum number of total tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
      <description>The maximum number of map tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
      <description>The maximum number of reduce tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

OTHER CLUSTER MACHINES:

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>8</value>
      <description>The maximum number of total tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
      <description>The maximum number of map tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
      <description>The maximum number of reduce tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

On 3/18/08 5:26 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
> The map/reduce tasks are not threads, they are run in separate JVMs which
> are forked by the tasktracker.

Arun, yes, I did mean tasks, not threads.

--
Jimmy
Default Combiner or default combining behaviour?
Hi,

The MapReduce tutorial mentions Combiners only in passing. Is there a default Combiner or default combining behaviour? Concretely, I want to make sure that records are not getting combined behind the scenes in some way without me seeing it, causing me to lose data. For instance, if there were a default Combiner or default combining behaviour that collapsed multiple records with identical keys and values into a single record, I'd like to avoid that. Instead of blindly collapsing identical records, I'd want to aggregate their values and emit the aggregate.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
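For reference, combining only happens when a combiner class is set explicitly on the job; nothing runs by default. A minimal sketch of the aggregate-rather-than-collapse setup described above, using the stock LongSumReducer (this assumes LongWritable values; the rest of the job wiring is omitted):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;

    public class CombinerSetup {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            // Without this line, no combining happens at all.
            // LongSumReducer sums the values for each key instead of
            // collapsing duplicate records.
            conf.setCombinerClass(LongSumReducer.class);
        }
    }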
Re: Partitioning reduce output by date
Thank you, Doug and Ted. This pointed me in the right direction, which led to a custom OutputFormat and a RecordWriter that opens and closes the DataOutputStream based on the current key (if the current key differs from the previous key, close the previous output and open a new one, then write).

As for partitioning, that worked too. My getPartition method now has:

    int dateHash = startDate.hashCode();
    if (dateHash < 0)          // note: -Integer.MIN_VALUE is still negative;
        dateHash = -dateHash;  // (dateHash & Integer.MAX_VALUE) avoids that edge case
    int partitionID = dateHash % numPartitions;
    return partitionID;

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doug Cutting [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, March 19, 2008 4:39:04 PM
Subject: Re: Partitioning reduce output by date

> Otis Gospodnetic wrote:
>> That numPartitions corresponds to the number of reduce tasks. What I
>> need is partitioning that corresponds to the number of unique dates
>> (yyyy-mm-dd) processed by the Mapper and not the number of reduce tasks.
>> I don't know the number of distinct dates in the input ahead of time,
>> though, so I cannot just specify the same number of reduces. I *can* get
>> the number of unique dates by keeping track of dates in map(). I was
>> going to take this approach and use this number in the getPartition()
>> method, but apparently getPartition(...) is called as each input row is
>> processed by the map() call. This causes a problem for me, as I know the
>> total number of unique dates only after *all* of the input is processed
>> by map().
>
> The number of partitions is indeed the number of reduces. If you were to
> compute it during map, then each map might generate a different number.
> Each map must partition into the same space, so that all partition 0 data
> can go to one reduce, partition 1 to another, and so on.
>
> I think Ted pointed you in the right direction: your Partitioner should
> partition by the hash of the date, then your OutputFormat should start
> writing a new file each time the date changes. That will give you a
> unique file per date.
>
> Doug
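The roll-on-key-change writer Otis describes could have a core like this: a hedged sketch, with the class name, base directory, and Text key/value types all hypothetical, and the surrounding OutputFormat plumbing omitted:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;

    public class DateRollingWriter {
        private final FileSystem fs;
        private final Path baseDir;     // directory for the per-date files
        private String currentDate;     // date of the stream currently open
        private FSDataOutputStream out;

        public DateRollingWriter(FileSystem fs, Path baseDir) {
            this.fs = fs;
            this.baseDir = baseDir;
        }

        // Called once per record. Reduce input arrives sorted by key, so
        // each change of date means the previous date's file is complete.
        public void write(Text dateKey, Text value) throws IOException {
            String date = dateKey.toString();
            if (!date.equals(currentDate)) {
                if (out != null) {
                    out.close();
                }
                out = fs.create(new Path(baseDir, date));
                currentDate = date;
            }
            out.writeBytes(value.toString() + "\n");
        }

        public void close() throws IOException {
            if (out != null) {
                out.close();
            }
        }
    }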
RE: Partitioning reduce output by date
If you want to output data to different files based on date or any other parts of the value, you may want to check https://issues.apache.org/jira/browse/HADOOP-2906

Runping

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 20, 2008 4:00 PM
To: core-user@hadoop.apache.org
Subject: Re: Partitioning reduce output by date

> Thank you, Doug and Ted. This pointed me in the right direction, which
> led to a custom OutputFormat and a RecordWriter that opens and closes the
> DataOutputStream based on the current key (if the current key differs
> from the previous key, close the previous output and open a new one, then
> write).
>
> As for partitioning, that worked too. My getPartition method now has:
>
>     int dateHash = startDate.hashCode();
>     if (dateHash < 0)
>         dateHash = -dateHash;
>     int partitionID = dateHash % numPartitions;
>     return partitionID;
>
> Thanks,
> Otis
>
> ----- Original Message -----
> From: Doug Cutting [EMAIL PROTECTED]
> To: core-user@hadoop.apache.org
> Sent: Wednesday, March 19, 2008 4:39:04 PM
> Subject: Re: Partitioning reduce output by date
>
>> Otis Gospodnetic wrote:
>>> That numPartitions corresponds to the number of reduce tasks. What I
>>> need is partitioning that corresponds to the number of unique dates
>>> (yyyy-mm-dd) processed by the Mapper and not the number of reduce
>>> tasks. I don't know the number of distinct dates in the input ahead of
>>> time, though, so I cannot just specify the same number of reduces. I
>>> *can* get the number of unique dates by keeping track of dates in
>>> map(). I was going to take this approach and use this number in the
>>> getPartition() method, but apparently getPartition(...) is called as
>>> each input row is processed by the map() call. This causes a problem
>>> for me, as I know the total number of unique dates only after *all* of
>>> the input is processed by map().
>>
>> The number of partitions is indeed the number of reduces. If you were
>> to compute it during map, then each map might generate a different
>> number. Each map must partition into the same space, so that all
>> partition 0 data can go to one reduce, partition 1 to another, and so
>> on.
>>
>> I think Ted pointed you in the right direction: your Partitioner should
>> partition by the hash of the date, then your OutputFormat should start
>> writing a new file each time the date changes. That will give you a
>> unique file per date.
>>
>> Doug
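HADOOP-2906 became the MultipleOutputFormat family in a later release; assuming that API, routing records by date key reduces to overriding a single method (a hedged sketch; the class name here is hypothetical, and the method name should be checked against the version you run):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class DateOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        // Route each record to a file named after its date key,
        // e.g. key "2008-03-20" -> an output file named 2008-03-20.
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value,
                                                     String name) {
            return key.toString();
        }
    }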
Re: Limiting Total # of TaskTracker threads
Hi,

> The map/reduce tasks are not threads, they are run in separate JVMs which
> are forked by the tasktracker.

I don't understand why. Is this a design choice to support task failures? I would think that, on the other hand, running a thread queue (of tasks) per job per JVM would greatly improve performance, since there would be fewer JVM init times.

K. Honsali

On 21/03/2008, Jimmy Wan [EMAIL PROTECTED] wrote:
> On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning [EMAIL PROTECTED] wrote:
>> I think the original request was to limit the sum of maps and reduces
>> rather than limiting the two parameters independently.
>
> Ted, yes, this is exactly what I'm looking for. I just found an issue
> which seems to state that the old deprecated property is still there, but
> not documented: https://issues.apache.org/jira/browse/HADOOP-2300
>
> I tried using the max-tasks property in combination with setting the new
> values, but that didn't seem to work. =( My machine labelled LIMITED
> MACHINE below had 2 maps and 1 reduce running at the same time. The
> scenario I have is that I want to run multiple concurrent jobs through my
> cluster and have the CPU usage for that node be bounded. Should I file a
> new issue? This was all with Hadoop 0.16.0.
>
> LIMITED MACHINE:
>
>     <property>
>       <name>mapred.tasktracker.tasks.maximum</name>
>       <value>2</value>
>       <description>The maximum number of total tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.map.tasks.maximum</name>
>       <value>1</value>
>       <description>The maximum number of map tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>       <value>1</value>
>       <description>The maximum number of reduce tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
> OTHER CLUSTER MACHINES:
>
>     <property>
>       <name>mapred.tasktracker.tasks.maximum</name>
>       <value>8</value>
>       <description>The maximum number of total tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.map.tasks.maximum</name>
>       <value>4</value>
>       <description>The maximum number of map tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>       <value>4</value>
>       <description>The maximum number of reduce tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
> On 3/18/08 5:26 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
>> The map/reduce tasks are not threads, they are run in separate JVMs
>> which are forked by the tasktracker.
>
> Arun, yes, I did mean tasks, not threads.
>
> --
> Jimmy