Re: Splitting in various files
Could anyone please advise? On Sat, Apr 19, 2008 at 1:33 PM, Aayush Garg [EMAIL PROTECTED] wrote: Hi, I have written the following code for writing my key/value pairs to a file, and this file is then read by another MR job. Path pth = new Path("./dir1/dir2/filename"); FileSystem fs = pth.getFileSystem(jobconf); SequenceFile.Writer sqwrite = new SequenceFile.Writer(fs, conf, pth, Text.class, Custom.class); sqwrite.append(key, value); sqwrite.close(); The problem is that all my data gets written to a single file (filename). How can it be split across a number of files? If I give only the path of the directory in this program then it does not compile. I give only the directory path /dir1/dir2 to the other MapReduce job and it reads the file. Thanks, -- Aayush Garg, Phone: +41 76 482 240
Re: Interleaving maps/reduces from multiple jobs on the same tasktracker
Amar Kamat wrote: Jiaqi Tan wrote: Hi, Will Hadoop ever interleave multiple maps/reduces from different jobs on the same tasktracker? No. Suppose I have 2 jobs submitted to a jobtracker, one after the other. Must all maps/reduces from the first submitted job be completed before the tasktrackers will run any of the maps/reduces from the second submitted job? Tasks from a new job are never scheduled unless the tasks from other, higher-priority jobs are completely scheduled. Among jobs with the same priority, the start time is what finally matters. A job's priority at the JobTracker can be changed on the fly, but as of now that's not supported by the JobClient; you can change it through the job's web UI. Amar Thanks. Jiaqi Tan
datanode files list
Is there a way to get the list of files on each datanode? I need to be able to get all the names of the files on a specific datanode. Is there a way to do it?
Re: jar files on NFS instead of DistributedCache
Joydeep Sen Sarma wrote: i would love this feature. it does not exist currently. if we set the classpath for the tasktracker - then as mentioned - it's for all the tasks. if the classpath can be set on a per task basis - that works as an excellent solution with a nfs based environment for specifying multiple jar files for the job. related - see https://issues.apache.org/jira/browse/HADOOP-1622. it asks for multiple jars to be packaged into the jobcache. but with nfs - all we need is a per job classpath option. True, but you are creating a SPOF. There is nothing like 200 boxes all blocked with their consoles going 'NFS server not responding for 30s' to make you wish you weren't using NFS. Because NFS IO is done in the kernel, there's no way to put policy into the apps about retry, timeouts and behaviour on failure. Just say no to NFS. -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: Splitting in various files
I just tried the same thing (mapred.task.id) as you suggested, but I am getting one file named null in my directory. On Mon, Apr 21, 2008 at 8:33 AM, Amar Kamat [EMAIL PROTECTED] wrote: Aayush Garg wrote: Could anyone please advise? On Sat, Apr 19, 2008 at 1:33 PM, Aayush Garg [EMAIL PROTECTED] wrote: Hi, I have written the following code for writing my key/value pairs to a file, and this file is then read by another MR job. Path pth = new Path("./dir1/dir2/filename"); FileSystem fs = pth.getFileSystem(jobconf); SequenceFile.Writer sqwrite = new SequenceFile.Writer(fs, conf, pth, Text.class, Custom.class); sqwrite.append(key, value); sqwrite.close(); The problem is that all my data gets written to a single file (filename). How can it be split across a number of files? If I give only the path of the directory in What do you mean by splitting a file across multiple files? If you want a separate file for each map/reduce task then you can use conf.get("mapred.task.id") to get the task id, which is unique for that task. Now you can name the file like Path pth = new Path("./dir1/dir2/" + filename + "-" + conf.get("mapred.task.id")); Amar this program then it does not compile. I give only the directory path /dir1/dir2 to the other MapReduce job and it reads the file. Thanks,
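For the archives, a minimal sketch of what Amar suggests. The path has to be built inside the task (for example in configure() of the old mapred API), where the framework has already populated mapred.task.id; a JobConf created by hand in the driver won't have that property set, which is one way to end up with "null" in the file name. The directory, file name, and value class below are placeholders taken from the thread; mix this into your own mapper or reducer class.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class PerTaskWriter extends MapReduceBase {
    private SequenceFile.Writer sqwrite;

    public void configure(JobConf conf) {
        try {
            // mapred.task.id is set by the framework for every task attempt,
            // so each map/reduce task ends up with its own output file.
            Path pth = new Path("./dir1/dir2/filename-" + conf.get("mapred.task.id"));
            FileSystem fs = pth.getFileSystem(conf);
            sqwrite = new SequenceFile.Writer(fs, conf, pth, Text.class, IntWritable.class);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void close() throws IOException {
        if (sqwrite != null) {
            sqwrite.close();
        }
    }
}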
Re: Error in start up
Could anyone please help me with the error below? I am not able to start HDFS because of it. Thanks, On Sat, Apr 19, 2008 at 7:25 PM, Aayush Garg [EMAIL PROTECTED] wrote: I have my hadoop-site.xml correct, but it still fails in this way. On Sat, Apr 19, 2008 at 6:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote: On Sat, Apr 19, 2008 at 9:53 AM, Aayush Garg [EMAIL PROTECTED] wrote: I am getting the following error when starting Hadoop in pseudo-distributed mode: bin/start-all.sh localhost: starting datanode, logging to /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-datanode-R61-neptun.out localhost: starting secondarynamenode, logging to /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-secondarynamenode-R61-neptun.out localhost: Exception in thread main java.lang.IllegalArgumentException: port out of range:-1 Hello, I'm a Hadoop newbie, but when I got this error it seemed to be caused by an empty/incorrect hadoop-site.xml. See http://wiki.apache.org/hadoop/QuickStart#head-530fd2e5b7fc3f35a210f3090f125416a79c2e1b -Stuart
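Following up on Stuart's hint, here is a minimal hadoop-site.xml for the pseudo-distributed QuickStart setup. Property names are those of the 0.15/0.16 line and the ports are just the conventional examples, so adjust as needed; an unset or malformed fs.default.name is a common way to end up with the "port out of range:-1" exception.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>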
Using ArrayWritable of type IntWritable
Hi, From the API, I understand I should create a new class as follows: public class IntArrayWritable extends ArrayWritable { public IntArrayWritable() { super(IntWritable.class); } } In the reducer, when executing OutputCollector.collect(WritableComparable key, IntArrayWritable arr) I am getting output like: key1[EMAIL PROTECTED] key2[EMAIL PROTECTED] key3[EMAIL PROTECTED] key4[EMAIL PROTECTED] ... What else do I have to override in ArrayWritable to get the IntWritable values written to the output files by the reducers? If you have some ready code for such a case I would be very thankful. Cheers,
Re: New bee quick questions :-)
vikas wrote: Hi, I'm new to Hadoop and aiming to develop a good amount of code with it. I have some quick questions; it would be highly appreciated if someone could answer them. I was able to run Hadoop in a Cygwin environment and run the examples both in standalone mode and in a 2-node cluster. 1) How can I overcome the difficulty of giving a password for SSH logins whenever DataNodes are getting started? Create SSH keys (either user or host based) and pair the hosts with these keys. This is the first URL I found for SSH public key authentication: http://sial.org/howto/openssh/publickey-auth/ (a minimal sketch follows this message). 2) I've put a 1.5 GB file on my master node, where a DataNode is also running. I want to see how load balancing can be done so that disk space is utilized on the other datanodes too. Not sure how to answer this one. HDFS has knowledge of three entities: the node, the rack, and the rest. In the default configuration, each block is replicated 3 times, one replica per entity. If you don't have racks, you might want to fine-tune replication of files through the HDFS shell. 3) How can I add a new DataNode without stopping Hadoop? Just add it to the slaves file and run start-dfs.sh. Already-running nodes won't be touched. 4) Let us suppose I want to shut down one datanode for maintenance purposes. Is there any way to inform Hadoop that this particular datanode is going down, so please make sure the data on it is replicated elsewhere? Replication of blocks with a factor of at least 2 should do the job. In the general case, default replication is 3. You can check the replication factor through the HDFS shell. 5) I was going through some videos on MapReduce and a few Yahoo tech talks; in them they said a Hadoop cluster has multiple cores -- what does this mean? Are you talking about multi-core processors? 5.1) Can I have multiple instances of namenodes running in a cluster, apart from secondary nodes? Not sure on this, but as far as I know only one namenode should be running. 6) If I go on to create huge files, will they be balanced among all the datanodes, or do I need to change the file creation location in the application? Files are divided into blocks, and blocks are replicated. Huge files are simply composed of a larger set of blocks. In principle, you don't know where your blocks will end up, apart from the entities I mentioned before. And in principle, you shouldn't care where they end up, because Hadoop applications will take care of sending tasks where the data reside. Ciao, Luca
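A minimal sketch of the passwordless-SSH setup behind answer 1 (assumes OpenSSH and that the same user starts all the daemons), together with the HDFS shell command Luca alludes to for tuning replication of an existing path. The path and replication factor are only examples.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost echo ok        # should no longer prompt for a password

# Raise (or lower) the replication factor of an existing directory, recursively:
bin/hadoop dfs -setrep -R 3 /user/vikas/data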
hadoop 0.16.3 problems to submit job
Hi! When I'm trying to submit a streaming job to my EC2/S3 cluster, I'm getting the following errors: additionalConfSpec_:null null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar53992/] [] /tmp/streamjob53993.jar tmpDir=null ^[[1;5A08/04/21 14:26:11 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 14:26:11 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 14:26:11 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:27:44 WARN httpclient.RestS3Service: Retrying request - attempt 2 of 5 08/04/21 14:27:44 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketTimeoutException) caught when processing request: Read timed out 08/04/21 14:27:44 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:32:12 WARN service.S3Service: Encountered 1 S3 Internal Server error(s), will retry in 50ms 08/04/21 14:32:21 WARN service.S3Service: Encountered 1 S3 Internal Server error(s), will retry in 50ms 08/04/21 14:32:26 WARN service.S3Service: Encountered 1 S3 Internal Server error(s), will retry in 50ms 08/04/21 14:32:31 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 14:32:31 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 14:32:31 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:32:35 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 14:32:35 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 14:32:35 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:36:06 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 14:36:06 INFO httpclient.HttpMethodDirector: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server s3.amazonaws.com failed to respond 08/04/21 14:36:06 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:36:34 WARN httpclient.RestS3Service: Retrying request - attempt 2 of 5 08/04/21 14:36:34 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 14:36:34 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:41:47 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 14:41:47 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketTimeoutException) caught when processing request: Read timed out 08/04/21 14:41:47 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 14:54:00 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 14:54:00 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 14:54:00 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 15:04:48 INFO mapred.FileInputFormat: Total input paths to process : 18611 08/04/21 15:06:43 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 15:06:43 INFO httpclient.HttpMethodDirector: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server s3.amazonaws.com failed to respond 08/04/21 15:06:43 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 15:14:33 WARN service.S3Service: Encountered 1 S3 
Internal Server error(s), will retry in 50ms 08/04/21 15:14:34 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 15:14:34 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 15:14:34 INFO httpclient.HttpMethodDirector: Retrying request 08/04/21 15:15:02 WARN httpclient.RestS3Service: Retrying request - attempt 1 of 5 08/04/21 15:15:02 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Connection reset 08/04/21 15:15:02 INFO httpclient.HttpMethodDirector: Retrying request Now I wonder whether this is normal. Notice that it took Hadoop 38 minutes to figure out the number of input files. It has now been running for around an hour and I still have no idea whether the job will be submitted or will hang forever (I had such cases with 0.16.2, where I stopped the submission after half a day). Any ideas? TIA, Andreas Kostyrka
Re: New bee quick questions :-)
On 4/21/08 3:36 AM, vikas [EMAIL PROTECTED] wrote: Most of your questions have been answered by Luca, from what I can see, so let me tackle the rest a bit... 4) Let us suppose I want to shut down one datanode for maintenance purposes. Is there any way to inform Hadoop that this particular datanode is going down, so please make sure the data on it is replicated elsewhere? You want to do datanode decommissioning. See http://wiki.apache.org/hadoop/FAQ#17 for details (a sketch of the configuration follows this message). 5) I was going through some videos on MapReduce and a few Yahoo tech talks; in them they said a Hadoop cluster has multiple cores -- what does this mean? I haven't watched the tech talks in ages, but we generally refer to cores in a variety of ways. There is the single-physical-box version--an individual processor has more than one execution unit, thereby giving it a degree of parallelism. Then there is the complete grid count--an individual grid can have lots and lots of processors with lots and lots of individual cores on those processors, which works out to be a pretty good rough estimate of how many individual Hadoop tasks can be run simultaneously. 5.1) Can I have multiple instances of namenodes running in a cluster, apart from secondary nodes? No. The name node is a single point of failure in the system. 6) If I go on to create huge files, will they be balanced among all the datanodes, or do I need to change the file creation location in the application? In addition to what Luca said, be aware that if you load a file on a machine with a datanode process, the data for that file will *always* get loaded to that machine. This can cause your datanodes to get extremely unbalanced. You are much better off doing data loads *off grid*/from another machine. Since you only need the Hadoop configuration and binaries available (in other words, no Hadoop processes need be running), this usually isn't too painful to do. In 0.16.x, there is a rebalancer to help fix this situation, but I have no practical experience with it yet to say whether or not it works.
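A sketch of the decommissioning flow the FAQ entry describes, with the caveat that the exact property and command names vary a little across the 0.1x releases, so check the FAQ for your version; the exclude-file location here is made up.

<!-- in the namenode's hadoop-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/hadoop/conf/excludes</value>
</property>

# Add the datanode's hostname to conf/excludes, then make the namenode re-read it:
bin/hadoop dfsadmin -refreshNodes
# The node is reported as "decommission in progress" until its blocks have been
# re-replicated elsewhere, after which it is safe to stop it.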
Re: datanode files list
Datanodes don't necessarily contain complete files. It is possible to enumerate all files and to find out which datanodes host different blocks from these files. What did you need to do? On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote: Is there a way to get the list of files on each datanode? I need to be able to get all the names of the files on a specific datanode? is there a way to do it?
Re: datanode files list
I am using Hadoop HDFS as a distributed file system. On each DFS node I have another process which needs to read the local HDFS files. Right now I'm calling the NameNode in order to get the list of all the files in the cluster. For each file I check whether it is a local file (one of the locations is the host of the node); if it is, I read it. Disadvantages: * This solution works only if the entire file is not split. * It involves the NameNode. * Each node needs to iterate over all the files in the cluster. There must be a better way to do it. The perfect way would be to call the DataNode and get a list of the local files and their blocks. On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning [EMAIL PROTECTED] wrote: Datanodes don't necessarily contain complete files. It is possible to enumerate all files and to find out which datanodes host different blocks from these files. What did you need to do? On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote: Is there a way to get the list of files on each datanode? I need to be able to get all the names of the files on a specific datanode. Is there a way to do it?
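A rough sketch of the per-file locality check Shimi describes, assuming the FileSystem block-location calls (getFileBlockLocations in newer releases; 0.16-era builds expose the older getFileCacheHints with the same idea). The root directory is a placeholder and the hostname comparison is deliberately naive.

import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFiles {
    public static void main(String[] args) throws Exception {
        String localHost = InetAddress.getLocalHost().getHostName();
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus stat : fs.listStatus(new Path("/some/root"))) {
            // Treat the file as "local" only if every block has a replica on this host.
            boolean allLocal = true;
            for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
                boolean onThisHost = false;
                for (String host : loc.getHosts()) {
                    if (host.startsWith(localHost)) { onThisHost = true; break; }
                }
                if (!onThisHost) { allLocal = false; break; }
            }
            if (allLocal) {
                System.out.println("local: " + stat.getPath());
            }
        }
    }
}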
Re: datanode files list
This is kind of odd that you are doing this. It really sounds like a replication of what hadoop is doing. Why not just run a map process and have hadoop figure out which blocks are where? Can you say more about *why* you are doing this, not just what you are trying to do? On 4/21/08 10:28 AM, Shimi K [EMAIL PROTECTED] wrote: I am using Hadoop HDFS as a distributed file system. On each DFS node I have another process which needs to read the local HDFS files. Right now I'm calling the NameNode in order to get the list of all the files in the cluster. For each file I check if it is a local file (one of the locations is the host of the node), if it is I read it. Disadvantages: * This solution works only if the entire file is not split. * It involves the NameNode. * Each node needs to iterate on all the files in the cluster. There must be a better way to do it. The perfect way will be to call the DataNode and to get a list of the local files and their blocks. On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning [EMAIL PROTECTED] wrote: Datanodes don't necessarily contain complete files. It is possible to enumerate all files and to find out which datanodes host different blocks from these files. What did you need to do? On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote: Is there a way to get the list of files on each datanode? I need to be able to get all the names of the files on a specific datanode? is there a way to do it?
Re: datanode files list
Do you remember the Caching frequently map input files thread? http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200802.mbox/[EMAIL PROTECTED] On Mon, Apr 21, 2008 at 8:31 PM, Ted Dunning [EMAIL PROTECTED] wrote: This is kind of odd that you are doing this. It really sounds like a replication of what hadoop is doing. Why not just run a map process and have hadoop figure out which blocks are where? Can you say more about *why* you are doing this, not just what you are trying to do? On 4/21/08 10:28 AM, Shimi K [EMAIL PROTECTED] wrote: I am using Hadoop HDFS as a distributed file system. On each DFS node I have another process which needs to read the local HDFS files. Right now I'm calling the NameNode in order to get the list of all the files in the cluster. For each file I check if it is a local file (one of the locations is the host of the node), if it is I read it. Disadvantages: * This solution works only if the entire file is not split. * It involves the NameNode. * Each node needs to iterate on all the files in the cluster. There must be a better way to do it. The perfect way will be to call the DataNode and to get a list of the local files and their blocks. On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning [EMAIL PROTECTED] wrote: Datanodes don't necessarily contain complete files. It is possible to enumerate all files and to find out which datanodes host different blocks from these files. What did you need to do? On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote: Is there a way to get the list of files on each datanode? I need to be able to get all the names of the files on a specific datanode? is there a way to do it?
Re: Any API used to get the last modified time of the File in HDFS?
String fileName = "file.txt"; Configuration config = new Configuration(); FileSystem fs = FileSystem.get(config); Path filePath = new Path(fileName); FileStatus fileStatus = fs.getFileStatus(filePath); long modificationTime = fileStatus.getModificationTime(); On Mon, Apr 21, 2008 at 5:35 AM, Samuel Guo [EMAIL PROTECTED] wrote: Hi all, Can anyone tell me: is there any API I can use to get metadata info such as the last modified time, etc., of a file in HDFS? Thanks a lot :) Best Wishes :) Samuel Guo
Re: Hadoop remembering old mapred.map.tasks
It turns out Hadoop was not remembering anything and the answer is in the FAQ: http://wiki.apache.org/hadoop/FAQ#13 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Sunday, April 20, 2008 8:14:43 PM Subject: Hadoop remembering old mapred.map.tasks Hi, Does Hadoop cache settings set in hadoop-*.xml between runs? I'm using Hadoop 0.16.2 and initially set the number of map and reduce tasks to 8 of each. After running a number of jobs I wanted to increase that number (to 23 maps and 11 reduces), so I changed the mapred.map.tasks and mapred.reduce.tasks properties in hadoop-site.xml. I then stopped everything (stop-all.sh) and copied my modified hadoop-site.xml to all nodes in the cluster. I also rebuilt the .job file and pushed that out to all nodes, too. However, when I start everything up again I *still* see Map Task Capacity is equal to 8, and the same for Reduce Task Capacity. Am I supposed to do something in addition to the above to make Hadoop forget my old settings? I can't find *any* references to mapred.map.tasks in any of the Hadoop files except for my hadoop-site.xml, so I can't figure out why Hadoop is still stuck on 8. Although the max capacity is set to 8, when I run my jobs now I *do* see that they get broken up into 23 maps and 11 reduces (it was 8 before), but only 8 of them run in parallel. There are 4 dual-core machines in the cluster for a total of 8 cores. Is Hadoop able to figure this out, and is that why it runs only 8 tasks in parallel, despite my higher settings? Thanks, Otis
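For anyone hitting the same thing: the capacity shown on the web UI comes from the per-tasktracker maximums, not from mapred.map.tasks. A sketch of the relevant hadoop-site.xml fragment follows; in 0.16.x these are split into separate map and reduce maxima (some earlier releases used a single mapred.tasktracker.tasks.maximum), they must be pushed to every tasktracker, and the tasktrackers restarted. The value 4 is only an example.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>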
Re: incremental re-execution
Hi Ted, Thanks for your example. It's very interesting to learn about specific map reduce applications. It's non-obvious to me that it's a good idea to combine two map- reduce pairs by using the cross product of the intermediate states- you might wind up building an O(n^2) intermediate data structure instead of two O(n) ones. Even with parallelism this is not good. I'm wondering if in your example you're relying on the fact that the viewer-video matrix is sparse, so many of the pairs will have value 0? Does the map phase emit intermediate results with 0-values? Thanks, Shirley Take something like what we see in our logs of viewing. We have several log entries per view each of which contains an identifier for the viewer and for the video. These events occur on start, on progress and on completion. We want to have total views per viewer and total views per video. You can pass over the logs twice to get this data or you can pass over the data once to get total views per (viewer x video). This last is a semi- aggregated form that has no utility except that it is much smaller than the original data. Reducing the semi-aggregated from to viewer counts and video counts results in shorter total processing than processing the raw data twice. If you start with a program that has two map-reduce passes over the same data, it is likely very difficult to intuit that they could use the same intermediate data. Even with something like Pig, where you have a good representation for internal optimizations, it is probably going to be difficult to convert the two MR steps into one pre-aggregation and two final aggregations. On 4/20/08 7:39 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi Ted, I'm confused about your second comment below: in the case where semi- aggregated data is used to produce multiple low-level aggregates, what sorts of detection did you have in mind which would be hard to do? Thanks, Shirley On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote: I re-use outputs of MR programs pretty often, but when I need to reuse the map output, I just manually break the process apart into a map+identity-reducer and the multiple reducers. This is rare. It is common to have a semi-aggregated form that is much small than the original data which in turn can be used to produce multiple low definition aggregates. I would find it very surprising if you could detect these sorts of situations. On 4/16/08 5:26 PM, Shirley Cohen [EMAIL PROTECTED] wrote: Dear Hadoop Users, I'm writing to find out what you think about being able to incrementally re-execute a map reduce job. My understanding is that the current framework doesn't support it and I'd like to know whether, in your opinion, having this capability could help to speed up development and debugging. My specific questions are: 1) Do you have to re-run a job often enough that it would be valuable to incrementally re-run it? 2) Would it be helpful to save the output from a whole bunch of mappers and then try to detect whether this output can be re-used when a new job is launched? 3) Would it be helpful to be able to use the output from a map job on many reducers? Please let me know what your thoughts are and what specific applications you are working on. Much appreciation, Shirley
Re: incremental re-execution
It isn't that bad. Remember the input is sparse. Actually, the size of the original data provides a sharp bound on the size of the semi-aggregated data. In practice, the semi-aggregated data will have 1/k as many records as the original data where k is the average count. The records in the semi-aggregated result are also typically considerably smaller than the original records. Even in long tail environments, k is typically 5-20 and each record is often 2-4x smaller than the originals. This means that there is typically an order of magnitude saving in traversing the semi-aggregated result. The other major approach would be to emit typed records for each full level aggregation that you intend to do. This gives you all of your aggregates together in a single output, but you can jigger the partition function so that you have a single reduce output file per aggregate type. This approach isn't quite a simple as the semi-aggregate approach and doesn't allow retrospective ad hoc scans, but it is slightly to considerably faster than the semi-agg approach, especially if you are aggregating on more than two keys. On 4/21/08 1:06 PM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi Ted, Thanks for your example. It's very interesting to learn about specific map reduce applications. It's non-obvious to me that it's a good idea to combine two map- reduce pairs by using the cross product of the intermediate states- you might wind up building an O(n^2) intermediate data structure instead of two O(n) ones. Even with parallelism this is not good. I'm wondering if in your example you're relying on the fact that the viewer-video matrix is sparse, so many of the pairs will have value 0? Does the map phase emit intermediate results with 0-values? Thanks, Shirley Take something like what we see in our logs of viewing. We have several log entries per view each of which contains an identifier for the viewer and for the video. These events occur on start, on progress and on completion. We want to have total views per viewer and total views per video. You can pass over the logs twice to get this data or you can pass over the data once to get total views per (viewer x video). This last is a semi- aggregated form that has no utility except that it is much smaller than the original data. Reducing the semi-aggregated from to viewer counts and video counts results in shorter total processing than processing the raw data twice. If you start with a program that has two map-reduce passes over the same data, it is likely very difficult to intuit that they could use the same intermediate data. Even with something like Pig, where you have a good representation for internal optimizations, it is probably going to be difficult to convert the two MR steps into one pre-aggregation and two final aggregations. On 4/20/08 7:39 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi Ted, I'm confused about your second comment below: in the case where semi- aggregated data is used to produce multiple low-level aggregates, what sorts of detection did you have in mind which would be hard to do? Thanks, Shirley On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote: I re-use outputs of MR programs pretty often, but when I need to reuse the map output, I just manually break the process apart into a map+identity-reducer and the multiple reducers. This is rare. It is common to have a semi-aggregated form that is much small than the original data which in turn can be used to produce multiple low definition aggregates. 
I would find it very surprising if you could detect these sorts of situations. On 4/16/08 5:26 PM, Shirley Cohen [EMAIL PROTECTED] wrote: Dear Hadoop Users, I'm writing to find out what you think about being able to incrementally re-execute a map reduce job. My understanding is that the current framework doesn't support it and I'd like to know whether, in your opinion, having this capability could help to speed up development and debugging. My specific questions are: 1) Do you have to re-run a job often enough that it would be valuable to incrementally re-run it? 2) Would it be helpful to save the output from a whole bunch of mappers and then try to detect whether this output can be re-used when a new job is launched? 3) Would it be helpful to be able to use the output from a map job on many reducers? Please let me know what your thoughts are and what specific applications you are working on. Much appreciation, Shirley
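A compact sketch of the semi-aggregation pass being discussed, written against the old mapred API; the field positions, tab separator, and class names are all illustrative. The point is that the two final aggregates (views per viewer, views per video) are then produced by trivial follow-up jobs over this much smaller output rather than by two passes over the raw logs.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Pass 1 mapper: raw view event -> ((viewer, video), 1)
public class SemiAggMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
        // Assumes viewer id and video id are the first two tab-separated fields of the log line.
        String[] fields = line.toString().split("\t");
        out.collect(new Text(fields[0] + "\t" + fields[1]), ONE);
    }
}

// Pass 1 reducer (also usable as a combiner): sum counts per (viewer, video) pair.
class SemiAggReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text viewerVideo, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
        long sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        out.collect(viewerVideo, new LongWritable(sum));
    }
}
// Pass 2a re-maps each semi-aggregated record keyed by viewer and sums; pass 2b does the
// same keyed by video. Both read the semi-aggregated output, not the raw logs.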
Re: Using ArrayWritable of type IntWritable
CloudyEye wrote: What else do I have to override in ArrayWritable to get the IntWritable values written to the output files by the reducers? public String toString(); Doug
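Putting Doug's one-liner together with the class from the original question, a minimal sketch (the space separator is arbitrary):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }

    // Without this, the output format falls back to Object's default toString(),
    // which is where the ClassName@hashcode strings in the question come from.
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Writable w : get()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(((IntWritable) w).get());
        }
        return sb.toString();
    }
}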
Re: jar files on NFS instead of DistributedCache
I agree with the fair and balanced part. I always try to keep my clusters fair and balanced! Joydeep should mention his background. In any case, I agree that high-end filers may provide good enough NFS service, but I would also contend that HDFS has been better for me than NFS from generic servers. On 4/21/08 2:12 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote: this is not to say that this is the right solution for large clusters or those trying to run nfs servers of linux (which, last i heard, has a notoriously bad nfs server). (perhaps open-solaris is a better option). 'fair-and-balanced' :-)
Re: jar files on NFS instead of DistributedCache
On 4/21/08 2:18 PM, Ted Dunning [EMAIL PROTECTED] wrote: I agree with the fair and balanced part. I always try to keep my clusters fair and balanced! Joydeep should mention his background. In any case, I agree that high-end filers may provide good enough NFS service, but I would also contend that HDFS has been better for me than NFS from generic servers. We take a mixed approach to the NFS problem. For grids that have some sort of service level agreement associated with them, we do not allow NFS connections. The jobs must be reasonably self-contained. For other grids (research, development, etc.), we do allow NFS connections and hope that people don't do stupid things. It is probably worth pointing out that it is much easier for a user to do stupid things with, say, 500 nodes than 5. So we take a much more conservative view for grids we care about. As Joydeep said, the implementation of the stack does make a huge difference. NetApp and Sun are leaps and bounds better than most. In the case of Linux, it has made great strides forward but I'd be leery of using it for the sorts of workloads we have.
Run DfsShell command after your job is complete?
Hello - Is there any way to run a DfsShell command after your job is complete, within that same job run/main class? I.e., after you're done with the maps and the reduces, I want to move data directly out of HDFS into the local file system to load it into a database. Can you run a DfsShell command within your job class, or within the Java class that runs the job? Thanks
Re: Run DfsShell command after your job is complete?
Yes, FsShell.java implements most of the shell commands. You could also use the FileSystem API: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html A simple example: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample Thanks, Lohit - Original Message From: Kayla Jay [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Monday, April 21, 2008 7:40:52 PM Subject: Run DfsShell command after your job is complete? Hello - Is there any way to run a DfsShell command after your job is complete, within that same job run/main class? I.e., after you're done with the maps and the reduces, I want to move data directly out of HDFS into the local file system to load it into a database. Can you run a DfsShell command within your job class, or within the Java class that runs the job? Thanks
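A sketch of the post-job step using the FileSystem API Lohit points at, rather than shelling out; it would live in the driver's main()/run() after the job returns, and the job class and both paths are placeholders.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);   // MyJob and all job setup are placeholders
// ... configure mapper, reducer, input/output paths ...
JobClient.runJob(conf);                    // blocks until the job completes

// Pull the HDFS output down to the local filesystem for the database load.
FileSystem fs = FileSystem.get(conf);
fs.copyToLocalFile(new Path("/user/kayla/output"), new Path("/tmp/job-output"));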
Re: How to instruct Job Tracker to use certain hosts only
On Apr 18, 2008, at 1:52 PM, Htin Hlaing wrote: I would like to run the first job to run on all the compute hosts in the cluster (which is by default) and then, I would like to run the second job with only on a subset of the hosts (due to some licensing issue). One option would be to set mapred.map.max.attempts and mapred.reduce.max.attempts to larger numbers and have the map or reduce fail if it is run on a bad node. When the task re-runs, it will run on a different node. Eventually it will find a valid node. -- Owen
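A sketch of that idea against the old mapred API; the licensed-host list and the retry count are made up, and in practice the list would come from the JobConf or the DistributedCache rather than being hard-coded.

// In the driver:
conf.setInt("mapred.map.max.attempts", 20);       // the default is 4; give the scheduler room to retry elsewhere
conf.setInt("mapred.reduce.max.attempts", 20);

// In the mapper/reducer:
public void configure(JobConf job) {
    try {
        String host = java.net.InetAddress.getLocalHost().getHostName();
        java.util.Set<String> licensed = new java.util.HashSet<String>(
                java.util.Arrays.asList("node07", "node08", "node09"));
        if (!licensed.contains(host)) {
            // Fail fast so the attempt is rescheduled, hopefully on a licensed node.
            throw new RuntimeException("unlicensed host: " + host);
        }
    } catch (java.io.IOException e) {
        throw new RuntimeException(e);
    }
}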