Hadoop Streaming -file option
Hi everyone, could someone explain how the -file option works in Hadoop Streaming? I want to ship a big file to the slaves; how does it get there? Does Hadoop use SCP to copy it? How does Hadoop handle the -file option?
Re: Reducer hangs at 16%
Hi, this looks like a setup issue. See
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Configuration+Files
on how to set this up correctly. As an aside, how are you bringing up the Hadoop daemons (JobTracker, NameNode, TaskTrackers and DataNodes)? Are you bringing them up manually, or are you using bin/start-all.sh?

Jothi

On 2/23/09 3:14 PM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote:

I have set up a distributed environment on Fedora to run Hadoop. Fedora1 is the name node, Fedora2 is the job tracker, and Fedora3 and Fedora4 are task trackers. conf/masters contains the entries Fedora1 and Fedora2, and conf/slaves contains the entries Fedora3 and Fedora4. When I run the sample wordcount example with a single task tracker (either Fedora3 or Fedora4), it works fine and the job completes in a few seconds. However, when I add the other task tracker in conf/slaves, the reducer stops at 16% and the job completes only after 13 minutes. The same problem exists in versions 0.16.4, 0.17.2.1 and 0.18.3. The output on the namenode console is shown below:

[r...@fedora1 hadoop-0.17.2.1Cluster]# bin/hadoop jar samples/wordcount.jar org.myorg.WordCount input output
09/02/19 17:43:18 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/19 17:43:19 INFO mapred.JobClient: Running job: job_200902191741_0001
09/02/19 17:43:20 INFO mapred.JobClient: map 0% reduce 0%
09/02/19 17:43:26 INFO mapred.JobClient: map 50% reduce 0%
09/02/19 17:43:27 INFO mapred.JobClient: map 100% reduce 0%
09/02/19 17:43:35 INFO mapred.JobClient: map 100% reduce 16%
09/02/19 17:56:15 INFO mapred.JobClient: Task Id : task_200902191741_0001_m_01_0, Status : FAILED Too many fetch-failures
09/02/19 17:56:15 WARN mapred.JobClient: Error reading task output: No route to host
09/02/19 17:56:18 WARN mapred.JobClient: Error reading task output: No route to host
09/02/19 17:56:25 INFO mapred.JobClient: map 100% reduce 81%
09/02/19 17:56:26 INFO mapred.JobClient: map 100% reduce 100%
09/02/19 17:56:27 INFO mapred.JobClient: Job complete: job_200902191741_0001
09/02/19 17:56:27 INFO mapred.JobClient: Counters: 16
09/02/19 17:56:27 INFO mapred.JobClient: Job Counters
09/02/19 17:56:27 INFO mapred.JobClient: Launched map tasks=3
09/02/19 17:56:27 INFO mapred.JobClient: Launched reduce tasks=1
09/02/19 17:56:27 INFO mapred.JobClient: Data-local map tasks=3
09/02/19 17:56:27 INFO mapred.JobClient: Map-Reduce Framework
09/02/19 17:56:27 INFO mapred.JobClient: Map input records=5
09/02/19 17:56:27 INFO mapred.JobClient: Map output records=25
09/02/19 17:56:27 INFO mapred.JobClient: Map input bytes=138
09/02/19 17:56:27 INFO mapred.JobClient: Map output bytes=238
09/02/19 17:56:27 INFO mapred.JobClient: Combine input records=25
09/02/19 17:56:27 INFO mapred.JobClient: Combine output records=23
09/02/19 17:56:27 INFO mapred.JobClient: Reduce input groups=23
09/02/19 17:56:27 INFO mapred.JobClient: Reduce input records=23
09/02/19 17:56:27 INFO mapred.JobClient: Reduce output records=23
09/02/19 17:56:27 INFO mapred.JobClient: File Systems
09/02/19 17:56:27 INFO mapred.JobClient: Local bytes read=522
09/02/19 17:56:27 INFO mapred.JobClient: Local bytes written=1177
09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes read=208
09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes written=175

Appreciate any help on this.

Thanks
Jagadesh

DISCLAIMER: This email (including any attachments) is intended for the sole use of the intended recipient/s and may contain material that is CONFIDENTIAL AND PRIVATE COMPANY INFORMATION. Any review or reliance by others or copying or distribution or forwarding of any or all of the contents in this message is STRICTLY PROHIBITED. If you are not the intended recipient, please contact the sender by email and delete all copies; your cooperation in this regard is appreciated.
Re: will record having same key be sent to reducer at the same time
Thanks.

2009/2/23 james warren ja...@rockyou.com:

Hi Nick -

While your reducers may be running concurrently with your mappers, they will not begin the sort and reduce steps until all map tasks have completed. Once they actually begin the reduce stage, they will have received all values for a given key.

cheers,
-jw

On Mon, Feb 23, 2009 at 1:00 AM, Nick Cen cenyo...@gmail.com wrote:

Hi all, if I have a bunch of values that share the same key, and I have more than one reducer running (which guarantees that mappers and reducers run concurrently), will all of these values be sent to the reducer at the same time? Thx

--
http://daily.appspot.com/food/
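The grouping guarantee James describes is easy to see in a streaming-style reducer: after the shuffle sort, all values for a key arrive as one contiguous run on the reducer's stdin. A minimal sketch in Python (the tab-delimited key/value format follows streaming conventions; the sample data and function name are made up for illustration):

```python
import itertools


def reduce_stream(lines):
    """Sum values per key, assuming lines arrive sorted by key
    (as they do on a streaming reducer's stdin after the shuffle)."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    results = {}
    # groupby yields each key exactly once because equal keys are adjacent
    for key, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        results[key] = sum(int(v) for _, v in group)
    return results


if __name__ == "__main__":
    sample = ["a\t1\n", "a\t2\n", "b\t5\n"]  # already sorted by key
    print(reduce_stream(sample))  # prints {'a': 3, 'b': 5}
```

Because the framework sorts before the reduce stage starts, each key is seen by exactly one reducer, exactly once, with its complete value list.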
RE: Reducer hangs at 16%
Hi,

I have set it up as per the documentation on the Hadoop site. On the namenode I am running bin/start-dfs.sh, and on the job tracker I am running bin/start-mapred.sh.

Thanks and Regards
Jagadesh Doddi
Telephone: 040-30657556 Mobile: 9949497414

-----Original Message-----
From: Jothi Padmanabhan [mailto:joth...@yahoo-inc.com]
Sent: Monday, February 23, 2009 4:00 PM
To: core-user@hadoop.apache.org
Subject: Re: Reducer hangs at 16%

[quoted message and job output trimmed; see above]
Re: the question about the common pc?
Tim Wintle wrote:

On Fri, 2009-02-20 at 13:07, Steve Loughran wrote:

I've been doing MapReduce work over small in-memory datasets using Erlang, which works very well in such a context.

I've got some (mainly Python) scripts (that will probably be run with Hadoop Streaming eventually) that I run over multiple CPUs/cores on a single machine by opening the appropriate number of named pipes and using tee and awk to split the workload. Something like:

mkfifo mypipe1
mkfifo mypipe2
awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &
awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &
./get_lots_of_data | tee mypipe1 mypipe2

(wait until it's done, or send a signal from the get_lots_of_data process on completion if it's a cronjob)

sort -m map_out* | ./reducer > reduce_out

This works around the global interpreter lock in Python quite nicely, and doesn't require the people who write the scripts (who may not be programmers) to understand multiple processes etc., just stdin and stdout.

Dumbo provides Python support under Hadoop: http://wiki.github.com/klbostee/dumbo https://issues.apache.org/jira/browse/HADOOP-4304 As well as that, given Hadoop is Java 1.6+, there's no reason why it couldn't support the javax.script engine, with JavaScript working without extra JAR files, and Groovy and Jython once their JARs were put on the classpath. Some work would probably be needed to make it easier to use these languages, and then there are the tests...
Re: Reducer hangs at 16%
OK. I am guessing that your problem arises from having two entries for master. The master should be the node where the JT is run (for start-mapred.sh) and where the NN is run (for start-dfs.sh). This might need a bit more effort to set up. To start with, you might want to try having both the JT and NN on the same machine (the node designated as master) and then try start-all.sh. You need to configure your hadoop-site.xml correctly as well.

Jothi

On 2/23/09 4:36 PM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote:

[quoted message and job output trimmed; see above]
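For the single-master layout Jothi suggests (JobTracker and NameNode on the same node), the relevant hadoop-site.xml entries on every node would look roughly like this. This is a sketch for the 0.18-era configuration: the hostname fedora1 comes from this thread, but the ports 9000/9001 are conventional choices, not values from the original messages.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- HDFS: clients and DataNodes contact the NameNode here -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://fedora1:9000</value>
  </property>
  <!-- MapReduce: TaskTrackers and job clients contact the JobTracker here -->
  <property>
    <name>mapred.job.tracker</name>
    <value>fedora1:9001</value>
  </property>
</configuration>
```

With this in place on all four machines, conf/slaves lists fedora3 and fedora4, and bin/start-all.sh run on fedora1 brings up the NameNode, the JobTracker, and the DataNode/TaskTracker pairs on the slaves.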
RE: Reducer hangs at 16%
Hi,

I have changed the configuration to run the name node and job tracker on the same system. The job is started with bin/start-all.sh on the NN.

With a single slave node, the job completes in 12 seconds, and the console output is shown below:

[r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar org.myorg.WordCount input output1
09/02/23 17:19:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:19:30 INFO mapred.JobClient: Running job: job_200902231717_0001
09/02/23 17:19:31 INFO mapred.JobClient: map 0% reduce 0%
09/02/23 17:19:37 INFO mapred.JobClient: map 100% reduce 0%
09/02/23 17:19:42 INFO mapred.JobClient: Job complete: job_200902231717_0001
09/02/23 17:19:42 INFO mapred.JobClient: Counters: 16
09/02/23 17:19:42 INFO mapred.JobClient: Job Counters
09/02/23 17:19:42 INFO mapred.JobClient: Data-local map tasks=2
09/02/23 17:19:42 INFO mapred.JobClient: Launched reduce tasks=1
09/02/23 17:19:42 INFO mapred.JobClient: Launched map tasks=2
09/02/23 17:19:42 INFO mapred.JobClient: Map-Reduce Framework
09/02/23 17:19:42 INFO mapred.JobClient: Map output records=25
09/02/23 17:19:42 INFO mapred.JobClient: Reduce input records=23
09/02/23 17:19:42 INFO mapred.JobClient: Map output bytes=238
09/02/23 17:19:42 INFO mapred.JobClient: Map input records=5
09/02/23 17:19:42 INFO mapred.JobClient: Combine output records=46
09/02/23 17:19:42 INFO mapred.JobClient: Map input bytes=138
09/02/23 17:19:42 INFO mapred.JobClient: Combine input records=48
09/02/23 17:19:42 INFO mapred.JobClient: Reduce input groups=23
09/02/23 17:19:42 INFO mapred.JobClient: Reduce output records=23
09/02/23 17:19:42 INFO mapred.JobClient: File Systems
09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes written=175
09/02/23 17:19:42 INFO mapred.JobClient: Local bytes written=648
09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes read=208
09/02/23 17:19:42 INFO mapred.JobClient: Local bytes read=281

With two slave nodes, the job completes in 13 minutes, and the console output is shown below:

[r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar org.myorg.WordCount input output2
09/02/23 17:25:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:25:39 INFO mapred.JobClient: Running job: job_200902231722_0001
09/02/23 17:25:40 INFO mapred.JobClient: map 0% reduce 0%
09/02/23 17:25:42 INFO mapred.JobClient: map 50% reduce 0%
09/02/23 17:25:43 INFO mapred.JobClient: map 100% reduce 0%
09/02/23 17:25:58 INFO mapred.JobClient: map 100% reduce 16%
09/02/23 17:38:31 INFO mapred.JobClient: Task Id : attempt_200902231722_0001_m_00_0, Status : FAILED Too many fetch-failures
09/02/23 17:38:31 WARN mapred.JobClient: Error reading task output: No route to host
09/02/23 17:38:31 WARN mapred.JobClient: Error reading task output: No route to host
09/02/23 17:38:43 INFO mapred.JobClient: Job complete: job_200902231722_0001
09/02/23 17:38:43 INFO mapred.JobClient: Counters: 16
09/02/23 17:38:43 INFO mapred.JobClient: Job Counters
09/02/23 17:38:43 INFO mapred.JobClient: Data-local map tasks=3
09/02/23 17:38:43 INFO mapred.JobClient: Launched reduce tasks=1
09/02/23 17:38:43 INFO mapred.JobClient: Launched map tasks=3
09/02/23 17:38:43 INFO mapred.JobClient: Map-Reduce Framework
09/02/23 17:38:43 INFO mapred.JobClient: Map output records=25
09/02/23 17:38:43 INFO mapred.JobClient: Reduce input records=23
09/02/23 17:38:43 INFO mapred.JobClient: Map output bytes=238
09/02/23 17:38:43 INFO mapred.JobClient: Map input records=5
09/02/23 17:38:43 INFO mapred.JobClient: Combine output records=46
09/02/23 17:38:43 INFO mapred.JobClient: Map input bytes=138
09/02/23 17:38:43 INFO mapred.JobClient: Combine input records=48
09/02/23 17:38:43 INFO mapred.JobClient: Reduce input groups=23
09/02/23 17:38:43 INFO mapred.JobClient: Reduce output records=23
09/02/23 17:38:43 INFO mapred.JobClient: File Systems
09/02/23 17:38:43 INFO mapred.JobClient: HDFS bytes written=175
09/02/23 17:38:43 INFO mapred.JobClient: Local bytes written=648
09/02/23 17:38:43 INFO mapred.JobClient: HDFS bytes read=208
09/02/23 17:38:43 INFO mapred.JobClient: Local bytes read=281

Thanks
Jagadesh

-----Original Message-----
From: Jothi Padmanabhan [mailto:joth...@yahoo-inc.com]
Sent: Monday, February 23, 2009 4:57 PM
To: core-user@hadoop.apache.org
Subject: Re: Reducer hangs at 16%

[quoted message trimmed; see above]
Re: Hadoop Streaming -file option
Hadoop does not copy -file arguments over SCP. The streaming client packages each -file argument with the job submission and uploads it to HDFS, and every TaskTracker then pulls it down into the task's local working directory before the task starts. The byte transfer uses Hadoop's own data transfer protocol (DataNodes listen on port 50010 for block transfers), not SSH. For a big file, it is usually better to put it into HDFS once and reference it via the DistributedCache (-cacheFile) rather than re-shipping it on every submission.

Cheers,
Rasit

2009/2/23 Bing TANG whutg...@gmail.com:

[original question quoted above]
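A sketch of the two approaches, assuming the 0.18-era streaming jar path and made-up file and directory names (adjust both to your installation; this is not runnable without a cluster):

```
# Ship a local file with the job: it is bundled with the job submission,
# uploaded to HDFS, and unpacked into each task's working directory.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input /user/me/input -output /user/me/out1 \
  -mapper ./mapper.py -reducer ./reducer.py \
  -file mapper.py -file reducer.py -file lookup.dat

# For a big file, put it in HDFS once and reference it via the
# DistributedCache instead of re-uploading on every submission:
hadoop fs -put lookup.dat /user/me/cache/lookup.dat
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input /user/me/input -output /user/me/out2 \
  -mapper ./mapper.py -reducer ./reducer.py \
  -file mapper.py -file reducer.py \
  -cacheFile hdfs://namenode:9000/user/me/cache/lookup.dat#lookup.dat
```

The #lookup.dat suffix on -cacheFile is the symlink name the task sees in its working directory, so the mapper and reducer can open the file by that relative name on every node.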
Re: Reducer hangs at 16%
Looks like the reducer is able to fetch map output files from the local box but fails to fetch them from the remote box. Can you check that there is no firewall issue and that the /etc/hosts entries are correct?

Amar

Jagadesh_Doddi wrote:

[quoted message and job output trimmed; see above]
Can anyone verify Hadoop FS shell command return codes?
I'm attempting to use the Hadoop FS shell (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a Ruby script. My challenge is that I'm unable to get the return value of the commands I'm invoking. As an example, I try to run get as follows:

hadoop fs -get /user/hadoop/testFile.txt .

From the command line this generally works, but I need to be able to verify that it is working during execution in my Ruby script. The command should return 0 on success and -1 on error. Based on http://pasadenarb.com/2007/03/ruby-shell-commands.html I am using backticks to make the hadoop call and get the return value. Here is a dialogue within irb (Ruby's interactive shell) in which the command was not successful:

irb(main):001:0> `hadoop dfs -get testFile.txt .`
get: null
=> ""

and a dialogue within irb in which the command was successful:

irb(main):010:0> `hadoop dfs -get testFile.txt .`
=> ""

In both cases, neither a 0 nor a -1 appeared as a return value; indeed nothing was returned. Can anyone who is using the FS command shell return values within any scripting language (Ruby, PHP, Perl, ...) please confirm that it is working as expected, or send an example snippet?

Thanks, John
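Backticks return the command's captured stdout as a String, not its exit code; the exit status lands in the $? global (a Process::Status) right after the call. A minimal sketch of the pattern, using the shell builtins true/false as stand-ins for the hadoop invocation so it runs anywhere (substitute `hadoop fs -get ...` for the real thing):

```ruby
# Backticks capture stdout only; $? holds the child's Process::Status.
out = `true`            # stand-in for a hadoop command that succeeds
ok = $?.exitstatus      # 0 on success

out = `false`           # stand-in for a hadoop command that fails
err = $?.exitstatus     # nonzero on failure

puts ok
puts err
```

In the irb transcripts above, the `=> ""` lines are the captured stdout (the "get: null" error text goes to stderr); checking $?.success? or $?.exitstatus immediately after the backtick call gives the zero/nonzero result the script needs.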
RE: Reducer hangs at 16%
It works as long as I use any one of the slave nodes. The moment I add both slave nodes to conf/slaves, it fails. So there is no issue with the firewall or the /etc/hosts entries.

Thanks and Regards
Jagadesh Doddi
Telephone: 040-30657556 Mobile: 9949497414

-----Original Message-----
From: Amar Kamat [mailto:ama...@yahoo-inc.com]
Sent: Monday, February 23, 2009 6:26 PM
To: core-user@hadoop.apache.org
Subject: Re: Reducer hangs at 16%

[quoted message and job output trimmed; see above]
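Note that single-slave runs never exercise slave-to-slave traffic: the reducer fetches map output directly from the other TaskTracker over HTTP, so the "No route to host" errors point at connectivity between the two slaves themselves, which the firewall and /etc/hosts on each slave must still allow. A quick check along the lines Amar suggests (the hostnames are from this thread; port 50060 is assumed here as the default TaskTracker HTTP port):

```
# Run on fedora3 against fedora4, then the reverse on fedora4.
ping -c 1 fedora4                 # name resolution and routing
telnet fedora4 50060 </dev/null   # TaskTracker map-output HTTP port
```

If either direction fails, the reduce stalls at 16% exactly as in the logs, because the copy phase retries the remote fetch until the fetch-failure limit kills the map task.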
CfP 4th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'09)
Apologies if you received multiple copies of this message. = CALL FOR PAPERS 4th Workshop on Virtualization in High-Performance Cloud Computing VHPC'09 as part of Euro-Par 2009, Delft, The Netherlands = Date: August 25, 2009 Euro-Par 2009: http://europar2009.ewi.tudelft.nl/ Workshop URL: http://vhpc.org SUBMISSION DEADLINE: Abstracts: March 12, 2009 Full Paper: June 8, 2009

Scope: Virtualization has become a common abstraction layer in modern data centers, enabling resource owners to manage complex infrastructure independently of their applications. At the same time, virtualization is becoming a driving technology for a broad range of industry-grade IT services. Piloted by the Amazon Elastic Compute Cloud services, the cloud concept includes the notion of a separation between resource owners and users, adding services such as hosted application frameworks and queuing. Utilizing the same infrastructure, clouds carry significant potential for use in high-performance scientific computing. The ability of clouds to provision and release vast computing resources dynamically, at close to the marginal cost of providing the services, is unprecedented in the history of scientific and commercial computing. Distributed computing concepts that leverage federated resource access are popular within the grid community, but have not yet reached the level of deployment previously hoped for. Also, many scientific datacenters have not yet adopted virtualization or cloud concepts. This workshop aims to bring together industrial providers and the scientific community in order to foster discussion, collaboration and mutual exchange of knowledge and experience. The workshop will be one day in length, composed of 20 min paper presentations, each followed by a 10 min discussion session. Presentations may be accompanied by interactive demonstrations. It concludes with a 30 min panel discussion by presenters.
TOPICS Topics include, but are not limited to, the following subjects: - Virtualization in cloud, cluster and grid environments - VM-based cloud performance modeling - Workload characterizations for VM-based environments - Software as a Service (SaaS) - Cloud reliability, fault-tolerance, and security - Cloud, cluster and grid filesystems for VMs - QoS and service level guarantees - Virtualized I/O - VMMs and storage virtualization - Research and education use cases - VM cloud, cluster distribution algorithms - MPI, PVM on virtual machines - Cloud APIs - Cloud load balancing - Hardware support for virtualization - High-performance network virtualization - High-speed interconnects - Bottleneck management - Hypervisor extensions and tools for cluster and grid computing - Network architectures for VM-based environments - VMMs/Hypervisors - Cloud use cases - Performance management and tuning of hosts and guest VMs - Fault tolerant VM environments - VMM performance tuning on various load types - Cloud provisioning - Xen/other VMM cloud/cluster/grid tools - Device access from VMs - Management and deployment of VM-based environments

PAPER SUBMISSION Papers submitted to the workshop will be reviewed by at least two members of the program committee and external reviewers. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author, and must not exceed 10 pages, including tables and figures, at a main font size no smaller than 11 point. Submission of a paper should be regarded as a commitment that, should the paper be accepted, at least one of the authors will register and attend the conference to present the work. Accepted papers will be published in the Springer LNCS series; the format must follow the Springer LNCS style. Initial submissions are in PDF; authors of accepted papers will be asked to provide source files.
Format Guidelines: http://www.springer.de/comp/lncs/authors.html Submission Link: http://edas.info/newPaper.php?c=7364 IMPORTANT DATES March 12 - Abstract submission due June 8 - Full paper submission due July 14 - Acceptance notification August 3 - Camera-ready version due August 25-28 - Conference CHAIR Michael Alexander (chair), Scaled Infrastructure KG, Austria Marcus Hardt (co-chair), Forschungszentrum Karlsruhe, Germany PROGRAM COMMITTEE Padmashree Apparao, Intel Corp., USA Hassan Barada, Khalifa University, UAE Volker Buege, University of Karlsruhe, Germany Isabel Campos, IFCA, Spain Stephen Childs, Trinity College Dublin, Ireland William Gardner, University of Guelph, Canada Derek Groen, UVA, The Netherlands Ahmad Hammad, FZK, Germany Sverre Jarp, CERN, Switzerland Xuxian Jiang, NC State, USA Kenji Kaneda, University of Tokyo, Japan
Re: Reducer hangs at 16%
The fact that it works with one slave node doesn't mean much, because when the slave is alone, it's copying map outputs from itself and thus not going through the firewall. It sounds like the slaves can't open a connection to each other, which could well mean a firewall problem. Can you look at the output of the reduce task (by clicking it in the running tasks column in the web UI and going on to see the last 8k of output)? I imagine it will have fetched data from one slave and will be failing to connect to the other one. On Mon, Feb 23, 2009 at 5:03 AM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote: It works as long as I use any one of the slave nodes. The moment I add both the slave nodes to conf/slaves, it fails. So there is no issue with firewall or /etc/hosts entries. Thanks and Regards Jagadesh Doddi Telephone: 040-30657556 Mobile: 9949497414 -Original Message- From: Amar Kamat [mailto:ama...@yahoo-inc.com] Sent: Monday, February 23, 2009 6:26 PM To: core-user@hadoop.apache.org Subject: Re: Reducer hangs at 16% Looks like the reducer is able to fetch map output files from the local box but fails to fetch them from the remote box. Can you check that there is no firewall issue and that the /etc/hosts entries are correct? Amar Jagadesh_Doddi wrote: Hi I have changed the configuration to run the Name node and job tracker on the same system. The job is started with bin/start-all.sh on the NN. With a single slave node, the job completes in 12 seconds, and the console output is shown below: [r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar org.myorg.WordCount input output1 09/02/23 17:19:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1 09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1 09/02/23 17:19:30 INFO mapred.JobClient: Running job: job_200902231717_0001 09/02/23 17:19:31 INFO mapred.JobClient: map 0% reduce 0% 09/02/23 17:19:37 INFO mapred.JobClient: map 100% reduce 0% 09/02/23 17:19:42 INFO mapred.JobClient: Job complete: job_200902231717_0001 09/02/23 17:19:42 INFO mapred.JobClient: Counters: 16 09/02/23 17:19:42 INFO mapred.JobClient: Job Counters 09/02/23 17:19:42 INFO mapred.JobClient: Data-local map tasks=2 09/02/23 17:19:42 INFO mapred.JobClient: Launched reduce tasks=1 09/02/23 17:19:42 INFO mapred.JobClient: Launched map tasks=2 09/02/23 17:19:42 INFO mapred.JobClient: Map-Reduce Framework 09/02/23 17:19:42 INFO mapred.JobClient: Map output records=25 09/02/23 17:19:42 INFO mapred.JobClient: Reduce input records=23 09/02/23 17:19:42 INFO mapred.JobClient: Map output bytes=238 09/02/23 17:19:42 INFO mapred.JobClient: Map input records=5 09/02/23 17:19:42 INFO mapred.JobClient: Combine output records=46 09/02/23 17:19:42 INFO mapred.JobClient: Map input bytes=138 09/02/23 17:19:42 INFO mapred.JobClient: Combine input records=48 09/02/23 17:19:42 INFO mapred.JobClient: Reduce input groups=23 09/02/23 17:19:42 INFO mapred.JobClient: Reduce output records=23 09/02/23 17:19:42 INFO mapred.JobClient: File Systems 09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes written=175 09/02/23 17:19:42 INFO mapred.JobClient: Local bytes written=648 09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes read=208 09/02/23 17:19:42 INFO mapred.JobClient: Local bytes read=281 With two slave nodes, the job completes in 13 minutes, and the console output is shown below: [r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar org.myorg.WordCount input output2 09/02/23 17:25:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. 
Applications should implement Tool for the same. 09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1 09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1 09/02/23 17:25:39 INFO mapred.JobClient: Running job: job_200902231722_0001 09/02/23 17:25:40 INFO mapred.JobClient: map 0% reduce 0% 09/02/23 17:25:42 INFO mapred.JobClient: map 50% reduce 0% 09/02/23 17:25:43 INFO mapred.JobClient: map 100% reduce 0% 09/02/23 17:25:58 INFO mapred.JobClient: map 100% reduce 16% 09/02/23 17:38:31 INFO mapred.JobClient: Task Id : attempt_200902231722_0001_m_00_0, Status : FAILED Too many fetch-failures 09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to host 09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to host 09/02/23 17:38:43 INFO mapred.JobClient: Job complete: job_200902231722_0001 09/02/23 17:38:43 INFO mapred.JobClient: Counters: 16 09/02/23 17:38:43 INFO mapred.JobClient: Job Counters 09/02/23 17:38:43 INFO mapred.JobClient: Data-local map
Re: Can anyone verify Hadoop FS shell command return codes?
You should distinguish between the output of a command and the return value of the command: usually they are captured in different ways by the interpreters (scripting languages or shells). For example: 1) in Perl the return value is captured by using the system function: $rv = system(cmd); so the $rv variable contains the value returned by cmd. Instead, with backticks you get the output of cmd: $out = `cmd`; 2) in shells (sh/bash/tcsh) the return value is stored in the variable $? (dollar char followed by question-mark char). Instead, the output is again obtained with backticks. I don't know the way in which irb captures the return value: by analogy I would say that backticks are used for capturing the output even in irb. Best Roldano On Mon, Feb 23, 2009 at 02:02:22PM +0100, S D wrote: I'm attempting to use the Hadoop FS shell (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a Ruby script. My challenge is that I'm unable to get the return value of the commands I'm invoking. As an example, I try to run get as follows: hadoop fs -get /user/hadoop/testFile.txt . From the command line this generally works, but I need to be able to verify that it is working during execution in my Ruby script. The command should return 0 on success and -1 on error. Based on http://pasadenarb.com/2007/03/ruby-shell-commands.html I am using backticks to make the hadoop call and get the return value. Here is a dialogue within irb (Ruby's interactive shell) in which the command was not successful: irb(main):001:0 `hadoop dfs -get testFile.txt .` get: null = and a dialogue within irb in which the command was successful: irb(main):010:0 `hadoop dfs -get testFile.txt .` = In both cases, neither a 0 nor a 1 appeared as a return value; indeed nothing was returned. Can anyone who is using the FS command shell return values within any scripting language (Ruby, PHP, Perl, ...) please confirm that it is working as expected or send an example snippet? Thanks, John
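To make the output-vs-exit-status distinction concrete, here is a small self-contained Java sketch (plain JDK, no Hadoop dependency) that captures both channels from a child process. The shell command is a stand-in for the real `hadoop fs -get` invocation; note that a C-level `exit(-1)` shows up to the parent as status 255, not -1.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExitCodeDemo {
    static String output;   // what backticks would capture

    // Runs a command via /bin/sh and returns its exit status
    // (the same value a shell exposes as $?).
    static int run(String cmd) throws Exception {
        Process p = new ProcessBuilder("/bin/sh", "-c", cmd).start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) sb.append(line);
        }
        output = sb.toString();  // captured stdout, a separate channel
        return p.waitFor();      // the return value of the command
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for: hadoop fs -get /user/hadoop/testFile.txt .
        int rc = run("echo get: null; exit 1");
        System.out.println("output = [" + output + "], rc = " + rc);
    }
}
```

In Ruby the same two channels are `` `cmd` `` (output) and `$?.exitstatus` (return value), which is likely what the original poster was missing.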
Re: Hadoop Streaming -file option
On Feb 23, 2009, at 2:01 AM, Bing TANG wrote: Hi, everyone, Could someone tell me the principle of -file when using Hadoop Streaming? I want to ship a big file to the slaves, so how does it work? Does Hadoop use SCP to copy? How does Hadoop deal with the -file option? No, -file just copies the file from the local filesystem to HDFS, and the DistributedCache copies it to the local filesystem of the node on which the map/reduce task runs. Arun
Batching key/value pairs to map
part of my map/reduce process could be greatly sped up by mapping key/value pairs in batches instead of mapping them one by one. I'd like to do the following:

protected abstract void batchMap(OutputCollector<K2, V2> k2V2OutputCollector, Reporter reporter) throws IOException;

public void map(K1 key1, V1 value1, OutputCollector<K2, V2> output, Reporter reporter) throws IOException {
  keys.add(key1.copy());
  values.add(value1.copy());
  if (++currentSize == batchSize) {
    batchMap(output, reporter);
    clear();
  }
}

public void close() throws IOException {
  if (currentSize > 0) {
    // I don't have access to my OutputCollector or Reporter here!
    batchMap(output, reporter);
    clear();
  }
}

Can I safely hang onto my OutputCollector and Reporter from calls to map? I'm currently running Hadoop 0.17.2.1. Is this something I could do in Hadoop 0.19.X?
Re: Batching key/value pairs to map
On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan ji...@indeed.com wrote: part of my map/reduce process could be greatly sped up by mapping key/value pairs in batches instead of mapping them one by one. Can I safely hang onto my OutputCollector and Reporter from calls to map? Yes. You can even use them in the close, so that you can process the last batch of records. *smile* One problem that you will quickly hit is that Hadoop reuses the objects that are passed to map and reduce. So, you'll need to clone them before putting them into the collection. I'm currently running Hadoop 0.17.2.1. Is this something I could do in Hadoop 0.19.X? I don't think any of this changed between 0.17 and 0.19, other than in 0.17 the reduce's inputs were always new objects. In 0.18 and after, the reduce's inputs are reused. -- Owen
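The object-reuse pitfall Owen describes can be shown without any Hadoop classes: the framework hands map() the same mutable key/value instances on every call, so buffering the reference buffers whatever the object holds at flush time. A minimal plain-Java sketch (FakeText is an illustrative stand-in for a reused Writable such as Text, not a real Hadoop class):

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Stand-in for a mutable, framework-reused Writable.
    static final class FakeText {
        private final StringBuilder buf = new StringBuilder();
        void set(String s) { buf.setLength(0); buf.append(s); }
        String get() { return buf.toString(); }
        FakeText copy() { FakeText t = new FakeText(); t.set(get()); return t; }
    }

    // Returns {first element buffered by reference, first element buffered by copy}.
    static String[] demo() {
        FakeText reused = new FakeText();           // the framework's single instance
        List<FakeText> byRef = new ArrayList<>();
        List<FakeText> byCopy = new ArrayList<>();
        for (String s : new String[] {"a", "b", "c"}) {
            reused.set(s);                          // framework overwrites in place
            byRef.add(reused);                      // WRONG: all entries alias one object
            byCopy.add(reused.copy());              // RIGHT: snapshot current contents
        }
        return new String[] { byRef.get(0).get(), byCopy.get(0).get() };
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println("buffered by reference: " + r[0]);  // last value wins
        System.out.println("buffered by copy:      " + r[1]);  // preserved
    }
}
```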
Re: Batching key/value pairs to map
Great, thanks Owen. I actually ran into the object reuse problem a long time ago. The output of my MR processes gets turned into a series of large INSERT statements that wasn't performing unless I batched them in inserts of several K entries. I'm not sure if this is possible, but it would certainly be nice to either: 1) pass the OutputCollector and Reporter to the close() method. 2) Provide accessors to the OutputCollector and the Reporter. Now every single one of my maps is going to have a pair of 1-2 extra no-ops. I'll check to see if that's on the list of outstanding FRs. On Mon, Feb 23, 2009 at 15:30, Owen O'Malley owen.omal...@gmail.com wrote: On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan ji...@indeed.com wrote: part of my map/reduce process could be greatly sped up by mapping key/value pairs in batches instead of mapping them one by one. Can I safely hang onto my OutputCollector and Reporter from calls to map? Yes. You can even use them in the close, so that you can process the last batch of records. *smile* One problem that you will quickly hit is that Hadoop reuses the objects that are passed to map and reduce. So, you'll need to clone them before putting them into the collection. I'm currently running Hadoop 0.17.2.1. Is this something I could do in Hadoop 0.19.X? I don't think any of this changed between 0.17 and 0.19, other than in 0.17 the reduce's inputs were always new objects. In 0.18 and after, the reduce's inputs are reused.
Re: Batching key/value pairs to map
On Feb 23, 2009, at 2:19 PM, Jimmy Wan wrote: I'm not sure if this is possible, but it would certainly be nice to either: 1) pass the OutputCollector and Reporter to the close() method. 2) Provide accessors to the OutputCollector and the Reporter. If you look at the 0.20 branch, which hasn't released yet, there is a new map/reduce api. That api does provide a lot more control. Take a look at Mapper, which provides setup, map, and cleanup hooks: http://tinyurl.com/bquvxq The map method looks like:

/**
 * Called once for each key/value pair in the input split. Most applications
 * should override this, but the default is the identity function.
 */
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException {
  context.write((KEYOUT) key, (VALUEOUT) value);
}

But there is also a run method that drives the task. The default is given below, but it can be overridden by the application.

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

Clearly, in your application you could override run to make a list of 100 key, value pairs or something. -- Owen
Re: Design issue for a problem using Map Reduce
Thanks Sagar... That helps to a certain extent. But is dependency not a common occurrence among equations? Doesn't Hadoop provide a way to solve such equations in parallel? Going in for a sequential calculation might prove to be a major performance degradation given tens of thousands of numbers. Does anyone have any ideas? Thanks. On Sun, Feb 15, 2009 at 1:34 AM, Sagar Naik sn...@attributor.com wrote: Here is one thought: N maps and 1 reduce. Input to map: t, w(t). Output of map: t, w(t)*w(t). I assume t is an integer. So in the case of 1 reducer, you will receive t0, square(w(0)); t1, square(w(1)); t2, square(w(2)); t3, square(w(3)). Note this will be a sorted series on t. In reduce:

static prevF = 0;
reduce(t, square_w_t) {
  f = square_w_t * A + B * prevF;
  output.collect(t, f);
  prevF = f;
}

According to me, the step of B*F(t-1) is inherently sequential. So all we can do is parallelize the A*w(t)*w(t) part. -Sagar some speed wrote: Hello all, I am trying to implement a Map Reduce chain to solve a particular statistics problem. I have come to a point where I have to solve the following type of equation in Hadoop: F(t) = A*w(t)*w(t) + B*F(t-1); Given: F(0)=0, A and B are alpha and beta and their values are known. Now, W is a series of numbers (there could be *a million* or more numbers). To solve the equation in terms of Map Reduce, there are basically 2 issues which I can think of: 1) How will I be able to get the value of F(t-1), since at each step I need the value from the previous iteration? And that is not possible while computing in parallel. 2) The w(t) values have to be read and applied in order also, and, again, that is a problem while computing in parallel. Can someone please help me go about this problem and overcome the issues? Thanks, Sharath
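One more observation on the recurrence itself: because it is linear, it unrolls into a weighted prefix sum, F(t) = sum over i <= t of B^(t-i) * A * w(i)^2. That is why the A*w(t)^2 terms can be computed independently in the maps while the B-weighting is a (parallelizable) scan. A small self-contained Java sketch (plain JDK, not Hadoop code) checking the sequential recurrence against the unrolled form:

```java
public class Recurrence {
    // Sequential form: F(t) = A*w(t)^2 + B*F(t-1), with F before t=0 being 0.
    static double[] sequential(double A, double B, double[] w) {
        double[] F = new double[w.length];
        double prev = 0;
        for (int t = 0; t < w.length; t++) {
            F[t] = A * w[t] * w[t] + B * prev;
            prev = F[t];
        }
        return F;
    }

    // Unrolled (closed) form: F(t) = sum_{i<=t} B^(t-i) * A * w(i)^2.
    // Each A*w(i)^2 term is independent -- this is the map-side work.
    static double closed(double A, double B, double[] w, int t) {
        double sum = 0;
        for (int i = 0; i <= t; i++) sum += Math.pow(B, t - i) * A * w[i] * w[i];
        return sum;
    }

    public static void main(String[] args) {
        double[] w = {1, 2, 3, 4};
        double[] F = sequential(0.5, 0.9, w);
        for (int t = 0; t < w.length; t++)
            System.out.printf("t=%d sequential=%.6f unrolled=%.6f%n",
                              t, F[t], closed(0.5, 0.9, w, t));
    }
}
```

The unrolled form is what a parallel-prefix (scan) implementation would exploit; a single reducer doing Sagar's running update remains the simplest correct approach.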
Re: the question about the common pc?
On Mon, 2009-02-23 at 11:14 +, Steve Loughran wrote: Dumbo provides py support under Hadoop: http://wiki.github.com/klbostee/dumbo https://issues.apache.org/jira/browse/HADOOP-4304 Ooh, nice - I hadn't seen dumbo. That's far cleaner than the python wrapper to streaming I'd hacked together. I'm probably going to be using hadoop more again in the near future so I'll bookmark that, thanks Steve. Personally I only need text based records, so I'm fine using a wrapper around streaming Tim Wintle
mysql metastore problems
Hi, I'm having some problems setting up the metastore using mysql. I've browsed the message archives, but don't see anything that helps. My configuration files look like:

**hive-site.xml**

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
    <description>this is local store</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>default location for Hive tables</description>
  </property>
  <property>
    <name>hive.aux.jars.path</name>
    <value>/home/ryan/hive/branch-0.2/install/custom/</value>
    <description>where custom serdes live</description>
  </property>
</configuration>

**jpox.properties**

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionUserName=hive_user
javax.jdo.option.ConnectionPassword=hive_pass
org.jpox.cache.level2=true
org.jpox.cache.level2.type=SOFT

And then with just the default hive-default.xml file. I haven't worked with JPOX tables before, but I'm under the impression that this will automatically get created from the autoCreateSchema flag. Has anyone had any luck with this? 
I'm getting the following error:

r...@dali:~/hive/branch-0.2/install/bin$ hive
Hive history file=/tmp/ryan/hive_job_log_ryan_200902231722_1088538316.txt
hive> create table test_table (id INT, name STRING);
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Error adding class org.apache.hadoop.hive.metastore.model.MDatabase to list of persistence-managed classes : Table/View 'JPOX_TABLES' does not exist.
java.sql.SQLSyntaxErrorException: Table/View 'JPOX_TABLES' does not exist.
  at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
  at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
  at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
  at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
  at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedPreparedStatement20.<init>(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedPreparedStatement30.<init>(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedPreparedStatement40.<init>(Unknown Source)
  at org.apache.derby.jdbc.Driver40.newEmbedPreparedStatement(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
  at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
  at org.jpox.store.rdbms.SQLController.getStatementForQuery(SQLController.java:324)
  at org.jpox.store.rdbms.SQLController.getStatementForQuery(SQLController.java:263)
  at org.jpox.store.rdbms.table.SchemaTable.hasClass(SchemaTable.java:280)
  at org.jpox.store.rdbms.table.SchemaTable.addClass(SchemaTable.java:222)
  at org.jpox.store.rdbms.SchemaAutoStarter.addClass(SchemaAutoStarter.java:255)
  at org.jpox.store.AbstractStoreManager.registerStoreData(AbstractStoreManager.java:363)
  at org.jpox.store.rdbms.RDBMSManager.access$3000(RDBMSManager.java:171)
  at org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTable(RDBMSManager.java:3001)
  at org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTables(RDBMSManager.java:2804)
  at org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTablesAndValidate(RDBMSManager.java:3098)
  at org.jpox.store.rdbms.RDBMSManager$ClassAdder.run(RDBMSManager.java:2729)
  at org.jpox.store.rdbms.RDBMSManager$MgmtTransaction.execute(RDBMSManager.java:2609)
  at org.jpox.store.rdbms.RDBMSManager.addClasses(RDBMSManager.java:825)
  at org.jpox.store.AbstractStoreManager.addClass(AbstractStoreManager.java:624)
  at org.jpox.store.mapped.MappedStoreManager.getDatastoreClass(MappedStoreManager.java:343)
  at org.jpox.store.rdbms.RDBMSManager.getPropertiesForGenerator(RDBMSManager.java:1630)
  at org.jpox.store.AbstractStoreManager.getStrategyValue(AbstractStoreManager.java:945)
  at org.jpox.ObjectManagerImpl.newObjectId(ObjectManagerImpl.java:2473)
  at org.jpox.state.JDOStateManagerImpl.setIdentity(JDOStateManagerImpl.java:792)
  at
Re: mysql metastore problems
Please ignore the last message; I must have responded to an older hive message, and it went to the Hadoop core mailing list instead. I've reposted it on the hive user list.
Re: Batching key/value pairs to map
We have a MR program that collects once for each token on a line. What types of applications can benefit from batch mapping?
Re: How to use JobConf.setKeyFieldPartitionerOptions() method
Thanks Jason, it works. 2009/2/23 jason hadoop jason.had...@gmail.com For reasons that are not clear, in 19, the partitioner steps one character past the end of the field unless you are very explicit in your key specification. One would assume that -k2 would pick up the second token, even if it was the last field in the key, but -k2,2 is required. As near as I can tell, the -kX syntax means piece X including the separator character, which will of course not be present if this is the last piece. In your case, try setKeyFieldPartitionerOptions("-k 1,1"); I believe it will work. On Sun, Feb 22, 2009 at 12:55 AM, Nick Cen cenyo...@gmail.com wrote: Hi All, Assume the output key from the mapper has the format k1,k2. What I want to do is to use k1 instead of the whole key to partition the output. What parameter value should I provide to setKeyFieldPartitionerOptions()? I have tried -k 1, but it throws an ArrayIndexOutOfBoundsException. Thanks in advance. -- http://daily.appspot.com/food/
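The effect of partitioning on the first key field only can be sketched without Hadoop: extract the field the "-k1,1" spec selects and hash-partition on it, so all records sharing k1 land in the same partition regardless of k2. This is a plain-Java illustration of the contract, not the real KeyFieldBasedPartitioner; the field/method names are illustrative.

```java
public class FieldPartition {
    // Extract field f (1-based) from a tab-separated key,
    // i.e. what a "-kf,f" spec selects.
    static String field(String key, int f) {
        return key.split("\t")[f - 1];
    }

    // Hash-partition on the extracted field, the way the default
    // partitioner hashes the whole key.
    static int getPartition(String keyField, int numPartitions) {
        return (keyField.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String a = "user42\t2009-02-22";   // k1 <tab> k2
        String b = "user42\t2009-02-23";   // same k1, different k2
        // Partitioning on field 1 only: both keys go to the same reducer.
        int pa = getPartition(field(a, 1), 11);
        int pb = getPartition(field(b, 1), 11);
        System.out.println("same partition: " + (pa == pb));
    }
}
```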
Re: Limit number of records or total size in combiner input using jobconf?
Thank you. On Fri, Feb 20, 2009 at 5:34 PM, Chris Douglas chri...@yahoo-inc.com wrote: So here are my questions: (1) is there a jobconf hint to limit the number of records in kviter? I can (and have) made a fix to my code that processes the values in a combiner step in batches (i.e takes N at a go,processes that and repeat), but was wondering if i could just set an option. Approximately and indirectly, yes. You can limit the amount of memory allocated to storing serialized records in memory (io.sort.mb) and the percentage of that space reserved for storing record metadata (io.sort.record.percent, IIRC). That can be used to limit the number of records in each spill, though you may also need to disable the combiner during the merge, where you may run into the same problem. You're almost certainly better off designing your combiner to scale well (as you have), since you'll hit this in the reduce, too. Since this occurred in the MapContext, changing the number of reducers wont help. (2) How does changing the number of reducers help at all? I have 7 machines, so I feel 11 (a prime close to 7, why a prime?) is good enough (some machines are 16GB others 32GB) Your combiner will look at all the records for a partition and only those records in a partition. If your partitioner distributes your records evenly in a particular spill, then increasing the total number of partitions will decrease the number of records your combiner considers in each call. For most partitioners, whether the number of reducers is prime should be irrelevant. -C
hdfs disappears
Hi everyone! I am using Hadoop Core (version 0.19.0), OS: Ubuntu 8.04, on one single machine (for testing purposes). Every time I shut down my computer and turn it on again, I can't access the virtual distributed file system just by running {$HADOOP_HOME}/bin/start-all.sh. All the data has disappeared, and I have to reformat the file system (using {$HADOOP_HOME}/bin/hadoop namenode -format) before start-all.sh. Can anyone explain how to fix this problem? Thanks in advance. Vu Nguyen.
Re: hdfs disappears
Hello, Where are you saving your data? If it's being written into /tmp, it will be deleted every time you restart your computer. I believe writing into /tmp is the default for Hadoop unless you changed it in hadoop-site.xml. Brian On Feb 23, 2009, at 10:00 PM, Anh Vũ Nguyễn wrote: Hi everyone! I am using Hadoop Core (version 0.19.0), os : Ubuntu 8.04, on one single machine (for testing purpose). Everytime I shutdown my computer and turn on it again, I can't access the virtual distributed file system just by command {$HADOOP_HOME}/bin/start-all.sh. All the data has disappeared, and I have to reformat the file system (using {$HADOOP_HOME}/bin/hadoop namenode -format) before start-all.sh. Can any one explain me how to fix this problem? Thanks in advance. Vu Nguyen.
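A minimal hadoop-site.xml override along the lines Brian suggests, pointing Hadoop's working directory at a persistent location. hadoop.tmp.dir is the base path under which the namenode and datanode directories default; the value shown is illustrative, any directory that survives reboots works:

```xml
<!-- hadoop-site.xml: keep HDFS data out of /tmp so it survives reboots.
     The path below is an example; substitute a persistent directory. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hadoop-data</value>
</property>
```

After changing this you will need to reformat the namenode once, since the old metadata in /tmp is gone anyway.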
Re: hdfs disappears
Exactly the same thing happened to me, and Brian gave the same answer. What if the default is changed to the user's home directory somewhere? On Mon, Feb 23, 2009 at 10:05 PM, Brian Bockelman bbock...@cse.unl.eduwrote: Hello, Where are you saving your data? If it's being written into /tmp, it will be deleted every time you restart your computer. I believe writing into /tmp is the default for Hadoop unless you changed it in hadoop-site.xml. Brian On Feb 23, 2009, at 10:00 PM, Anh Vũ Nguyễn wrote: Hi everyone! I am using Hadoop Core (version 0.19.0), os : Ubuntu 8.04, on one single machine (for testing purpose). Everytime I shutdown my computer and turn on it again, I can't access the virtual distributed file system just by command {$HADOOP_HOME}/bin/start-all.sh. All the data has disappeared, and I have to reformat the file system (using {$HADOOP_HOME}/bin/hadoop namenode -format) before start-all.sh. Can any one explain me how to fix this problem? Thanks in advance. Vu Nguyen.