Hadoop Streaming -file option

2009-02-23 Thread Bing TANG
Hi, everyone,
Could someone tell me the principle of the -file option when using Hadoop
Streaming? I want to ship a big file to the slaves, so how does it work?

Does Hadoop use SCP to copy? How does Hadoop deal with the -file option?




Re: Reducer hangs at 16%

2009-02-23 Thread Jothi Padmanabhan
Hi,

This looks like a setup issue. See
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Configuration+Files
on how to set this up correctly.

As an aside, how are you bringing up the hadoop daemons (JobTracker,
Namenode, TT and Datanodes)?  Are you manually bringing them up or are you
using bin/start-all.sh?

Jothi


On 2/23/09 3:14 PM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote:

 I have set up a distributed environment on Fedora OS to run Hadoop.
 System Fedora1 is the name node, Fedora2 is the job tracker, and Fedora3 and Fedora4
 are task trackers.
 Conf/masters contains the entries Fedora1, Fedors2, and conf/slaves contains
 the entries Fedora3, Fedora4.
 When I run the sample wordcount example with a single task tracker (either
 Fedora3 or Fedora4), it works fine and the job completes in a few seconds.
 However, when I add the other task tracker in conf/slaves, the reducer stops at
 16% and the job completes after 13 minutes.
 The same problem exists in versions 0.16.4, 0.17.2.1 and 0.18.3. The output on the
 namenode console is shown below:
 
 [r...@fedora1 hadoop-0.17.2.1Cluster]# bin/hadoop jar samples/wordcount.jar
 org.myorg.WordCount input output
 09/02/19 17:43:18 INFO mapred.FileInputFormat: Total input paths to process :
 1
 09/02/19 17:43:19 INFO mapred.JobClient: Running job: job_200902191741_0001
 09/02/19 17:43:20 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/19 17:43:26 INFO mapred.JobClient:  map 50% reduce 0%
 09/02/19 17:43:27 INFO mapred.JobClient:  map 100% reduce 0%
 09/02/19 17:43:35 INFO mapred.JobClient:  map 100% reduce 16%
 09/02/19 17:56:15 INFO mapred.JobClient: Task Id :
 task_200902191741_0001_m_01_0, Status : FAILED
 Too many fetch-failures
 09/02/19 17:56:15 WARN mapred.JobClient: Error reading task outputNo route to
 host
 09/02/19 17:56:18 WARN mapred.JobClient: Error reading task outputNo route to
 host
 09/02/19 17:56:25 INFO mapred.JobClient:  map 100% reduce 81%
 09/02/19 17:56:26 INFO mapred.JobClient:  map 100% reduce 100%
 09/02/19 17:56:27 INFO mapred.JobClient: Job complete: job_200902191741_0001
 09/02/19 17:56:27 INFO mapred.JobClient: Counters: 16
 09/02/19 17:56:27 INFO mapred.JobClient:   Job Counters
 09/02/19 17:56:27 INFO mapred.JobClient: Launched map tasks=3
 09/02/19 17:56:27 INFO mapred.JobClient: Launched reduce tasks=1
 09/02/19 17:56:27 INFO mapred.JobClient: Data-local map tasks=3
 09/02/19 17:56:27 INFO mapred.JobClient:   Map-Reduce Framework
 09/02/19 17:56:27 INFO mapred.JobClient: Map input records=5
 09/02/19 17:56:27 INFO mapred.JobClient: Map output records=25
 09/02/19 17:56:27 INFO mapred.JobClient: Map input bytes=138
 09/02/19 17:56:27 INFO mapred.JobClient: Map output bytes=238
 09/02/19 17:56:27 INFO mapred.JobClient: Combine input records=25
 09/02/19 17:56:27 INFO mapred.JobClient: Combine output records=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce input groups=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce input records=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce output records=23
 09/02/19 17:56:27 INFO mapred.JobClient:   File Systems
 09/02/19 17:56:27 INFO mapred.JobClient: Local bytes read=522
 09/02/19 17:56:27 INFO mapred.JobClient: Local bytes written=1177
 09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes read=208
 09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes written=175
 
 Appreciate any help on this.
 
 Thanks
 
 Jagadesh
 



Re: will record having same key be sent to reducer at the same time

2009-02-23 Thread Nick Cen
Thanks.

2009/2/23 james warren ja...@rockyou.com

 Hi Nick -
 While your reducers may be running concurrently with your mappers, they
 will
 not begin the sort and reduce steps until all map tasks have completed.
  Once they actually begin the reduce stage, they will have received all
 values for a given key.

 cheers,
 -jw

 On Mon, Feb 23, 2009 at 1:00 AM, Nick Cen cenyo...@gmail.com wrote:

  Hi all,
 
  If I have a bunch of values that share the same key, and I have more than
  one reducer running (which guarantees that mappers and reducers are running
  concurrently), will all these values be sent to the reducer at the same
  time?
  Thx
 
  --
  http://daily.appspot.com/food/
 




-- 
http://daily.appspot.com/food/


RE: Reducer hangs at 16%

2009-02-23 Thread Jagadesh_Doddi
Hi

I have set it up as per the documentation on the Hadoop site.
On the namenode, I am running bin/start-dfs.sh, and on the job tracker, I am running
bin/start-mapred.sh

Thanks and Regards

Jagadesh Doddi
Telephone: 040-30657556
Mobile: 9949497414



-Original Message-
From: Jothi Padmanabhan [mailto:joth...@yahoo-inc.com]
Sent: Monday, February 23, 2009 4:00 PM
To: core-user@hadoop.apache.org
Subject: Re: Reducer hangs at 16%

Hi,

This looks like a set up issue. See
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Configuration+
Files
On how to set this up correctly.

As an aside, how are you bringing up the hadoop daemons (JobTracker,
Namenode, TT and Datanodes)?  Are you manually bringing them up or are you
using bin/start-all.sh?

Jothi


On 2/23/09 3:14 PM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote:

 I have setup a distributed environment on Fedora OS to run Hadoop.
 System Fedora1 is the name node, Fedora2 is Job tracker, Fedora3 and Fedora4
 are task trackers.
 Conf/masters contains the entries Fedora1, Fedors2, and conf/slaves contains
 the entries Fedora3, Fedora4.
 When I run the sample wordcount example with single task tracker (either
 Fedora3 or Fedora4), it works fine and the job completes in a few seconds.
 However, when I add the other task tracker in conf/slaves, the reducer stop at
 16% and the job completes after 13 minutes.
 The same problem exists in versions 16.4, 17.2.1 and 18.3. The output on the
 namenode console is shown below:

 [r...@fedora1 hadoop-0.17.2.1Cluster]# bin/hadoop jar samples/wordcount.jar
 org.myorg.WordCount input output
 09/02/19 17:43:18 INFO mapred.FileInputFormat: Total input paths to process :
 1
 09/02/19 17:43:19 INFO mapred.JobClient: Running job: job_200902191741_0001
 09/02/19 17:43:20 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/19 17:43:26 INFO mapred.JobClient:  map 50% reduce 0%
 09/02/19 17:43:27 INFO mapred.JobClient:  map 100% reduce 0%
 09/02/19 17:43:35 INFO mapred.JobClient:  map 100% reduce 16%
 09/02/19 17:56:15 INFO mapred.JobClient: Task Id :
 task_200902191741_0001_m_01_0, Status : FAILED
 Too many fetch-failures
 09/02/19 17:56:15 WARN mapred.JobClient: Error reading task outputNo route to
 host
 09/02/19 17:56:18 WARN mapred.JobClient: Error reading task outputNo route to
 host
 09/02/19 17:56:25 INFO mapred.JobClient:  map 100% reduce 81%
 09/02/19 17:56:26 INFO mapred.JobClient:  map 100% reduce 100%
 09/02/19 17:56:27 INFO mapred.JobClient: Job complete: job_200902191741_0001
 09/02/19 17:56:27 INFO mapred.JobClient: Counters: 16
 09/02/19 17:56:27 INFO mapred.JobClient:   Job Counters
 09/02/19 17:56:27 INFO mapred.JobClient: Launched map tasks=3
 09/02/19 17:56:27 INFO mapred.JobClient: Launched reduce tasks=1
 09/02/19 17:56:27 INFO mapred.JobClient: Data-local map tasks=3
 09/02/19 17:56:27 INFO mapred.JobClient:   Map-Reduce Framework
 09/02/19 17:56:27 INFO mapred.JobClient: Map input records=5
 09/02/19 17:56:27 INFO mapred.JobClient: Map output records=25
 09/02/19 17:56:27 INFO mapred.JobClient: Map input bytes=138
 09/02/19 17:56:27 INFO mapred.JobClient: Map output bytes=238
 09/02/19 17:56:27 INFO mapred.JobClient: Combine input records=25
 09/02/19 17:56:27 INFO mapred.JobClient: Combine output records=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce input groups=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce input records=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce output records=23
 09/02/19 17:56:27 INFO mapred.JobClient:   File Systems
 09/02/19 17:56:27 INFO mapred.JobClient: Local bytes read=522
 09/02/19 17:56:27 INFO mapred.JobClient: Local bytes written=1177
 09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes read=208
 09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes written=175

 Appreciate any help on this.

 Thanks

 Jagadesh






Re: the question about the common pc?

2009-02-23 Thread Steve Loughran

Tim Wintle wrote:

On Fri, 2009-02-20 at 13:07 +, Steve Loughran wrote:
I've been doing MapReduce work over small in-memory datasets 
using Erlang,  which works very well in such a context.


I've got some (mainly python) scripts (that will probably be run with
hadoop streaming eventually) that I run over multiple cpus/cores on a
single machine by opening the appropriate number of named pipes and
using tee and awk to split the workload

something like


mkfifo mypipe1
mkfifo mypipe2
awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &

awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &

./get_lots_of_data | tee mypipe1 > mypipe2


(wait until it's done... or send a signal from the get_lots_of_data
process on completion if it's a cronjob)


sort -m map_out* | ./reducer > reduce_out


works around the global interpreter lock in python quite nicely and
doesn't need people that write the scripts (who may not be programmers)
to understand multiple processes etc, just stdin and stdout.



Dumbo provides py support under Hadoop:
 http://wiki.github.com/klbostee/dumbo
 https://issues.apache.org/jira/browse/HADOOP-4304

as well as that, given Hadoop is java1.6+, there's no reason why it 
couldn't support the javax.script engine, with JavaScript working 
without extra JAR files, groovy and jython once their JARs were stuck on 
the classpath. Some work would probably be needed to make it easier to 
use these languages, and then there are the tests...


Re: Reducer hangs at 16%

2009-02-23 Thread Jothi Padmanabhan
OK. I am guessing that your problem arises from having two entries for
master. The master should be the node where the JT is run (for
start-mapred.sh) and the NN is run (for start-dfs.sh). This might need a bit
more effort to set up. To start with, you might want to try out having both
the JT and NN on the same machine (the node designated as master) and then
try start-all.sh. You need to configure your hadoop-site.xml correctly as
well.
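
For reference, a minimal hadoop-site.xml for that kind of single-master setup
might look roughly like the following (the hostname and ports here are
illustrative, not taken from this thread):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://fedora1:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>fedora1:9001</value>
  </property>
</configuration>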

Jothi




On 2/23/09 4:36 PM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote:

 Hi
 
 I have setup as per the documentation in hadoop site.
 On namenode, I am running bin/start-dfs.sh and on job tracker, I am running
 bin/start-mapred.sh
 
 Thanks and Regards
 
 Jagadesh Doddi
 Telephone: 040-30657556
 Mobile: 9949497414
 
 
 
 -Original Message-
 From: Jothi Padmanabhan [mailto:joth...@yahoo-inc.com]
 Sent: Monday, February 23, 2009 4:00 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Reducer hangs at 16%
 
 Hi,
 
 This looks like a set up issue. See
 http://hadoop.apache.org/core/docs/current/cluster_setup.html#Configuration+
 Files
 On how to set this up correctly.
 
 As an aside, how are you bringing up the hadoop daemons (JobTracker,
 Namenode, TT and Datanodes)?  Are you manually bringing them up or are you
 using bin/start-all.sh?
 
 Jothi
 
 
 On 2/23/09 3:14 PM, Jagadesh_Doddi jagadesh_do...@satyam.com wrote:
 
 I have setup a distributed environment on Fedora OS to run Hadoop.
 System Fedora1 is the name node, Fedora2 is Job tracker, Fedora3 and Fedora4
 are task trackers.
 Conf/masters contains the entries Fedora1, Fedors2, and conf/slaves contains
 the entries Fedora3, Fedora4.
 When I run the sample wordcount example with single task tracker (either
 Fedora3 or Fedora4), it works fine and the job completes in a few seconds.
 However, when I add the other task tracker in conf/slaves, the reducer stop
 at
 16% and the job completes after 13 minutes.
 The same problem exists in versions 16.4, 17.2.1 and 18.3. The output on the
 namenode console is shown below:
 
 [r...@fedora1 hadoop-0.17.2.1Cluster]# bin/hadoop jar samples/wordcount.jar
 org.myorg.WordCount input output
 09/02/19 17:43:18 INFO mapred.FileInputFormat: Total input paths to process :
 1
 09/02/19 17:43:19 INFO mapred.JobClient: Running job: job_200902191741_0001
 09/02/19 17:43:20 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/19 17:43:26 INFO mapred.JobClient:  map 50% reduce 0%
 09/02/19 17:43:27 INFO mapred.JobClient:  map 100% reduce 0%
 09/02/19 17:43:35 INFO mapred.JobClient:  map 100% reduce 16%
 09/02/19 17:56:15 INFO mapred.JobClient: Task Id :
 task_200902191741_0001_m_01_0, Status : FAILED
 Too many fetch-failures
 09/02/19 17:56:15 WARN mapred.JobClient: Error reading task outputNo route to
 host
 09/02/19 17:56:18 WARN mapred.JobClient: Error reading task outputNo route to
 host
 09/02/19 17:56:25 INFO mapred.JobClient:  map 100% reduce 81%
 09/02/19 17:56:26 INFO mapred.JobClient:  map 100% reduce 100%
 09/02/19 17:56:27 INFO mapred.JobClient: Job complete: job_200902191741_0001
 09/02/19 17:56:27 INFO mapred.JobClient: Counters: 16
 09/02/19 17:56:27 INFO mapred.JobClient:   Job Counters
 09/02/19 17:56:27 INFO mapred.JobClient: Launched map tasks=3
 09/02/19 17:56:27 INFO mapred.JobClient: Launched reduce tasks=1
 09/02/19 17:56:27 INFO mapred.JobClient: Data-local map tasks=3
 09/02/19 17:56:27 INFO mapred.JobClient:   Map-Reduce Framework
 09/02/19 17:56:27 INFO mapred.JobClient: Map input records=5
 09/02/19 17:56:27 INFO mapred.JobClient: Map output records=25
 09/02/19 17:56:27 INFO mapred.JobClient: Map input bytes=138
 09/02/19 17:56:27 INFO mapred.JobClient: Map output bytes=238
 09/02/19 17:56:27 INFO mapred.JobClient: Combine input records=25
 09/02/19 17:56:27 INFO mapred.JobClient: Combine output records=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce input groups=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce input records=23
 09/02/19 17:56:27 INFO mapred.JobClient: Reduce output records=23
 09/02/19 17:56:27 INFO mapred.JobClient:   File Systems
 09/02/19 17:56:27 INFO mapred.JobClient: Local bytes read=522
 09/02/19 17:56:27 INFO mapred.JobClient: Local bytes written=1177
 09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes read=208
 09/02/19 17:56:27 INFO mapred.JobClient: HDFS bytes written=175
 
 Appreciate any help on this.
 
 Thanks
 
 Jagadesh
 
 
 
 

RE: Reducer hangs at 16%

2009-02-23 Thread Jagadesh_Doddi
Hi

I have changed the configuration to run Name node and job tracker on the same 
system.
The job is started with bin/start-all.sh on NN
With a single slave node, the job completes in 12 seconds, and the console 
output is shown below:

[r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar 
org.myorg.WordCount input output1
09/02/23 17:19:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:19:30 INFO mapred.JobClient: Running job: job_200902231717_0001
09/02/23 17:19:31 INFO mapred.JobClient:  map 0% reduce 0%
09/02/23 17:19:37 INFO mapred.JobClient:  map 100% reduce 0%
09/02/23 17:19:42 INFO mapred.JobClient: Job complete: job_200902231717_0001
09/02/23 17:19:42 INFO mapred.JobClient: Counters: 16
09/02/23 17:19:42 INFO mapred.JobClient:   Job Counters
09/02/23 17:19:42 INFO mapred.JobClient: Data-local map tasks=2
09/02/23 17:19:42 INFO mapred.JobClient: Launched reduce tasks=1
09/02/23 17:19:42 INFO mapred.JobClient: Launched map tasks=2
09/02/23 17:19:42 INFO mapred.JobClient:   Map-Reduce Framework
09/02/23 17:19:42 INFO mapred.JobClient: Map output records=25
09/02/23 17:19:42 INFO mapred.JobClient: Reduce input records=23
09/02/23 17:19:42 INFO mapred.JobClient: Map output bytes=238
09/02/23 17:19:42 INFO mapred.JobClient: Map input records=5
09/02/23 17:19:42 INFO mapred.JobClient: Combine output records=46
09/02/23 17:19:42 INFO mapred.JobClient: Map input bytes=138
09/02/23 17:19:42 INFO mapred.JobClient: Combine input records=48
09/02/23 17:19:42 INFO mapred.JobClient: Reduce input groups=23
09/02/23 17:19:42 INFO mapred.JobClient: Reduce output records=23
09/02/23 17:19:42 INFO mapred.JobClient:   File Systems
09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes written=175
09/02/23 17:19:42 INFO mapred.JobClient: Local bytes written=648
09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes read=208
09/02/23 17:19:42 INFO mapred.JobClient: Local bytes read=281

With two slave nodes, the job completes in 13 minutes, and the console output 
is shown below:

[r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar 
org.myorg.WordCount input output2
09/02/23 17:25:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:25:39 INFO mapred.JobClient: Running job: job_200902231722_0001
09/02/23 17:25:40 INFO mapred.JobClient:  map 0% reduce 0%
09/02/23 17:25:42 INFO mapred.JobClient:  map 50% reduce 0%
09/02/23 17:25:43 INFO mapred.JobClient:  map 100% reduce 0%
09/02/23 17:25:58 INFO mapred.JobClient:  map 100% reduce 16%
09/02/23 17:38:31 INFO mapred.JobClient: Task Id : 
attempt_200902231722_0001_m_00_0, Status : FAILED
Too many fetch-failures
09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to 
host
09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to 
host
09/02/23 17:38:43 INFO mapred.JobClient: Job complete: job_200902231722_0001
09/02/23 17:38:43 INFO mapred.JobClient: Counters: 16
09/02/23 17:38:43 INFO mapred.JobClient:   Job Counters
09/02/23 17:38:43 INFO mapred.JobClient: Data-local map tasks=3
09/02/23 17:38:43 INFO mapred.JobClient: Launched reduce tasks=1
09/02/23 17:38:43 INFO mapred.JobClient: Launched map tasks=3
09/02/23 17:38:43 INFO mapred.JobClient:   Map-Reduce Framework
09/02/23 17:38:43 INFO mapred.JobClient: Map output records=25
09/02/23 17:38:43 INFO mapred.JobClient: Reduce input records=23
09/02/23 17:38:43 INFO mapred.JobClient: Map output bytes=238
09/02/23 17:38:43 INFO mapred.JobClient: Map input records=5
09/02/23 17:38:43 INFO mapred.JobClient: Combine output records=46
09/02/23 17:38:43 INFO mapred.JobClient: Map input bytes=138
09/02/23 17:38:43 INFO mapred.JobClient: Combine input records=48
09/02/23 17:38:43 INFO mapred.JobClient: Reduce input groups=23
09/02/23 17:38:43 INFO mapred.JobClient: Reduce output records=23
09/02/23 17:38:43 INFO mapred.JobClient:   File Systems
09/02/23 17:38:43 INFO mapred.JobClient: HDFS bytes written=175
09/02/23 17:38:43 INFO mapred.JobClient: Local bytes written=648
09/02/23 17:38:43 INFO mapred.JobClient: HDFS bytes read=208
09/02/23 17:38:43 INFO mapred.JobClient: Local bytes read=281

Thanks

Jagadesh



-Original Message-
From: Jothi Padmanabhan [mailto:joth...@yahoo-inc.com]
Sent: Monday, February 23, 2009 4:57 PM
To: core-user@hadoop.apache.org
Subject: Re: Reducer hangs at 16%

OK. I am guessing that your problem arises from 

Re: Hadoop Streaming -file option

2009-02-23 Thread Rasit OZDAS
Hadoop uses RMI for file copy operations.
Clients listen on port 50010 for this operation.
I assume it's sending the file as a byte stream.

Cheers,
Rasit

2009/2/23 Bing TANG whutg...@gmail.com

 Hi, everyone,
 Could someone tell me the principle of the -file option when using Hadoop
 Streaming? I want to ship a big file to the slaves, so how does it work?

 Does Hadoop use SCP to copy? How does Hadoop deal with the -file option?





-- 
M. Raşit ÖZDAŞ


Re: Reducer hangs at 16%

2009-02-23 Thread Amar Kamat
Looks like the reducer is able to fetch map output files from the local
box but fails to fetch them from the remote box. Can you check that there is
no firewall issue and that the /etc/hosts entries are correct?

Amar
Jagadesh_Doddi wrote:

Hi

I have changed the configuration to run Name node and job tracker on the same 
system.
The job is started with bin/start-all.sh on NN
With a single slave node, the job completes in 12 seconds, and the console 
output is shown below:

[r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar 
org.myorg.WordCount input output1
09/02/23 17:19:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:19:30 INFO mapred.JobClient: Running job: job_200902231717_0001
09/02/23 17:19:31 INFO mapred.JobClient:  map 0% reduce 0%
09/02/23 17:19:37 INFO mapred.JobClient:  map 100% reduce 0%
09/02/23 17:19:42 INFO mapred.JobClient: Job complete: job_200902231717_0001
09/02/23 17:19:42 INFO mapred.JobClient: Counters: 16
09/02/23 17:19:42 INFO mapred.JobClient:   Job Counters
09/02/23 17:19:42 INFO mapred.JobClient: Data-local map tasks=2
09/02/23 17:19:42 INFO mapred.JobClient: Launched reduce tasks=1
09/02/23 17:19:42 INFO mapred.JobClient: Launched map tasks=2
09/02/23 17:19:42 INFO mapred.JobClient:   Map-Reduce Framework
09/02/23 17:19:42 INFO mapred.JobClient: Map output records=25
09/02/23 17:19:42 INFO mapred.JobClient: Reduce input records=23
09/02/23 17:19:42 INFO mapred.JobClient: Map output bytes=238
09/02/23 17:19:42 INFO mapred.JobClient: Map input records=5
09/02/23 17:19:42 INFO mapred.JobClient: Combine output records=46
09/02/23 17:19:42 INFO mapred.JobClient: Map input bytes=138
09/02/23 17:19:42 INFO mapred.JobClient: Combine input records=48
09/02/23 17:19:42 INFO mapred.JobClient: Reduce input groups=23
09/02/23 17:19:42 INFO mapred.JobClient: Reduce output records=23
09/02/23 17:19:42 INFO mapred.JobClient:   File Systems
09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes written=175
09/02/23 17:19:42 INFO mapred.JobClient: Local bytes written=648
09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes read=208
09/02/23 17:19:42 INFO mapred.JobClient: Local bytes read=281

With two slave nodes, the job completes in 13 minutes, and the console output 
is shown below:

[r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar 
org.myorg.WordCount input output2
09/02/23 17:25:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 1
09/02/23 17:25:39 INFO mapred.JobClient: Running job: job_200902231722_0001
09/02/23 17:25:40 INFO mapred.JobClient:  map 0% reduce 0%
09/02/23 17:25:42 INFO mapred.JobClient:  map 50% reduce 0%
09/02/23 17:25:43 INFO mapred.JobClient:  map 100% reduce 0%
09/02/23 17:25:58 INFO mapred.JobClient:  map 100% reduce 16%
09/02/23 17:38:31 INFO mapred.JobClient: Task Id : 
attempt_200902231722_0001_m_00_0, Status : FAILED
Too many fetch-failures
09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to 
host
09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to 
host
09/02/23 17:38:43 INFO mapred.JobClient: Job complete: job_200902231722_0001
09/02/23 17:38:43 INFO mapred.JobClient: Counters: 16
09/02/23 17:38:43 INFO mapred.JobClient:   Job Counters
09/02/23 17:38:43 INFO mapred.JobClient: Data-local map tasks=3
09/02/23 17:38:43 INFO mapred.JobClient: Launched reduce tasks=1
09/02/23 17:38:43 INFO mapred.JobClient: Launched map tasks=3
09/02/23 17:38:43 INFO mapred.JobClient:   Map-Reduce Framework
09/02/23 17:38:43 INFO mapred.JobClient: Map output records=25
09/02/23 17:38:43 INFO mapred.JobClient: Reduce input records=23
09/02/23 17:38:43 INFO mapred.JobClient: Map output bytes=238
09/02/23 17:38:43 INFO mapred.JobClient: Map input records=5
09/02/23 17:38:43 INFO mapred.JobClient: Combine output records=46
09/02/23 17:38:43 INFO mapred.JobClient: Map input bytes=138
09/02/23 17:38:43 INFO mapred.JobClient: Combine input records=48
09/02/23 17:38:43 INFO mapred.JobClient: Reduce input groups=23
09/02/23 17:38:43 INFO mapred.JobClient: Reduce output records=23
09/02/23 17:38:43 INFO mapred.JobClient:   File Systems
09/02/23 17:38:43 INFO mapred.JobClient: HDFS bytes written=175
09/02/23 17:38:43 INFO mapred.JobClient: Local bytes written=648
09/02/23 17:38:43 INFO mapred.JobClient: HDFS bytes read=208
09/02/23 17:38:43 INFO mapred.JobClient: Local bytes read=281

Thanks

Jagadesh



-Original 

Can anyone verify Hadoop FS shell command return codes?

2009-02-23 Thread S D
I'm attempting to use Hadoop FS shell
(http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a ruby script. My
challenge is that I'm unable to get the function return value of the
commands I'm invoking. As an example, I try to run get as follows

hadoop fs -get /user/hadoop/testFile.txt .

From the command line this generally works but I need to be able to verify
that it is working during execution in my ruby script. The command should
return 0 on success and -1 on error. Based on

http://pasadenarb.com/2007/03/ruby-shell-commands.html

I am using backticks to make the hadoop call and get the return value. Here
is a dialogue within irb (Ruby's interactive shell) in which the command was
not successful:

irb(main):001:0> `hadoop dfs -get testFile.txt .`
get: null
=> ""

and a dialogue within irb in which the command was successful

irb(main):010:0> `hadoop dfs -get testFile.txt .`
=> ""

In both cases, neither a 0 nor a 1 appeared as a return value; indeed
nothing was returned. Can anyone who is using the FS shell's return
values within any scripting language (Ruby, PHP, Perl, ...) please confirm
that it is working as expected or send an example snippet?

Thanks,
John


RE: Reducer hangs at 16%

2009-02-23 Thread Jagadesh_Doddi
It works as long as I use any one of the slave nodes.
The moment I add both the slave nodes to conf/slaves, it fails.
So there is no issue with firewall or /etc/hosts entries.

Thanks and Regards

Jagadesh Doddi
Telephone: 040-30657556
Mobile: 9949497414



-Original Message-
From: Amar Kamat [mailto:ama...@yahoo-inc.com]
Sent: Monday, February 23, 2009 6:26 PM
To: core-user@hadoop.apache.org
Subject: Re: Reducer hangs at 16%

Looks like the reducer is able to fetch map output files from the local
box but fails to fetch it from the remote box. Can you check if there is
no firewall issue or /etc/hosts entries are correct?
Amar
Jagadesh_Doddi wrote:
 Hi

 I have changed the configuration to run Name node and job tracker on the same 
 system.
 The job is started with bin/start-all.sh on NN
 With a single slave node, the job completes in 12 seconds, and the console 
 output is shown below:

 [r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar 
 org.myorg.WordCount input output1
 09/02/23 17:19:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool for the same.
 09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 09/02/23 17:19:30 INFO mapred.JobClient: Running job: job_200902231717_0001
 09/02/23 17:19:31 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/23 17:19:37 INFO mapred.JobClient:  map 100% reduce 0%
 09/02/23 17:19:42 INFO mapred.JobClient: Job complete: job_200902231717_0001
 09/02/23 17:19:42 INFO mapred.JobClient: Counters: 16
 09/02/23 17:19:42 INFO mapred.JobClient:   Job Counters
 09/02/23 17:19:42 INFO mapred.JobClient: Data-local map tasks=2
 09/02/23 17:19:42 INFO mapred.JobClient: Launched reduce tasks=1
 09/02/23 17:19:42 INFO mapred.JobClient: Launched map tasks=2
 09/02/23 17:19:42 INFO mapred.JobClient:   Map-Reduce Framework
 09/02/23 17:19:42 INFO mapred.JobClient: Map output records=25
 09/02/23 17:19:42 INFO mapred.JobClient: Reduce input records=23
 09/02/23 17:19:42 INFO mapred.JobClient: Map output bytes=238
 09/02/23 17:19:42 INFO mapred.JobClient: Map input records=5
 09/02/23 17:19:42 INFO mapred.JobClient: Combine output records=46
 09/02/23 17:19:42 INFO mapred.JobClient: Map input bytes=138
 09/02/23 17:19:42 INFO mapred.JobClient: Combine input records=48
 09/02/23 17:19:42 INFO mapred.JobClient: Reduce input groups=23
 09/02/23 17:19:42 INFO mapred.JobClient: Reduce output records=23
 09/02/23 17:19:42 INFO mapred.JobClient:   File Systems
 09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes written=175
 09/02/23 17:19:42 INFO mapred.JobClient: Local bytes written=648
 09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes read=208
 09/02/23 17:19:42 INFO mapred.JobClient: Local bytes read=281

 With two slave nodes, the job completes in 13 minutes, and the console output 
 is shown below:

 [r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar 
 org.myorg.WordCount input output2
 09/02/23 17:25:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool for the same.
 09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 09/02/23 17:25:39 INFO mapred.JobClient: Running job: job_200902231722_0001
 09/02/23 17:25:40 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/23 17:25:42 INFO mapred.JobClient:  map 50% reduce 0%
 09/02/23 17:25:43 INFO mapred.JobClient:  map 100% reduce 0%
 09/02/23 17:25:58 INFO mapred.JobClient:  map 100% reduce 16%
 09/02/23 17:38:31 INFO mapred.JobClient: Task Id : 
 attempt_200902231722_0001_m_00_0, Status : FAILED
 Too many fetch-failures
 09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to 
 host
 09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo route to 
 host
 09/02/23 17:38:43 INFO mapred.JobClient: Job complete: job_200902231722_0001
 09/02/23 17:38:43 INFO mapred.JobClient: Counters: 16
 09/02/23 17:38:43 INFO mapred.JobClient:   Job Counters
 09/02/23 17:38:43 INFO mapred.JobClient: Data-local map tasks=3
 09/02/23 17:38:43 INFO mapred.JobClient: Launched reduce tasks=1
 09/02/23 17:38:43 INFO mapred.JobClient: Launched map tasks=3
 09/02/23 17:38:43 INFO mapred.JobClient:   Map-Reduce Framework
 09/02/23 17:38:43 INFO mapred.JobClient: Map output records=25
 09/02/23 17:38:43 INFO mapred.JobClient: Reduce input records=23
 09/02/23 17:38:43 INFO mapred.JobClient: Map output bytes=238
 09/02/23 17:38:43 INFO mapred.JobClient: Map input records=5
 09/02/23 17:38:43 INFO mapred.JobClient: Combine output records=46
 09/02/23 17:38:43 INFO mapred.JobClient: Map input bytes=138
 09/02/23 17:38:43 INFO mapred.JobClient: Combine 

CfP 4th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'09)

2009-02-23 Thread Marcus Hardt
Apologies if you received multiple copies of this message.


=
CALL FOR PAPERS

4th Workshop on

Virtualization in High-Performance Cloud Computing
VHPC'09

as part of Euro-Par 2009, Delft, The Netherlands
=

Date: August 25, 2009

Euro-Par 2009:  http://europar2009.ewi.tudelft.nl/
Workshop URL:   http://vhpc.org

SUBMISSION  DEADLINE:
Abstracts:  March 12, 2009
Full Paper: June 8, 2009


Scope:

Virtualization has  become a common  abstraction layer in  modern data
centers,  enabling resource  owners to  manage  complex infrastructure
independently  of their  applications.   Conjointly virtualization  is
becoming  a driving  technology for  a manifold  of industry  grade IT
services. Piloted by the  Amazon Elastic Computing Cloud services, the
cloud  concept includes the  notion of  a separation  between resource
owners  and   users,  adding  services  such   as  hosted  application
frameworks  and  queuing. Utilizing  the  same infrastructure,  clouds
carry  significant potential  for use  in  high-performance scientific
computing. The ability of clouds  to provide for requests and releases
of vast computing resource dynamically  and close to the marginal cost
of  providing  the  services   is  unprecedented  in  the  history  of
scientific and commercial computing.

Distributed computing concepts that leverage federated resource access
are popular  within the grid  community, but have not yet seen the
previously hoped-for levels of deployment.  Also,  many  of  the  scientific
datacenters have not adopted virtualization or cloud concepts yet.

This  workshop aims to  bring together  industrial providers  with the
scientific community in order  to foster discussion, collaboration and
mutual exchange of knowledge and experience.

The  workshop will  be one  day in  length, composed  of 20  min paper
presentations,   each  followed   by  10   min   discussion  sections.
Presentations  may be  accompanied by  interactive  demonstrations. It
concludes with a 30 min panel discussion by presenters.


TOPICS

Topics include, but are not limited to, the following subjects:

- Virtualization in cloud, cluster and grid environments
- VM-based cloud performance modeling
- Workload characterizations for VM-based environments
- Software as a Service (SaaS)
- Cloud reliability, fault-tolerance, and security
- Cloud, cluster and grid filesystems for VMs
- QoS and service level guarantees
- Virtualized I/O
- VMMs and storage virtualization
- Research and education use cases
- VM cloud, cluster distribution algorithms
- MPI, PVM  on virtual machines
- Cloud APIs
- Cloud load balancing
- Hardware support for virtualization
- High-performance network virtualization
- High-speed interconnects
- Bottleneck management
- Hypervisor extensions and tools for cluster and grid computing
- Network architectures for VM-based environments
- VMMs/Hypervisors
- Cloud use cases
- Performance management and tuning hosts and guest VMs
- Fault tolerant VM environments
- VMM performance tuning on various load types
- Cloud provisioning
- Xen/other VMM cloud/cluster/grid tools
- Device access from VMs
- Management, deployment of VM-based environments


PAPER SUBMISSION

Papers  submitted to the  workshop will  be reviewed  by at  least two
members of  the program committee and  external reviewers. Submissions
should  include  abstract,  key  words,  the  e-mail  address  of  the
corresponding author,  and must not exceed 10  pages, including tables
and figures at  a main font size no smaller  than 11 point. Submission
of a paper  should be regarded as a commitment  that, should the paper
be accepted, at least one of  the authors will register and attend the
conference to present the work.

Accepted papers  will be published in  the Springer LNCS  series - the
format  must  be  according   to  the  Springer  LNCS  Style.  Initial
submissions are in PDF; accepted  papers will be requested to provide
source files.

Format Guidelines:  http://www.springer.de/comp/lncs/authors.html
Submission Link: http://edas.info/newPaper.php?c=7364

IMPORTANT DATES

March 12  -  Abstract submission due
June 8-  Full  paper  submission
July 14   -  Acceptance notification
August 3  -  Camera-ready version due
August 25-28  -  Conference

CHAIR

Michael Alexander (chair),  Scaled Infrastructure KG, Austria
Marcus Hardt (co-chair), Forschungszentrum Karlsruhe, Germany


PROGRAM COMMITTEE

Padmashree Apparao,  Intel Corp., USA
Hassan Barada, Khalifa University, UAE
Volker Buege,  University of Karlsruhe, Germany
Isabel Campos,  IFCA, Spain
Stephen Childs,  Trinity College Dublin, Ireland
William Gardner,  University of Guelph, Canada
Derek Groen,  UVA, The Netherlands
Ahmad Hammad,  FZK, Germany
Sverre Jarp,  CERN, Switzerland
Xuxian Jiang,  NC State, USA
Kenji Kaneda,  University of Tokyo, Japan

Re: Reducer hangs at 16%

2009-02-23 Thread Matei Zaharia
The fact that it works with one slave node doesn't mean much, because when
the slave is alone, it's copying map outputs from itself and thus not going
through the firewall. It sounds like the slaves can't open a connection to
each other, which could well mean a firewall problem. Can you look at the
output of the reduce task (by clicking it in the running tasks column in
the web UI and going on to see the last 8k of output)? I imagine it will
have fetched data from one slave and will be failing to connect to the other
one.
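
As a quick sanity check (assuming the default TaskTracker HTTP port of 50060,
which is what serves the map outputs), you could try connecting from one slave
to the other, for example:

telnet fedora4 50060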

On Mon, Feb 23, 2009 at 5:03 AM, Jagadesh_Doddi
jagadesh_do...@satyam.comwrote:

 It works as longs as I use any one of the slave nodes.
 The moment I add both the slave nodes to conf/slaves, It fails.
 So there is no issue with firewall or /etc/hosts entries.

 Thanks and Regards

 Jagadesh Doddi
 Telephone: 040-30657556
 Mobile: 9949497414



 -Original Message-
 From: Amar Kamat [mailto:ama...@yahoo-inc.com]
 Sent: Monday, February 23, 2009 6:26 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Reducer hangs at 16%

 Looks like the reducer is able to fetch map output files from the local
 box but fails to fetch it from the remote box. Can you check if there is
 no firewall issue or /etc/hosts entries are correct?
 Amar
 Jagadesh_Doddi wrote:
  Hi
 
  I have changed the configuration to run Name node and job tracker on the
 same system.
  The job is started with bin/start-all.sh on NN
  With a single slave node, the job completes in 12 seconds, and the
 console output is shown below:
 
  [r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar
 org.myorg.WordCount input output1
  09/02/23 17:19:30 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
  09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to
 process : 1
  09/02/23 17:19:30 INFO mapred.FileInputFormat: Total input paths to
 process : 1
  09/02/23 17:19:30 INFO mapred.JobClient: Running job:
 job_200902231717_0001
  09/02/23 17:19:31 INFO mapred.JobClient:  map 0% reduce 0%
  09/02/23 17:19:37 INFO mapred.JobClient:  map 100% reduce 0%
  09/02/23 17:19:42 INFO mapred.JobClient: Job complete:
 job_200902231717_0001
  09/02/23 17:19:42 INFO mapred.JobClient: Counters: 16
  09/02/23 17:19:42 INFO mapred.JobClient:   Job Counters
  09/02/23 17:19:42 INFO mapred.JobClient: Data-local map tasks=2
  09/02/23 17:19:42 INFO mapred.JobClient: Launched reduce tasks=1
  09/02/23 17:19:42 INFO mapred.JobClient: Launched map tasks=2
  09/02/23 17:19:42 INFO mapred.JobClient:   Map-Reduce Framework
  09/02/23 17:19:42 INFO mapred.JobClient: Map output records=25
  09/02/23 17:19:42 INFO mapred.JobClient: Reduce input records=23
  09/02/23 17:19:42 INFO mapred.JobClient: Map output bytes=238
  09/02/23 17:19:42 INFO mapred.JobClient: Map input records=5
  09/02/23 17:19:42 INFO mapred.JobClient: Combine output records=46
  09/02/23 17:19:42 INFO mapred.JobClient: Map input bytes=138
  09/02/23 17:19:42 INFO mapred.JobClient: Combine input records=48
  09/02/23 17:19:42 INFO mapred.JobClient: Reduce input groups=23
  09/02/23 17:19:42 INFO mapred.JobClient: Reduce output records=23
  09/02/23 17:19:42 INFO mapred.JobClient:   File Systems
  09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes written=175
  09/02/23 17:19:42 INFO mapred.JobClient: Local bytes written=648
  09/02/23 17:19:42 INFO mapred.JobClient: HDFS bytes read=208
  09/02/23 17:19:42 INFO mapred.JobClient: Local bytes read=281
 
  With two slave nodes, the job completes in 13 minutes, and the console
 output is shown below:
 
  [r...@fedora1 hadoop-0.18.3]# bin/hadoop jar samples/wordcount.jar
 org.myorg.WordCount input output2
  09/02/23 17:25:38 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
  09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to
 process : 1
  09/02/23 17:25:38 INFO mapred.FileInputFormat: Total input paths to
 process : 1
  09/02/23 17:25:39 INFO mapred.JobClient: Running job:
 job_200902231722_0001
  09/02/23 17:25:40 INFO mapred.JobClient:  map 0% reduce 0%
  09/02/23 17:25:42 INFO mapred.JobClient:  map 50% reduce 0%
  09/02/23 17:25:43 INFO mapred.JobClient:  map 100% reduce 0%
  09/02/23 17:25:58 INFO mapred.JobClient:  map 100% reduce 16%
  09/02/23 17:38:31 INFO mapred.JobClient: Task Id :
 attempt_200902231722_0001_m_00_0, Status : FAILED
  Too many fetch-failures
  09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo
 route to host
  09/02/23 17:38:31 WARN mapred.JobClient: Error reading task outputNo
 route to host
  09/02/23 17:38:43 INFO mapred.JobClient: Job complete:
 job_200902231722_0001
  09/02/23 17:38:43 INFO mapred.JobClient: Counters: 16
  09/02/23 17:38:43 INFO mapred.JobClient:   Job Counters
  09/02/23 17:38:43 INFO mapred.JobClient: Data-local map 

Re: Can anyone verify Hadoop FS shell command return codes?

2009-02-23 Thread Roldano Cattoni
You should distinguish between the output of a command and the return value
of the command: usually they are captured in different ways by the
interpreters (scripting languages or shells).

For example:

1) in perl the return value is captured by using the system function:
   $rv = system("cmd");
so the $rv variable contains the value returned by cmd. 
Instead, with backticks you get the output of cmd:
   $out = `cmd`;

2) in shells (sh/bash/tcsh) the return value is stored in the variable $?
(dollar char followed by question-mark char).
Instead, the output is again obtained with backticks.
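
For example, from a shell you can check the FS shell's exit status directly:

   hadoop fs -get /user/hadoop/testFile.txt .
   echo $?      # 0 on success, non-zero on error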

I don't know the way in which irb captures the return value: for analogy I
would say that backticks are used for capturing the output even in irb.

Best

  Roldano



On Mon, Feb 23, 2009 at 02:02:22PM +0100, S D wrote:
 I'm attempting to use Hadoop FS shell
 (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a ruby script. My
 challenge is that I'm unable to get the function return value of the
 commands I'm invoking. As an example, I try to run get as follows
 
 hadoop fs -get /user/hadoop/testFile.txt .
 
 From the command line this generally works but I need to be able to verify
 that it is working during execution in my ruby script. The command should
 return 0 on success and -1 on error. Based on
 
 http://pasadenarb.com/2007/03/ruby-shell-commands.html
 
 I am using backticks to make the hadoop call and get the return value. Here
 is a dialogue within irb (Ruby's interactive shell) in which the command was
 not successful:
 
  irb(main):001:0> `hadoop dfs -get testFile.txt .`
  get: null
  => ""
 
 and a dialogue within irb in which the command was successful
 
  irb(main):010:0> `hadoop dfs -get testFile.txt .`
  => ""
 
 In both cases, neither a 0 nor a 1 appeared as a return value; indeed
 nothing was returned. Can anyone who is using the FS command shell return
 values within any scripting language (Ruby, PHP, Perl, ...) please confirm
 that it is working as expected or send an example snippet?
 
 Thanks,
 John


Re: Hadoop Streaming -file option

2009-02-23 Thread Arun C Murthy


On Feb 23, 2009, at 2:01 AM, Bing TANG wrote:


Hi, everyone,
Could someone tell me the principle of the -file option when using Hadoop
Streaming? I want to ship a big file to the slaves, so how does it work?

Does Hadoop use SCP to copy? How does Hadoop deal with the -file option?



No, -file just copies the file from the local filesystem to HDFS, and  
the DistributedCache copies it to the local filesystem of the node on  
which the map/reduce task runs.
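
For context, a typical streaming invocation that ships local files with the job
might look roughly like this (the jar path and the file names are illustrative,
not taken from this thread):

bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -input  myInput \
    -output myOutput \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py \
    -file big-lookup-file.dat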


Arun



Batching key/value pairs to map

2009-02-23 Thread Jimmy Wan
part of my map/reduce process could be greatly sped up by mapping
key/value pairs in batches instead of mapping them one by one. I'd
like to do the following:
protected abstract void batchMap(OutputCollector<K2, V2> k2V2OutputCollector,
        Reporter reporter) throws IOException;

public void map(K1 key1, V1 value1, OutputCollector<K2, V2> output,
        Reporter reporter) throws IOException {
    keys.add(key1.copy());
    values.add(value1.copy());
    if (++currentSize == batchSize) {
        batchMap(output, reporter);
        clear();
    }
}

public void close() throws IOException {
    if (currentSize > 0) {
        // I don't have access to my OutputCollector or Reporter here!
        batchMap(output, reporter);
        clear();
    }
}

Can I safely hang onto my OutputCollector and Reporter from calls to map?

I'm currently running Hadoop 0.17.2.1. Is this something I could do in
Hadoop 0.19.X?


Re: Batching key/value pairs to map

2009-02-23 Thread Owen O'Malley
On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan ji...@indeed.com wrote:

 part of my map/reduce process could be greatly sped up by mapping
 key/value pairs in batches instead of mapping them one by one.
 Can I safely hang onto my OutputCollector and Reporter from calls to map?


Yes. You can even use them in the close, so that you can process the last
batch of records. *smile* One problem that you will quickly hit is that
Hadoop reuses the objects that are passed to map and reduce. So, you'll need
to clone them before putting them into the collection.
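
As a rough sketch of that pattern with the old org.apache.hadoop.mapred API
(the class name, the Text types and the batch size below are illustrative
assumptions, not taken from the thread):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BatchingMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private static final int BATCH_SIZE = 1000;
  private final List<Text> keys = new ArrayList<Text>();
  private final List<Text> values = new ArrayList<Text>();
  private OutputCollector<Text, Text> output;   // saved from map() for use in close()
  private Reporter reporter;

  public void map(Text key, Text value, OutputCollector<Text, Text> out,
                  Reporter rep) throws IOException {
    output = out;                  // hang on to the collector and reporter
    reporter = rep;
    keys.add(new Text(key));       // clone: the framework reuses these objects
    values.add(new Text(value));
    if (keys.size() >= BATCH_SIZE) {
      flushBatch();
    }
  }

  public void close() throws IOException {
    if (!keys.isEmpty()) {
      flushBatch();                // process the final, partial batch
    }
  }

  private void flushBatch() throws IOException {
    for (int i = 0; i < keys.size(); i++) {
      // real batch processing would go here; this sketch just re-emits the pairs
      output.collect(keys.get(i), values.get(i));
    }
    reporter.progress();           // keep a long-running batch from timing out
    keys.clear();
    values.clear();
  }
}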

I'm currently running Hadoop 0.17.2.1. Is this something I could do in
 Hadoop 0.19.X?


I don't think any of this changed between 0.17 and 0.19, other than in 0.17
the reduce's inputs were always new objects. In 0.18 and after, the reduce's
inputs are reused.

-- Owen


Re: Batching key/value pairs to map

2009-02-23 Thread Jimmy Wan
Great, thanks Owen. I actually ran into the object reuse problem a
long time ago. The output of my MR processes gets turned into a series
of large INSERT statements that weren't performing well unless I batched
them into inserts of several thousand entries. I'm not sure if this is
possible, but it would certainly be nice to either:
1) pass the OutputCollector and Reporter to the close() method.
2) Provide accessors to the OutputCollector and the Reporter.

Now every single one of my maps is going to have a pair of 1-2 extra no-ops.

I'll check to see if that's on the list of outstanding FRs.

On Mon, Feb 23, 2009 at 15:30, Owen O'Malley owen.omal...@gmail.com wrote:
 On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan ji...@indeed.com wrote:

 part of my map/reduce process could be greatly sped up by mapping
 key/value pairs in batches instead of mapping them one by one.
 Can I safely hang onto my OutputCollector and Reporter from calls to map?

 Yes. You can even use them in the close, so that you can process the last
 batch of records. *smile* One problem that you will quickly hit is that
 Hadoop reuses the objects that are passed to map and reduce. So, you'll need
 to clone them before putting them into the collection.

 I'm currently running Hadoop 0.17.2.1. Is this something I could do in
 Hadoop 0.19.X?

 I don't think any of this changed between 0.17 and 0.19, other than in 0.17
 the reduce's inputs were always new objects. In 0.18 and after, the reduce's
 inputs are reused.


Re: Batching key/value pairs to map

2009-02-23 Thread Owen O'Malley


On Feb 23, 2009, at 2:19 PM, Jimmy Wan wrote:


 I'm not sure if this is
possible, but it would certainly be nice to either:
1) pass the OutputCollector and Reporter to the close() method.
2) Provide accessors to the OutputCollector and the Reporter.


If you look at the 0.20 branch, which hasn't released yet, there is a  
new map/reduce api. That api does provide a lot more control. Take a  
look at Mapper, which provide setup, map, and cleanup hooks:


http://tinyurl.com/bquvxq

The map method looks like:

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

But there is also a run method that drives the task. The default is  
given below, but it can be overridden by the application.


  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }

Clearly, in your application you could override run to make a list of  
100 key, value pairs or something.
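
For example, a rough sketch of such an override (the batchMap, copyKey and
copyValue methods below are hypothetical application hooks, not part of the
Hadoop API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: buffers input pairs and hands them to a user-supplied batchMap().
public abstract class BatchingNewApiMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  private static final int BATCH_SIZE = 100;

  // Application-specific: process one batch, writing any output via the context.
  protected abstract void batchMap(List<KEYIN> keys, List<VALUEIN> values,
                                   Context context)
      throws IOException, InterruptedException;

  // Application-specific copies (e.g. new Text(t)); needed because the
  // framework reuses the key/value objects it hands out.
  protected abstract KEYIN copyKey(KEYIN key);
  protected abstract VALUEIN copyValue(VALUEIN value);

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    List<KEYIN> keys = new ArrayList<KEYIN>();
    List<VALUEIN> values = new ArrayList<VALUEIN>();
    while (context.nextKeyValue()) {
      keys.add(copyKey(context.getCurrentKey()));
      values.add(copyValue(context.getCurrentValue()));
      if (keys.size() == BATCH_SIZE) {
        batchMap(keys, values, context);
        keys.clear();
        values.clear();
      }
    }
    if (!keys.isEmpty()) {
      batchMap(keys, values, context);   // last, partial batch
    }
    cleanup(context);
  }
}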


-- Owen


Re: Design issue for a problem using Map Reduce

2009-02-23 Thread some speed
Thanks Sagar...That helps to a certain extent.
But is dependency not a common occurrence among equations? Doesn't Hadoop
provide a way to solve such equations in parallel?
Going in for a sequential calculation might prove to be a major performance
degradation given tens of thousands of numbers. Does anyone have any ideas?

Thanks.

On Sun, Feb 15, 2009 at 1:34 AM, Sagar Naik sn...@attributor.com wrote:

 Here is one thought
 N maps and 1 Reduce,
 input to map: t,w(t)
 output of map t, w(t)*w(t)
 I assume t is an integer. So in the case of 1 reducer, you will receive
 t0, square(w(0))
 t1, square(w(1))
 t2, square(w(2))
 t3, square(w(3))
 Note this will be a sorted series on t.

 in reduce

 static prevF = 0;

 reduce(t, square_w_t)
 {
  f = square_w_t * A  + B * prevF ;
  output.collect(t,f)
  prevF = f
 }

 In my view, the B*F(t-1) step is inherently sequential.
 So all we can do is parallelize the A*w(t)*w(t) part.

 -Sagar


 some speed wrote:

 Hello all,

 I am trying to implement a Map Reduce Chain to solve a particular
 statistic
 problem. I have come to a point where I have to solve the following type
 of
 equation in Hadoop:

 F(t)= A*w(t)*w(t) + B*F(t-1); Given: F(0)=0, A and B are Alpha and
 Beta
 and their values are known.

 Now, W is series of numbers (There could be *a million* or more numbers).

 So to Solve the equation in terms of Map Reduce, there are basically 2
 issues which I can think of:

 1) How will I be able to get the value of F(t-1) since it means as each
 step
 i need the value from the previous iteration. And that is not possible
 while
 computing parallely.
 2) the w(t) values have to be read and applied in order also ,and, again
 that is a prb while computing parallely.

 Can some please help me go abt this problem and overcome the issues?

 Thanks,

 Sharath






Re: the question about the common pc?

2009-02-23 Thread Tim Wintle
On Mon, 2009-02-23 at 11:14 +, Steve Loughran wrote:
 Dumbo provides py support under Hadoop:
   http://wiki.github.com/klbostee/dumbo
   https://issues.apache.org/jira/browse/HADOOP-4304

Ooh, nice - I hadn't seen dumbo. That's far cleaner than the python
wrapper to streaming I'd hacked together.

I'm probably going to be using hadoop more again in the near future so
I'll bookmark that, thanks Steve.

Personally I only need text based records, so I'm fine using a wrapper
around streaming

Tim Wintle



mysql metastore problems

2009-02-23 Thread Ryan Shih
Hi, I'm having some problems setting up the metastore using mysql. I've
browsed the message archives, but don't see anything that helps. My
configuration files, look like:
**hive-site.xml**
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>this is local store</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>default location for Hive tables</description>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>/home/ryan/hive/branch-0.2/install/custom/</value>
  <description>where custom serdes live </description>
</property>
</configuration>


**jpox.properties**
javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionUserName=hive_user
javax.jdo.option.ConnectionPassword=hive_pass
org.jpox.cache.level2=true
org.jpox.cache.level2.type=SOFT

And then with just the default hive-default.xml file. I haven't worked with
JPOX tables before, but I'm under the impression that they will be created
automatically because of the autoCreateSchema flag. Has anyone had any
luck with this?

I'm getting the following error:
hive r...@dali:~/hive/branch-0.2/install/bin$ hive
Hive history file=/tmp/ryan/hive_job_log_ryan_200902231722_1088538316.txt
hive> create table test_table (id INT, name STRING);
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Error adding
class org.apache.hadoop.hive.metastore.model.MDatabase to list of
persistence-managed classes : Table/View 'JPOX_TABLES' does not exist.
java.sql.SQLSyntaxErrorException: Table/View 'JPOX_TABLES' does not exist.
at
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown
Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown
Source)
at
org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown
Source)
at
org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown
Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement20.<init>(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement30.<init>(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement40.<init>(Unknown
Source)
at org.apache.derby.jdbc.Driver40.newEmbedPreparedStatement(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown
Source)
at
org.jpox.store.rdbms.SQLController.getStatementForQuery(SQLController.java:324)
at
org.jpox.store.rdbms.SQLController.getStatementForQuery(SQLController.java:263)
at org.jpox.store.rdbms.table.SchemaTable.hasClass(SchemaTable.java:280)
at org.jpox.store.rdbms.table.SchemaTable.addClass(SchemaTable.java:222)
at
org.jpox.store.rdbms.SchemaAutoStarter.addClass(SchemaAutoStarter.java:255)
at
org.jpox.store.AbstractStoreManager.registerStoreData(AbstractStoreManager.java:363)
at org.jpox.store.rdbms.RDBMSManager.access$3000(RDBMSManager.java:171)
at
org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTable(RDBMSManager.java:3001)
at
org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTables(RDBMSManager.java:2804)
at
org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTablesAndValidate(RDBMSManager.java:3098)
at
org.jpox.store.rdbms.RDBMSManager$ClassAdder.run(RDBMSManager.java:2729)
at
org.jpox.store.rdbms.RDBMSManager$MgmtTransaction.execute(RDBMSManager.java:2609)
at org.jpox.store.rdbms.RDBMSManager.addClasses(RDBMSManager.java:825)
at
org.jpox.store.AbstractStoreManager.addClass(AbstractStoreManager.java:624)
at
org.jpox.store.mapped.MappedStoreManager.getDatastoreClass(MappedStoreManager.java:343)
at
org.jpox.store.rdbms.RDBMSManager.getPropertiesForGenerator(RDBMSManager.java:1630)
at
org.jpox.store.AbstractStoreManager.getStrategyValue(AbstractStoreManager.java:945)
at org.jpox.ObjectManagerImpl.newObjectId(ObjectManagerImpl.java:2473)
at
org.jpox.state.JDOStateManagerImpl.setIdentity(JDOStateManagerImpl.java:792)
at

Re: mysql metastore problems

2009-02-23 Thread Ryan Shih
Please ignore the last message; I must have replied to an older Hive
message, and it ended up on the Hadoop core mailing list instead. I've reposted
it on the hive user list.
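
For anyone who hits the same 'JPOX_TABLES' error: the stack trace quoted below
comes from Derby rather than MySQL, which usually means the overrides in
jpox.properties are not being picked up from the classpath. The same JDO
settings can also go directly into hive-site.xml (property names as in
hive-default.xml; the URL, user name, and password are just the placeholders
from the post). A minimal sketch:

**hive-site.xml (JDO connection settings, sketch)**
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive_user</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive_pass</value>
</property>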

On Mon, Feb 23, 2009 at 5:29 PM, Ryan Shih ryan.s...@gmail.com wrote:

 Hi, I'm having some problems setting up the metastore using mysql. I've
 browsed the message archives, but don't see anything that helps. My
 configuration files look like:
 **hive-site.xml**
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <configuration>
 <property>
 <name>hive.metastore.local</name>
 <value>true</value>
 <description>this is local store</description>
 </property>
 <property>
 <name>hive.metastore.warehouse.dir</name>
 <value>/user/hive/warehouse</value>
 <description>default location for Hive tables</description>
 </property>
 <property>
 <name>hive.aux.jars.path</name>
 <value>/home/ryan/hive/branch-0.2/install/custom/</value>
 <description>where custom serdes live</description>
 </property>
 </configuration>


 **jpox.properties**

 javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
 org.jpox.validateTables=false
 org.jpox.validateColumns=false
 org.jpox.validateConstraints=false
 org.jpox.storeManagerType=rdbms
 org.jpox.autoCreateSchema=true
 org.jpox.autoStartMechanismMode=checked
 org.jpox.transactionIsolation=read_committed
 javax.jdo.option.DetachAllOnCommit=true
 javax.jdo.option.NontransactionalRead=true
 javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver

 javax.jdo.option.ConnectionURL=jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true
 javax.jdo.option.ConnectionUserName=hive_user
 javax.jdo.option.ConnectionPassword=hive_pass
 org.jpox.cache.level2=true
 org.jpox.cache.level2.type=SOFT

 And otherwise I'm using just the default hive-default.xml file. I haven't worked
 with JPOX tables before, but I'm under the impression that the tables will be
 created automatically because of the autoCreateSchema flag. Has anyone had any
 luck with this?

 I'm getting the following error:
 hive r...@dali:~/hive/branch-0.2/install/bin$ hive
 Hive history file=/tmp/ryan/hive_job_log_ryan_200902231722_1088538316.txt
 hive> create table test_table (id INT, name STRING);
 FAILED: Error in metadata: javax.jdo.JDODataStoreException: Error adding
 class org.apache.hadoop.hive.metastore.model.MDatabase to list of
 persistence-managed classes : Table/View 'JPOX_TABLES' does not exist.
 java.sql.SQLSyntaxErrorException: Table/View 'JPOX_TABLES' does not exist.
 at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
 at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
 at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
 at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
 at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedPreparedStatement20.<init>(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedPreparedStatement30.<init>(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedPreparedStatement40.<init>(Unknown Source)
 at org.apache.derby.jdbc.Driver40.newEmbedPreparedStatement(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
 at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
 at org.jpox.store.rdbms.SQLController.getStatementForQuery(SQLController.java:324)
 at org.jpox.store.rdbms.SQLController.getStatementForQuery(SQLController.java:263)
 at org.jpox.store.rdbms.table.SchemaTable.hasClass(SchemaTable.java:280)
 at org.jpox.store.rdbms.table.SchemaTable.addClass(SchemaTable.java:222)
 at org.jpox.store.rdbms.SchemaAutoStarter.addClass(SchemaAutoStarter.java:255)
 at org.jpox.store.AbstractStoreManager.registerStoreData(AbstractStoreManager.java:363)
 at org.jpox.store.rdbms.RDBMSManager.access$3000(RDBMSManager.java:171)
 at org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTable(RDBMSManager.java:3001)
 at org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTables(RDBMSManager.java:2804)
 at org.jpox.store.rdbms.RDBMSManager$ClassAdder.addClassTablesAndValidate(RDBMSManager.java:3098)
 at org.jpox.store.rdbms.RDBMSManager$ClassAdder.run(RDBMSManager.java:2729)
 at org.jpox.store.rdbms.RDBMSManager$MgmtTransaction.execute(RDBMSManager.java:2609)
 at org.jpox.store.rdbms.RDBMSManager.addClasses(RDBMSManager.java:825)
 at org.jpox.store.AbstractStoreManager.addClass(AbstractStoreManager.java:624)
 at org.jpox.store.mapped.MappedStoreManager.getDatastoreClass(MappedStoreManager.java:343)
 at ...
 

Re: Batching key/value pairs to map

2009-02-23 Thread Edward Capriolo
We have a MR program that collects once for each token on a line. What
types of applications can benefit from batch mapping?
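
A minimal sketch (not the actual program from the post) of the pattern described
above -- one collect() per token on a line -- written against the 0.19-era
org.apache.hadoop.mapred API; the class name and Writable types are assumptions:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TokenMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text token = new Text();

  // Emit (token, 1) once for every token on the input line.
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens()) {
      token.set(tok.nextToken());
      output.collect(token, ONE);
    }
  }
}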


Re: How to use JobConf.setKeyFieldPartitionerOptions() method

2009-02-23 Thread Nick Cen
Thanks Jason, It works.

2009/2/23 jason hadoop jason.had...@gmail.com

 For reasons that are not clear, in 0.19 the partitioner steps one character
 past the end of the field unless you are very explicit in your key
 specification.
 One would assume that -k2 would pick up the second token, even if it is the
 last field in the key, but -k2,2 is required.

 As near as I can tell, the -kX syntax means piece X including the trailing
 separator character, which of course will not be present if this is the last
 piece.

 In your case, try setKeyFieldPartitionerOptions("-k 1,1"); I believe it
 will work.

 On Sun, Feb 22, 2009 at 12:55 AM, Nick Cen cenyo...@gmail.com wrote:

  Hi All,
 
  Assume the output key from the mapper has the format k1,k2. What I want
  to do is to use k1, instead of the whole key, to partition the output. What
  parameter value should I provide to setKeyFieldPartitionerOptions()? I
  have tried "-k 1", but it throws an ArrayIndexOutOfBoundsException. Thanks
  in advance.
 
  --
  http://daily.appspot.com/food/
 




-- 
http://daily.appspot.com/food/
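
To make the fix concrete, here is a driver-side sketch of the options discussed
above, written against the 0.19-era JobConf API. The ',' separator matches the
k1,k2 key format from the question; the class name is just a placeholder:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner;

public class PartitionOnFirstField {
  // Partition map output keys of the form "k1,k2" on k1 only.
  public static void configure(JobConf conf) {
    // Tell the partitioner how the key is split into fields.
    conf.set("map.output.key.field.separator", ",");
    conf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    // Be explicit about the end field ("-k1,1"); per the discussion above,
    // a bare "-k1" can step one character past the end of the field.
    conf.setKeyFieldPartitionerOptions("-k1,1");
  }
}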


Re: Limit number of records or total size in combiner input using jobconf?

2009-02-23 Thread Saptarshi Guha
Thank you.




On Fri, Feb 20, 2009 at 5:34 PM, Chris Douglas chri...@yahoo-inc.com wrote:
 So here are my questions:
 (1) Is there a JobConf hint to limit the number of records in kviter?
 I can (and have) made a fix to my code that processes the values in the
 combiner step in batches (i.e. it takes N at a go, processes them, and
 repeats), but I was wondering if I could just set an option.

 Approximately and indirectly, yes. You can limit the amount of memory
 allocated to storing serialized records in memory (io.sort.mb) and the
 percentage of that space reserved for storing record metadata
 (io.sort.record.percent, IIRC). That can be used to limit the number of
 records in each spill, though you may also need to disable the combiner
 during the merge, where you may run into the same problem.

 You're almost certainly better off designing your combiner to scale well (as
 you have), since you'll hit this in the reduce, too.

 Since this occurred in the MapContext, changing the number of reducers
 won't help.
 (2) How does changing the number of reducers help at all? I have 7
 machines, so I feel 11 (a prime close to 7; why a prime?) is good
 enough (some machines are 16GB, others 32GB).

 Your combiner will look at all the records for a partition and only those
 records in a partition. If your partitioner distributes your records evenly
 in a particular spill, then increasing the total number of partitions will
 decrease the number of records your combiner considers in each call. For
 most partitioners, whether the number of reducers is prime should be
 irrelevant. -C
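
A sketch of the two knobs mentioned above, set through the 0.19-era JobConf API;
the values are illustrative placeholders, not recommendations, and the exact
defaults for io.sort.record.percent should be checked against your release:

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  // Shrinking the map-side sort buffer (and adjusting the metadata fraction)
  // indirectly caps how many records land in each spill, and therefore how
  // many records a single combiner call sees.
  public static void configure(JobConf conf) {
    conf.setInt("io.sort.mb", 50);                  // buffer for serialized map output, in MB
    conf.setFloat("io.sort.record.percent", 0.10f); // fraction reserved for record metadata
  }
}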



hdfs disappears

2009-02-23 Thread Anh Vũ Nguyễn
Hi everyone!
I am using Hadoop Core (version 0.19.0), OS: Ubuntu 8.04, on a single
machine (for testing purposes). Every time I shut down my computer and turn it
on again, I can't access the distributed file system just by running
${HADOOP_HOME}/bin/start-all.sh. All the data has disappeared, and I have
to reformat the file system (using ${HADOOP_HOME}/bin/hadoop namenode
-format) before start-all.sh. Can anyone explain how to fix this
problem?
Thanks in advance.
Vu Nguyen.


Re: hdfs disappears

2009-02-23 Thread Brian Bockelman

Hello,

Where are you saving your data?  If it's being written into /tmp, it  
will be deleted every time you restart your computer.  I believe  
writing into /tmp is the default for Hadoop unless you changed it in  
hadoop-site.xml.


Brian
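
If /tmp is indeed where the data lives, here is a minimal hadoop-site.xml sketch
that moves it somewhere persistent; hadoop.tmp.dir is the 0.19-era base property
that dfs.name.dir and dfs.data.dir default under, and the path below is a
placeholder to adjust:

**hadoop-site.xml (sketch)**
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-data</value>
<description>base directory for HDFS and MapReduce local data; keep it outside /tmp</description>
</property>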

On Feb 23, 2009, at 10:00 PM, Anh Vũ Nguyễn wrote:


Hi everyone!
I am using Hadoop Core (version 0.19.0), OS: Ubuntu 8.04, on a single
machine (for testing purposes). Every time I shut down my computer and
turn it on again, I can't access the distributed file system just by
running ${HADOOP_HOME}/bin/start-all.sh. All the data has disappeared,
and I have to reformat the file system (using ${HADOOP_HOME}/bin/hadoop
namenode -format) before start-all.sh. Can anyone explain how to fix this
problem?
Thanks in advance.
Vu Nguyen.




Re: hdfs disappears

2009-02-23 Thread Mark Kerzner
Exactly the same thing happened to me, and Brian gave the same answer. What
if the default is changed to the user's home directory somewhere?

On Mon, Feb 23, 2009 at 10:05 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

 Hello,

 Where are you saving your data?  If it's being written into /tmp, it will
 be deleted every time you restart your computer.  I believe writing into
 /tmp is the default for Hadoop unless you changed it in hadoop-site.xml.

 Brian


 On Feb 23, 2009, at 10:00 PM, Anh Vũ Nguyễn wrote:

  Hi everyone!
 I am using Hadoop Core (version 0.19.0), OS: Ubuntu 8.04, on a single
 machine (for testing purposes). Every time I shut down my computer and turn
 it on again, I can't access the distributed file system just by running
 ${HADOOP_HOME}/bin/start-all.sh. All the data has disappeared, and I have
 to reformat the file system (using ${HADOOP_HOME}/bin/hadoop namenode
 -format) before start-all.sh. Can anyone explain how to fix this
 problem?
 Thanks in advance.
 Vu Nguyen.