Installation and Configuration

2008-03-10 Thread edward yoon
I tried to install the current trunk version, and I received this error message:

[EMAIL PROTECTED] hadoop]# bin/hadoop dfs -ls
ls: Cannot access .: No such file or directory.
08/03/10 15:39:49 INFO fs.FileSystem: FileSystem.closeAll() threw an exception:
java.io.IOException: Filesystem closed

What's wrong?
Thanks.
-- 
B. Regards,
Edward yoon @ NHN, corp.


Re: [HOD] Collecting MapReduce logs

2008-03-10 Thread Hemanth Yamijala

Luca,

Luca wrote:

Hello everyone,
I wonder what is the meaning of hodring.log-destination-uri versus 
hodring.log-dir. I'd like to collect MapReduce UI logs after a job has 
been run and the only attribute seems to be hod.hadoop-ui-log-dir, in 
the hod section.


log-destination-uri is a config option for uploading hadoop logs after 
the cluster is deallocated. log-dir is used to store logs generated by 
the HOD processes itself. If you want MapReduce UI logs, 
hadoop-ui-log-dir is what you want, as you rightly noted.
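
For reference, a rough sketch of where each of those three options lives in a hodrc; the paths and URI below are made up for illustration, not taken from this thread:

[hod]
hadoop-ui-log-dir=/home/someuser/hod-ui-logs

[hodring]
log-dir=/var/log/hod
log-destination-uri=hdfs://namenode:9000/user/someuser/hod-logs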


With that attribute specified, logs are all grabbed in that directory, 
producing a large number of HTML files. Is there a way to collect 
them, maybe as a .tar.gz, in a place somehow related to the user?


Sorry, no, we don't have that option yet. In fact going forward, Hadoop 
might solve this problem on its own. HADOOP-2178 seems to be related to 
this, but I haven't looked at it too closely.


Additionally, how do administrators specify variables in these values? 
Which interpreter interprets them? For instance, variables specified 
in a bash fashion like $USER in section hodring or ringmaster work 
well (I guess they are interpreted by bash itself) but if specified in 
the hod section they're not: I tried with

[hod]
hadoop-ui-log-dir=/somedir/$USER
but any hod command fails displaying an error on that line.

We are definitely planning to build this capability, as part of the work 
we will be doing for HADOOP-2849.

Cheers,
Luca





File size and number of files considerations

2008-03-10 Thread Naama Kraus
Hi,

In our system, we plan to upload data into Hadoop from external sources and
use it later on for analysis tasks. The interface to the external
repositories allows us to fetch pieces of data in chunks. E.g. get n records
at a time. Records are relatively small, though the overall amount of data
is assumed to be large. For each repository, we fetch pieces of data in a
serial manner. Number of repositories is small (few of them).

My first step is to put the data in plain files in HDFS. My question is what
file sizes are optimal to use. Many small files (to the extreme of one
record per file)? I guess not. A few huge files, each holding all data of the
same type? Or maybe put each chunk we get in a separate file, and close it
right after the chunk is uploaded?

How would HDFS perform best, with a few large files or with more, smaller files? As
I wrote, we plan to run MapReduce jobs over the data in the files in order to
organize the data and analyze it.

Thanks for any help,
Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales. (Albert
Einstein)


Re: File size and number of files considerations

2008-03-10 Thread Amar Kamat

On Mon, 10 Mar 2008, Naama Kraus wrote:


Hi,

In our system, we plan to upload data into Hadoop from external sources and
use it later on for analysis tasks. The interface to the external
repositories allows us to fetch pieces of data in chunks. E.g. get n records
at a time. Records are relatively small, though the overall amount of data
is assumed to be large. For each repository, we fetch pieces of data in a
serial manner. Number of repositories is small (few of them).

My first step is to put the data in plain files in HDFS. My question is what
file sizes are optimal to use. Many small files (to the extreme of one
record per file)? I guess not. A few huge files, each holding all data of the
same type? Or maybe put each chunk we get in a separate file, and close it
right after the chunk is uploaded?

I think it should be based more on the size of the data you want to 
process in a map, which I think here is the chunk size, no?
The larger the file, the fewer the replicas, and hence the more network transfers 
in the case of more maps. With a smaller file size the NN will be the bottleneck, 
but you will end up having more replicas for each map task and hence more locality.

Amar

How would HDFS perform best, with a few large files or with more, smaller files? As
I wrote, we plan to run MapReduce jobs over the data in the files in order to
organize the data and analyze it.

Thanks for any help,
Naama




What's difference between Objectgrid and Hadoop MapReduce

2008-03-10 Thread Jian Shu
Hi,

I found that ObjectGrid is very similar to Hadoop MapReduce. Could anyone help me
compare them? I know Hadoop provides not only MapReduce but also HDFS.  If
ObjectGrid can do the same thing as MapReduce, why do we still re-invent
it?


-- 

My Blog:
http://trip-in-life.spaces.live.com


Re: What's difference between Objectgrid and Hadoop MapReduce

2008-03-10 Thread Bob Futrelle
It is true that if you Google on

objectgrid mapreduce

you get some documents that describe how map and reduce can be
implemented. They also support replication, etc.  But is it open
source?  Looks more like an IBM product to me, used as a back end for
webapps.

 -- Bob Futrelle


On Mon, Mar 10, 2008 at 10:03 AM, Jian Shu [EMAIL PROTECTED] wrote:
 Hi,

  I found that ObjectGrid is very similar to Hadoop MapReduce. Could anyone help me
  compare them? I know Hadoop provides not only MapReduce but also HDFS.  If
  ObjectGrid can do the same thing as MapReduce, why do we still re-invent
  it?


  --
  
  My Blog:
  http://trip-in-life.spaces.live.com



A contrib package to build/update a Lucene index

2008-03-10 Thread Ning Li
Hi,

Is there any interest in a contrib package to build/update a Lucene index?

I should have asked the question before creating the JIRA issue and
attaching the patch. In any case, more details can be found at

https://issues.apache.org/jira/browse/HADOOP-2951


Regards,
Ning


Re: dynamically adding slaves to hadoop cluster

2008-03-10 Thread tjohn



Mafish Liu wrote:
 
 On Mon, Mar 10, 2008 at 9:47 AM, Mafish Liu [EMAIL PROTECTED] wrote:
 
 You should do the following steps:
 1. Have hadoop deployed on the new node with the same directory structure
 and configuration.
 2. Just run $HADOOP_HOME/bin/hadoop datanode and tasktracker.
 
 Addition: do not run bin/hadoop namenode -format before you run the
 datanode, or you will get an error like "Incompatible namespaceIDs ...".
 


 The datanode and tasktracker will contact the namenode and jobtracker specified
 in the hadoop configuration file automatically and finish adding the new node
 to the cluster.


 On Mon, Mar 10, 2008 at 4:56 AM, Aaron Kimball [EMAIL PROTECTED]
 wrote:

  Yes. You should have the same hadoop-site across all your slaves. They
  will need to know the DNS name for the namenode and jobtracker.
 
  - Aaron
 
  tjohn wrote:
  
   Mahadev Konar wrote:
  
   I believe (as far as I remember) you should be able to add the node by
   bringing up the datanode or tasktracker on the remote machine. The
   Namenode or the jobtracker (I think) does not check for the nodes in the
   slaves file. The slaves file is just used to start up all the daemons by
   sshing to all the nodes in the slaves file during startup. So you
   should just be able to start up the datanode pointing to the correct
   namenode and it should work.
  
   Regards
   Mahadev
  
  
  
  
   Sorry for my ignorance... To make a datanode/tasktracker point to the
   namenode, what should I do? Do I have to edit the hadoop-site.xml? Thanks
  
   John
  
  
 



 --
 [EMAIL PROTECTED]
 Institute of Computing Technology, Chinese Academy of Sciences, Beijing.

 
 
 
 -- 
 [EMAIL PROTECTED]
 Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
 
 

Thanks a lot guys! It worked fine and it was exactly what I was looking for.
Best wishes, 
John.
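
For the record, the two hadoop-site.xml properties a new slave needs are the
ones that point at the master; a minimal sketch, with placeholder host names
and ports:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:54310</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:54311</value>
</property>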

-- 
View this message in context: 
http://www.nabble.com/dynamically-adding-slaves-to-hadoop-cluster-tp15943388p15950796.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: dynamically adding slaves to hadoop cluster

2008-03-10 Thread Jason Venner
We have done this, and it works well. The one downside is that 
stop-dfs.sh and stop-mapred.sh (and of course stop-all.sh) don't seem 
to control the hand-started datanodes/tasktrackers. I am assuming it is 
because the pid files haven't been written to the pid directory, but I have 
not investigated.


Is there a /proper/ way to bring up the processes on the slave node so 
that the master will recognize them at *stop* time?


tjohn wrote:


Mafish Liu wrote:
  

On Mon, Mar 10, 2008 at 9:47 AM, Mafish Liu [EMAIL PROTECTED] wrote:



You should do the following steps:
1. Have hadoop deployed on the new node with the same directory structure
and configuration.
2. Just run $HADOOP_HOME/bin/hadoop datanode and tasktracker.

Addition: do not run bin/hadoop namenode -format before you run the
datanode, or you will get an error like "Incompatible namespaceIDs ...".



The datanode and tasktracker will contact the namenode and jobtracker specified
in the hadoop configuration file automatically and finish adding the new node
to the cluster.


On Mon, Mar 10, 2008 at 4:56 AM, Aaron Kimball [EMAIL PROTECTED]
wrote:

  

Yes. You should have the same hadoop-site across all your slaves. They
will need to know the DNS name for the namenode and jobtracker.

- Aaron

tjohn wrote:


Mahadev Konar wrote:

  

I believe (as far as I remember) you should be able to add the node by
bringing up the datanode or tasktracker on the remote machine. The
Namenode or the jobtracker (I think) does not check for the nodes in the
slaves file. The slaves file is just used to start up all the daemons by
sshing to all the nodes in the slaves file during startup. So you
should just be able to start up the datanode pointing to the correct
namenode and it should work.

Regards
Mahadev





Sorry for my ignorance... To make a datanode/tasktracker point to the
namenode, what should I do? Do I have to edit the hadoop-site.xml? Thanks

John


  


--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.

  


--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.





Thanks a lot guys! It worked fine and it was exactly what I was looking for.
Best wishes, 
John.


  

--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested


Re: File size and number of files considerations

2008-03-10 Thread Ted Dunning

Amar's comments are a little strange.

Replication occurs at the block level, not the file level.  Storing data in
a small number of large files or a large number of small files will have
less than a factor of two effect on number of replicated blocks if the small
files are 64MB.  Files smaller than that will hurt performance due to seek
costs.

To address Naama's question, you should consolidate your files so that you
have files of at least 64 MB and preferably a bit larger than that.  This
helps because it allows the reading of the files to proceed in a nice
sequential manner which can greatly increase throughput.

If consolidating these files off-line is difficult, it is easy to do in a
preliminary map-reduce step.  This will incur a one-time cost, but if you
are doing multiple passes over the data later, it will be worth it.
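
One hedged way to do that preliminary pass is with Hadoop Streaming; the paths
and reducer count below are invented, the -jobconf flag is recalled from the
streaming options of that era, and note that the shuffle will reorder records:

bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar \
  -mapper cat -reducer cat \
  -input small-files/* -output consolidated \
  -jobconf mapred.reduce.tasks=4

A handful of reducers yields a handful of large output files; if exact record
formatting or ordering matters, a small custom MapReduce job gives more control.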


On 3/10/08 3:12 AM, Amar Kamat [EMAIL PROTECTED] wrote:

 On Mon, 10 Mar 2008, Naama Kraus wrote:
 
 Hi,
 
 In our system, we plan to upload data into Hadoop from external sources and
 use it later on for analysis tasks. The interface to the external
 repositories allows us to fetch pieces of data in chunks. E.g. get n records
 at a time. Records are relatively small, though the overall amount of data
 is assumed to be large. For each repository, we fetch pieces of data in a
 serial manner. Number of repositories is small (few of them).
 
 My first step is to put the data in plain files in HDFS. My question is what
 file sizes are optimal to use. Many small files (to the extreme of one
 record per file)? I guess not. A few huge files, each holding all data of the
 same type? Or maybe put each chunk we get in a separate file, and close it
 right after the chunk is uploaded?
 
 I think it should be based more on the size of the data you want to
 process in a map, which I think here is the chunk size, no?
 The larger the file, the fewer the replicas, and hence the more network transfers
 in the case of more maps. With a smaller file size the NN will be the bottleneck,
 but you will end up having more replicas for each map task and hence more
 locality.
 Amar
 How would HDFS perform best, with a few large files or with more, smaller files? As
 I wrote, we plan to run MapReduce jobs over the data in the files in order to
 organize the data and analyze it.
 
 Thanks for any help,
 Naama
 
 



S3/EC2 setup problem: port 9001 unreachable

2008-03-10 Thread Andreas Kostyrka
Hi!

I'm trying to setup a Hadoop 0.16.0 cluster on EC2/S3. (Manually, not
using the Hadoop AMIs)

I've got the S3 based HDFS working, but I'm stumped when I try to get a
test job running:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar 
contrib/streaming/hadoop-0.16.0-streaming.jar -mapper /tmp/test.sh -reducer cat 
-input testlogs/* -output testlogs-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar17969/] [] 
/tmp/streamjob17970.jar tmpDir=null
08/03/10 14:01:28 INFO mapred.FileInputFormat: Total input paths to process : 
152
08/03/10 14:02:58 INFO streaming.StreamJob: getLocalDirs(): 
[/tmp/hadoop-hadoop/mapred/local]
08/03/10 14:02:58 INFO streaming.StreamJob: Running job: job_200803101400_0001
08/03/10 14:02:58 INFO streaming.StreamJob: To kill this job, run:
08/03/10 14:02:58 INFO streaming.StreamJob: 
/home/hadoop/hadoop-0.16.0/bin/../bin/hadoop job  
-Dmapred.job.tracker=ec2-67-202-58-97.compute-1.amazonaws.com:9001 -kill 
job_200803101400_0001
08/03/10 14:02:58 INFO streaming.StreamJob: Tracking URL: 
http://ip-10-251-75-165.ec2.internal:50030/jobdetails.jsp?jobid=job_200803101400_0001
08/03/10 14:02:59 INFO streaming.StreamJob:  map 0%  reduce 0%

Furthermore, when I try to connect port 9001 on 10.251.75.165 via telnet from 
the masterhost itself, it connects:
[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet 10.251.75.165 9001
Trying 10.251.75.165...
Connected to 10.251.75.165.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

When I try to do this from other VMs in my cluster, it just hangs. 
(tcpdump on the masterhost shows no activity for tcp port 9001):

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 9001
Trying 10.251.75.165...

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 22
Trying 10.251.75.165...
Connected to ip-10-251-75-165.ec2.internal.
Escape character is '^]'.
SSH-2.0-OpenSSH_4.3p2 Debian-9
^]
telnet> quit
Connection closed.

This is also shown when I connect port 50030, which shows 0 nodes ready to 
process the job.

Furthermore, the slaves show the following messages:
2008-03-10 15:30:11,455 INFO org.apache.hadoop.ipc.RPC: Problem connecting to 
server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001
2008-03-10 15:31:12,465 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. Already 
tried 1 time(s).
2008-03-10 15:32:13,475 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. Already 
tried 2 time(s).

Last but not least, here is my site conf:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>fs.default.name</name>
  <value>s3://lookhad</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>2DFGTTFSDFDSZU5SDSD7S5202</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>RUWgsdfsd67SFDfsdflaI9Gjzfsd8789ksd2r1PtG</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>ec2-67-202-58-97.compute-1.amazonaws.com:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
</configuration>

The masternode does not listen on localhost:
[EMAIL PROTECTED]:~/hadoop-0.16.0$ netstat -an | grep 9001
tcp0  0 10.251.75.165:9001  0.0.0.0:*   LISTEN 

Any ideas? My conclusions thus are:

1.) First, it's not a general connectivity problem, because I can connect port 
22 without any problems.
2.) OTOH, on port 9001, inside the same group, the connectivity seems to be 
limited.
3.) All AWS docs tell me that VMs in one group have no firewalls in place.

So what is happening here? Any ideas?

Andreas


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Re: dynamically adding slaves to hadoop cluster

2008-03-10 Thread Owen O'Malley


On Mar 10, 2008, at 8:22 AM, Jason Venner wrote:

Is there a /proper/ way to bring up the processes on the slave node  
so that the master will recognize them at *stop* time?


yes, you can set up the pid files by using (directly on the newly
added node!):


% bin/hadoop-daemon.sh start datanode
% bin/hadoop-daemon.sh start tasktracker

then the stop-all will know the pid to shut down. It is unfortunate
that hadoop-daemon.sh and hadoop-daemons.sh differ only in the 's'.
hadoop-daemons.sh should probably be hadoop-slave-daemons.sh or something.
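
A hedged aside on the pid question: hadoop-daemon.sh is also what writes the
pid files, and (if memory of that era serves) they land in /tmp unless
HADOOP_PID_DIR says otherwise, so pointing that somewhere stable keeps the stop
scripts working. The path below is only illustrative:

# in conf/hadoop-env.sh on the new node
export HADOOP_PID_DIR=/var/hadoop/pids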


-- Owen


Re: S3/EC2 setup problem: port 9001 unreachable

2008-03-10 Thread Andreas Kostyrka
Found it, was security group setup problem ;(

Andreas
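
For anyone hitting the same symptom: with the EC2 API tools of that era,
opening a group to itself looked roughly like the lines below (group name and
account id are placeholders; check ec2-authorize's own help for the exact
flags):

# allow all traffic between instances in the same security group
ec2-authorize my-hadoop-group -o my-hadoop-group -u 111122223333
# or, more narrowly, open just the jobtracker port to a known CIDR
ec2-authorize my-hadoop-group -P tcp -p 9001 -s 10.251.0.0/16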

Am Montag, den 10.03.2008, 16:49 +0100 schrieb Andreas Kostyrka:
 Hi!
 
 I'm trying to setup a Hadoop 0.16.0 cluster on EC2/S3. (Manually, not
 using the Hadoop AMIs)
 
 I've got the S3 based HDFS working, but I'm stumped when I try to get a
 test job running:
 
 [EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar 
 contrib/streaming/hadoop-0.16.0-streaming.jar -mapper /tmp/test.sh -reducer 
 cat -input testlogs/* -output testlogs-output
 additionalConfSpec_:null
 null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
 packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar17969/] [] 
 /tmp/streamjob17970.jar tmpDir=null
 08/03/10 14:01:28 INFO mapred.FileInputFormat: Total input paths to process : 
 152
 08/03/10 14:02:58 INFO streaming.StreamJob: getLocalDirs(): 
 [/tmp/hadoop-hadoop/mapred/local]
 08/03/10 14:02:58 INFO streaming.StreamJob: Running job: job_200803101400_0001
 08/03/10 14:02:58 INFO streaming.StreamJob: To kill this job, run:
 08/03/10 14:02:58 INFO streaming.StreamJob: 
 /home/hadoop/hadoop-0.16.0/bin/../bin/hadoop job  
 -Dmapred.job.tracker=ec2-67-202-58-97.compute-1.amazonaws.com:9001 -kill 
 job_200803101400_0001
 08/03/10 14:02:58 INFO streaming.StreamJob: Tracking URL: 
 http://ip-10-251-75-165.ec2.internal:50030/jobdetails.jsp?jobid=job_200803101400_0001
 08/03/10 14:02:59 INFO streaming.StreamJob:  map 0%  reduce 0%
 
 Furthermore, when I try to connect port 9001 on 10.251.75.165 via telnet from 
 the masterhost itself, it connects:
 [EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet 10.251.75.165 9001
 Trying 10.251.75.165...
 Connected to 10.251.75.165.
 Escape character is '^]'.
 ^]
 telnet> quit
 Connection closed.
 
 When I try to do this from other VMs in my cluster, it just hangs. 
 (tcpdump on the masterhost shows no activity for tcp port 9001):
 
 [EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 9001
 Trying 10.251.75.165...
 
 [EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 22
 Trying 10.251.75.165...
 Connected to ip-10-251-75-165.ec2.internal.
 Escape character is '^]'.
 SSH-2.0-OpenSSH_4.3p2 Debian-9
 ^]
 telnet> quit
 Connection closed.
 
 This is also shown when I connect port 50030, which shows 0 nodes ready to 
 process the job.
 
 Furthermore, the slaves show the following messages:
 2008-03-10 15:30:11,455 INFO org.apache.hadoop.ipc.RPC: Problem connecting to 
 server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001
 2008-03-10 15:31:12,465 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. 
 Already tried 1 time(s).
 2008-03-10 15:32:13,475 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. 
 Already tried 2 time(s).
 
 Last but not least, here is my site conf:
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
 
 <property>
   <name>fs.default.name</name>
   <value>s3://lookhad</value>
   <description>The name of the default file system.  A URI whose
   scheme and authority determine the FileSystem implementation.  The
   uri's scheme determines the config property (fs.SCHEME.impl) naming
   the FileSystem implementation class.  The uri's authority is used to
   determine the host, port, etc. for a filesystem.</description>
 </property>
 
 <property>
   <name>fs.s3.awsAccessKeyId</name>
   <value>2DFGTTFSDFDSZU5SDSD7S5202</value>
 </property>
 
 <property>
   <name>fs.s3.awsSecretAccessKey</name>
   <value>RUWgsdfsd67SFDfsdflaI9Gjzfsd8789ksd2r1PtG</value>
 </property>
 
 <property>
   <name>mapred.job.tracker</name>
   <value>ec2-67-202-58-97.compute-1.amazonaws.com:9001</value>
   <description>The host and port that the MapReduce job tracker runs
   at.  If "local", then jobs are run in-process as a single map
   and reduce task.
   </description>
 </property>
 </configuration>
 
 The masternode does not listen on localhost:
 [EMAIL PROTECTED]:~/hadoop-0.16.0$ netstat -an | grep 9001
 tcp0  0 10.251.75.165:9001  0.0.0.0:*   LISTEN
  
 
 Any ideas? My conclusions thus are:
 
 1.) First, it's not a general connectivity problem, because I can connect 
 port 22 without any problems.
 2.) OTOH, on port 9001, inside the same group, the connectivity seems to be 
 limited.
 3.) All AWS docs tell me that VMs in one group have no firewalls in place.
 
 So what is happening here? Any ideas?
 
 Andreas


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Re: S3/EC2 setup problem: port 9001 unreachable

2008-03-10 Thread Chris K Wensel

Andreas

Here are some moderately useful notes on using EC2/S3, mostly learned  
leveraging Hadoop. The groups can't see themselves issue is listed  
grin.


http://www.manamplified.org/archives/2008/03/notes-on-using-ec2-s3.html

enjoy
ckw

On Mar 10, 2008, at 9:51 AM, Andreas Kostyrka wrote:


Found it, was security group setup problem ;(

Andreas

Am Montag, den 10.03.2008, 16:49 +0100 schrieb Andreas Kostyrka:

Hi!

I'm trying to setup a Hadoop 0.16.0 cluster on EC2/S3. (Manually, not
using the Hadoop AMIs)

I've got the S3 based HDFS working, but I'm stumped when I try to  
get a

test job running:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar  
contrib/streaming/hadoop-0.16.0-streaming.jar -mapper /tmp/test.sh - 
reducer cat -input testlogs/* -output testlogs-output

additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar17969/] [] /tmp/ 
streamjob17970.jar tmpDir=null
08/03/10 14:01:28 INFO mapred.FileInputFormat: Total input paths to  
process : 152
08/03/10 14:02:58 INFO streaming.StreamJob: getLocalDirs(): [/tmp/ 
hadoop-hadoop/mapred/local]
08/03/10 14:02:58 INFO streaming.StreamJob: Running job:  
job_200803101400_0001

08/03/10 14:02:58 INFO streaming.StreamJob: To kill this job, run:
08/03/10 14:02:58 INFO streaming.StreamJob: /home/hadoop/ 
hadoop-0.16.0/bin/../bin/hadoop job  - 
Dmapred.job.tracker=ec2-67-202-58-97.compute-1.amazonaws.com:9001 - 
kill job_200803101400_0001

08/03/10 14:02:58 INFO streaming.StreamJob: Tracking URL: 
http://ip-10-251-75-165.ec2.internal:50030/jobdetails.jsp?jobid=job_200803101400_0001
08/03/10 14:02:59 INFO streaming.StreamJob:  map 0%  reduce 0%

Furthermore, when I try to connect port 9001 on 10.251.75.165 via  
telnet from the masterhost itself, it connects:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet 10.251.75.165 9001
Trying 10.251.75.165...
Connected to 10.251.75.165.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

When I try to do this from other VMs in my cluster, it just hangs.
(tcpdump on the masterhost shows no activity for tcp port 9001):

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet  
ip-10-251-75-165.ec2.internal 9001

Trying 10.251.75.165...

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet  
ip-10-251-75-165.ec2.internal 22

Trying 10.251.75.165...
Connected to ip-10-251-75-165.ec2.internal.
Escape character is '^]'.
SSH-2.0-OpenSSH_4.3p2 Debian-9
^]
telnet> quit
Connection closed.

This is also shown when I connect port 50030, which shows 0 nodes  
ready to process the job.


Furthermore, the slaves show the following messages:
2008-03-10 15:30:11,455 INFO org.apache.hadoop.ipc.RPC: Problem  
connecting to server: ec2-67-202-58-97.compute-1.amazonaws.com/ 
10.251.75.165:9001
2008-03-10 15:31:12,465 INFO org.apache.hadoop.ipc.Client: Retrying  
connect to server: ec2-67-202-58-97.compute-1.amazonaws.com/ 
10.251.75.165:9001. Already tried 1 time(s).
2008-03-10 15:32:13,475 INFO org.apache.hadoop.ipc.Client: Retrying  
connect to server: ec2-67-202-58-97.compute-1.amazonaws.com/ 
10.251.75.165:9001. Already tried 2 time(s).


Last but not least, here is my site conf:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>fs.default.name</name>
  <value>s3://lookhad</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>2DFGTTFSDFDSZU5SDSD7S5202</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>RUWgsdfsd67SFDfsdflaI9Gjzfsd8789ksd2r1PtG</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>ec2-67-202-58-97.compute-1.amazonaws.com:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
</configuration>

The masternode does not listen on localhost:
[EMAIL PROTECTED]:~/hadoop-0.16.0$ netstat -an | grep 9001
tcp0  0 10.251.75.165:9001  0.0.0.0:*
LISTEN


Any ideas? My conclusions thus are:

1.) First, it's not a general connectivity problem, because I can  
connect port 22 without any problems.
2.) OTOH, on port 9001, inside the same group, the connectivity  
seems to be limited.
3.) All AWS docs tell me that VMs in one group have no firewalls in  
place.


So what is happening here? Any ideas?

Andreas


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





File Per Column in Hadoop

2008-03-10 Thread Richard K. Turner

I have found that storing each column in its own gzip file can really speed up 
processing time on arbitrary subsets of columns.  For example suppose I have 
two CSV files called csv_file1.gz and csv_file2.gz.  I can create a file for 
each column as follows :

   csv_file1/col1.gz
   csv_file1/col2.gz
   csv_file1/col3.gz
 .
 .
 .
   csv_file1/colN.gz
   csv_file2/col1.gz
   csv_file2/col2.gz
   csv_file2/col3.gz
 .
 .
 .
   csv_file2/colN.gz


I would like to use this approach when writing MapReduce jobs in Hadoop.  
In order to do this I think I would need to write an input format, which I can 
look into.  However, I want to avoid the situation where a map task reads 
column files from different nodes.  To avoid this situation, all column files 
derived from the same CSV file must be co-located on the same node (or nodes, if 
replication is enabled).  So for my example I would like to ask HDFS to keep 
all files in dir csv_file1 together on the same node(s).  I would also do the 
same for dir csv_file2.  Does anyone know how to do this in Hadoop?

Thanks,

Keith


Re: File Per Column in Hadoop

2008-03-10 Thread Ted Dunning

Have you looked at hbase?  It looks like you are trying to reimplement a
bunch of it.


On 3/10/08 11:01 AM, Richard K. Turner [EMAIL PROTECTED] wrote:

 ... [storing data in columns is nice] ... I would also do the same for dir
csv_file2.  Does anyone know how to do this
 in Hadoop?



unsubscribe

2008-03-10 Thread Boxiong Ding





  



Hadoop Quickstart page

2008-03-10 Thread Jason Rennie
I just ran through this as a new user and had trouble with the JAVA_HOME
setting.  Per the instructions, I had JAVA_HOME set appropriately (via my
.bashrc), but not in conf/hadoop-env.sh.  It would be good if item 1 under
"Required Software" specified where JAVA_HOME should be set.

http://hadoop.apache.org/core/docs/current/quickstart.html
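
For anyone else who trips on this, the line in question lives in
conf/hadoop-env.sh; the JDK path below is only an example:

# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun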

Cheers,

Jason

P.S. Very nice that there is a quick start like this.  So many projects lack
something like this to get you started...


How to compile fuse-dfs

2008-03-10 Thread xavier.quintuna
Hi everybody,

I'm trying to compile fuse-dfs but I am having problems. I don't have a lot
of experience with C++.
I would like to know:
Is there a clear readme file with instructions on how to compile and install
fuse-dfs?
Do I need to replace fuse_dfs.c with the one in
fuse-dfs/src/fuse_dfs.c?
Do I need to set up a different flag if I'm using an i386 or 86 machine?
Which one, and where?
Which makefile do I need to use to compile the code?



Thanks 

Xavier





RE: What's the best way to get to a single key?

2008-03-10 Thread Xavier Stevens
I was thinking because it would be easier to search a single-index.
Unless I don't have to worry and hadoop searches all my indexes at the
same time.  Is this the case?

-Xavier
 

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 3:45 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:
 Thanks for everything so far.  It has been really helpful.  I have one

 more question.  Is there a way to merge MapFile index/data files?

No.

To append text files you can use 'bin/hadoop fs -getmerge'.

To merge sorted SequenceFiles (like MapFile/index files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/Sequ
enceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.had
oop.fs.Path,%20boolean)

But this doesn't generate a MapFile.
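
A hedged sketch of that merge call, with paths and key/value classes invented
for illustration (the Sorter constructor is recalled from the 0.16-era API, so
double-check it against the javadoc above); as noted, the result is a
SequenceFile, not a MapFile:

// uses org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path},
// org.apache.hadoop.io.{SequenceFile, Text}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path[] inputs = new Path[] {
    new Path("out/part-00000/data"),   // the sorted data file inside each MapFile dir
    new Path("out/part-00001/data")
};
SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
sorter.merge(inputs, new Path("out/merged.seq"), false);   // false = keep the inputs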

Why is a single file preferable?

Doug




Re: Hadoop Quickstart page

2008-03-10 Thread Arun C Murthy


On Mar 10, 2008, at 3:18 PM, Jason Rennie wrote:


I just ran through this as a new user and had trouble with the JAVA_HOME
setting.  Per the instructions, I had JAVA_HOME set appropriately (via my
.bashrc), but not in conf/hadoop-env.sh.  It would be good if item 1 under
"Required Software" specified where JAVA_HOME should be set.

http://hadoop.apache.org/core/docs/current/quickstart.html



Jason - it is specified a bit lower down in the 'Download' section.  
Point taken, we should clarify it.


Do you want to go ahead and file a documentation request
(https://issues.apache.org/jira/secure/CreateIssue!default.jspa) ? Thanks!


Arun



Cheers,

Jason

P.S. Very nice that there is a quick start like this.  So many  
projects lack

something like this to get you started...




RE: Does Hadoop Honor Reserved Space?

2008-03-10 Thread Joydeep Sen Sarma
Filed https://issues.apache.org/jira/browse/HADOOP-2991

-Original Message-
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 12:56 PM
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Cc: Pete Wyckoff
Subject: RE: Does Hadoop Honor Reserved Space?

folks - Jimmy is right - as we have unfortunately hit it as well:

https://issues.apache.org/jira/browse/HADOOP-1463 caused a regression.
we have left some comments on the bug - but can't reopen it.

this is going to be affecting all 0.15 and 0.16 deployments!


-Original Message-
From: Hairong Kuang [mailto:[EMAIL PROTECTED]
Sent: Thu 3/6/2008 2:01 PM
To: core-user@hadoop.apache.org
Subject: Re: Does Hadoop Honor Reserved Space?
 
In addition to the version, could you please send us a copy of the
datanode
report by running the command bin/hadoop dfsadmin -report?

Thanks,
Hairong


On 3/6/08 11:56 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

 but intermediate data is stored in a different directory from dfs/data
 (something like mapred/local by default i think).
 
 what version are u running?
 
 
 -Original Message-
 From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
 Sent: Thu 3/6/2008 10:14 AM
 To: core-user@hadoop.apache.org
 Subject: RE: Does Hadoop Honor Reserved Space?
  
 I've run into a similar issue in the past. From what I understand,
this
 parameter only controls the HDFS space usage. However, the
intermediate data
 in
 the map reduce job is stored on the local file system (not HDFS) and
is not
 subject to this configuration.
 
 In the past I have used mapred.local.dir.minspacekill and
 mapred.local.dir.minspacestart to control the amount of space that is
 allowable
 for use by this temporary data.
 
 Not sure if that is the best approach though, so I'd love to hear what
other
 people have done. In your case, you have a map-red job that will
consume too
 much space (without setting a limit, you didn't have enough disk
capacity for
 the job), so looking at mapred.output.compress and
mapred.compress.map.output
 might be useful to decrease the job's disk requirements.
 
 --Ash
 
 -Original Message-
 From: Jimmy Wan [mailto:[EMAIL PROTECTED]
 Sent: Thursday, March 06, 2008 9:56 AM
 To: core-user@hadoop.apache.org
 Subject: Does Hadoop Honor Reserved Space?
 
 I've got 2 datanodes setup with the following configuration parameter:
 <property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
 </property>
 
 Both are housed on 800GB volumes, so I thought this would keep about
half
 the volume free for non-HDFS usage.
 
 After some long running jobs last night, both disk volumes were
completely
 filled. The bulk of the data was in:
 ${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data
 
 This is running as the user hadoop.
 
 Am I interpreting these parameters incorrectly?
 
 I noticed this issue, but it is marked as closed:
 http://issues.apache.org/jira/browse/HADOOP-2549





RE: What's the best way to get to a single key?

2008-03-10 Thread Xavier Stevens
So I read some more through the Javadocs.  I had 11 reducers on my original job 
leaving me 11 MapFile directories.  I am passing in their parent directory here 
as outDir.

MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys, outDir, defaults);
Partitioner part = (Partitioner)ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
Text entryValue = (Text)MapFileOutputFormat.getEntry(readers, part, new Text(mykey), null);
System.out.println("My Entry's Value: ");
System.out.println(entryValue.toString());

But I am getting an exception:

Exception in thread "main" java.lang.ArithmeticException: / by zero
at 
org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:35)
at 
org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:85)
at mypackage.MyClass.main(ProfileReader.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

I am assuming I am doing something wrong, but I'm not sure what it is yet.  Any 
ideas?
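
One hedged guess, not from the thread: getEntry picks a reader via
partitioner.getPartition(key, value, readers.length), so a zero-length readers
array - for example if outDir does not actually contain the part-xxxxx MapFile
directories - produces exactly this divide by zero. A one-line check before the
call:

System.out.println("readers found: " + readers.length);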


-Xavier


-Original Message-
From: Xavier Stevens
Sent: Mon 3/10/2008 3:49 PM
To: core-user@hadoop.apache.org
Subject: RE: What's the best way to get to a single key?
 
I was thinking because it would be easier to search a single-index.
Unless I don't have to worry and hadoop searches all my indexes at the
same time.  Is this the case?

-Xavier
 

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 3:45 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:
 Thanks for everything so far.  It has been really helpful.  I have one

 more question.  Is there a way to merge MapFile index/data files?

No.

To append text files you can use 'bin/hadoop fs -getmerge'.

To merge sorted SequenceFiles (like MapFile/index files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/Sequ
enceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[], org.apache.had
oop.fs.Path, boolean)

But this doesn't generate a MapFile.

Why is a single file preferable?

Doug




 


Re: Does Hadoop Honor Reserved Space?

2008-03-10 Thread Hairong Kuang
I think you have a misunderstanding of the reserved parameter. As I
commented on hadoop-1463, remember that dfs.du.reserve is the space for
non-dfs usage, including the space for map/reduce, other application, fs
meta-data etc. In your case since /usr already takes 45GB, it far exceeds
the reserved limit 1G. You should set the reserved space to be 50G.
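
Concretely, that suggestion would look like the following (53687091200 is just
50 * 2^30, shown for illustration):

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>53687091200</value>
  <description>Reserved space in bytes per volume (50 GB here).</description>
</property>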

Hairong


On 3/10/08 4:54 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

 Filed https://issues.apache.org/jira/browse/HADOOP-2991
 
 -Original Message-
 From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 10, 2008 12:56 PM
 To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
 Cc: Pete Wyckoff
 Subject: RE: Does Hadoop Honor Reserved Space?
 
 folks - Jimmy is right - as we have unfortunately hit it as well:
 
 https://issues.apache.org/jira/browse/HADOOP-1463 caused a regression.
 we have left some comments on the bug - but can't reopen it.
 
 this is going to be affecting all 0.15 and 0.16 deployments!
 
 
 -Original Message-
 From: Hairong Kuang [mailto:[EMAIL PROTECTED]
 Sent: Thu 3/6/2008 2:01 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Does Hadoop Honor Reserved Space?
  
 In addition to the version, could you please send us a copy of the
 datanode
 report by running the command bin/hadoop dfsadmin -report?
 
 Thanks,
 Hairong
 
 
 On 3/6/08 11:56 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
 
 but intermediate data is stored in a different directory from dfs/data
 (something like mapred/local by default i think).
 
 what version are u running?
 
 
 -Original Message-
 From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
 Sent: Thu 3/6/2008 10:14 AM
 To: core-user@hadoop.apache.org
 Subject: RE: Does Hadoop Honor Reserved Space?
  
 I've run into a similar issue in the past. From what I understand,
 this
 parameter only controls the HDFS space usage. However, the
 intermediate data
 in
 the map reduce job is stored on the local file system (not HDFS) and
 is not
 subject to this configuration.
 
 In the past I have used mapred.local.dir.minspacekill and
 mapred.local.dir.minspacestart to control the amount of space that is
 allowable
 for use by this temporary data.
 
 Not sure if that is the best approach though, so I'd love to hear what
 other
 people have done. In your case, you have a map-red job that will
 consume too
 much space (without setting a limit, you didn't have enough disk
 capacity for
 the job), so looking at mapred.output.compress and
 mapred.compress.map.output
 might be useful to decrease the job's disk requirements.
 
 --Ash
 
 -Original Message-
 From: Jimmy Wan [mailto:[EMAIL PROTECTED]
 Sent: Thursday, March 06, 2008 9:56 AM
 To: core-user@hadoop.apache.org
 Subject: Does Hadoop Honor Reserved Space?
 
 I've got 2 datanodes setup with the following configuration parameter:
 <property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
 </property>
 
 Both are housed on 800GB volumes, so I thought this would keep about
 half
 the volume free for non-HDFS usage.
 
 After some long running jobs last night, both disk volumes were
 completely
 filled. The bulk of the data was in:
 ${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data
 
 This is running as the user hadoop.
 
 Am I interpreting these parameters incorrectly?
 
 I noticed this issue, but it is marked as closed:
 http://issues.apache.org/jira/browse/HADOOP-2549
 
 
 



Re: Hadoop Quickstart page

2008-03-10 Thread Jason Rennie
On Mon, Mar 10, 2008 at 7:04 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

 Jason - it is specified a bit lower down in the 'Download' section.
 Point taken, we should clarify it.


Ah, I see.  I probably missed that since I thought I had already set JAVA_HOME
properly.


 Do you want to go ahead and file a documentation request
 (https://issues.apache.org/jira/secure/CreateIssue!default.jspa) ? Thanks!


Just did that.  Thanks for the pointer.

Jason


RE: Does Hadoop Honor Reserved Space?

2008-03-10 Thread Joydeep Sen Sarma
I have left some comments behind on the jira.

We could argue over what's the right thing to do (and we will on the
Jira) - but the higher level problem is that this is another case where
backwards compatibility with existing semantics of this option was not
carried over. Neither was there any notification to admins about this
change. The change notes just do not convey the import of this change to
existing deployments (incidentally 1463 was classified as 'Bug Fix' -
not that putting under 'Incompatible Fix' would have helped imho).

Would request the board/committers to consider setting up something
along the lines of:

1. have something better than Change Notes to convey interface changes
2. a field in the JIRA that marks it out as important from interface
change point of view (with notes on what's changing). This could be used
to auto-populate #1
3. Some way of auto-subscribing to bugs that are causing interface
changes (even an email filter on the jira mails would do).

As the Hadoop user base keeps growing - and gets used for 'production' tasks
- I think it's absolutely essential that users/admins can keep in tune
with changes that affect their deployments. Otherwise, any organization
other than Yahoo would have a tough time upgrading.

(I am new to open-source - but surely this has been solved before?)

Joydeep

-Original Message-
From: Hairong Kuang [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 5:17 PM
To: core-user@hadoop.apache.org
Subject: Re: Does Hadoop Honor Reserved Space?

I think you have a misunderstanding of the reserved parameter. As I
commented on hadoop-1463, remember that dfs.du.reserve is the space for
non-dfs usage, including the space for map/reduce, other application, fs
meta-data etc. In your case since /usr already takes 45GB, it far
exceeds
the reserved limit 1G. You should set the reserved space to be 50G.

Hairong


On 3/10/08 4:54 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

 Filed https://issues.apache.org/jira/browse/HADOOP-2991
 
 -Original Message-
 From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 10, 2008 12:56 PM
 To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
 Cc: Pete Wyckoff
 Subject: RE: Does Hadoop Honor Reserved Space?
 
 folks - Jimmy is right - as we have unfortunately hit it as well:
 
 https://issues.apache.org/jira/browse/HADOOP-1463 caused a regression.
 we have left some comments on the bug - but can't reopen it.
 
 this is going to be affecting all 0.15 and 0.16 deployments!
 
 
 -Original Message-
 From: Hairong Kuang [mailto:[EMAIL PROTECTED]
 Sent: Thu 3/6/2008 2:01 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Does Hadoop Honor Reserved Space?
  
 In addition to the version, could you please send us a copy of the
 datanode
 report by running the command bin/hadoop dfsadmin -report?
 
 Thanks,
 Hairong
 
 
 On 3/6/08 11:56 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
 
 but intermediate data is stored in a different directory from
dfs/data
 (something like mapred/local by default i think).
 
 what version are u running?
 
 
 -Original Message-
 From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
 Sent: Thu 3/6/2008 10:14 AM
 To: core-user@hadoop.apache.org
 Subject: RE: Does Hadoop Honor Reserved Space?
  
 I've run into a similar issue in the past. From what I understand,
 this
 parameter only controls the HDFS space usage. However, the
 intermediate data
 in
 the map reduce job is stored on the local file system (not HDFS) and
 is not
 subject to this configuration.
 
 In the past I have used mapred.local.dir.minspacekill and
 mapred.local.dir.minspacestart to control the amount of space that is
 allowable
 for use by this temporary data.
 
 Not sure if that is the best approach though, so I'd love to hear
what
 other
 people have done. In your case, you have a map-red job that will
 consume too
 much space (without setting a limit, you didn't have enough disk
 capacity for
 the job), so looking at mapred.output.compress and
 mapred.compress.map.output
 might be useful to decrease the job's disk requirements.
 
 --Ash
 
 -Original Message-
 From: Jimmy Wan [mailto:[EMAIL PROTECTED]
 Sent: Thursday, March 06, 2008 9:56 AM
 To: core-user@hadoop.apache.org
 Subject: Does Hadoop Honor Reserved Space?
 
 I've got 2 datanodes setup with the following configuration
parameter:
 <property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
 </property>
 
 Both are housed on 800GB volumes, so I thought this would keep about
 half
 the volume free for non-HDFS usage.
 
 After some long running jobs last night, both disk volumes were
 completely
 filled. The bulk of the data was in:
 ${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data
 
 This is running as the user hadoop.
 
 Am I interpreting these parameters incorrectly?
 
 I noticed this issue, but 

zombie data nodes, not alive but not dead

2008-03-10 Thread Tim Nelson
I've got to be doing something stupid, because I can't find any mention of 
others having this problem. Here's what's happening. I had a cluster of 
nine nodes (1 namenode and 8 datanodes) running the 0.15.3 release. I've 
been running the mapred samples and reformatting filesystems, just 
getting comfortable with the software. When I upgraded to the 0.16.0 release 
I reformatted (mke2fs) all of my data partitions (including the namenode 
data). I ran a hadoop namenode -format, which ran fine. Then I brought 
them back up; the only slave to connect to the master was the master 
itself, acting as a datanode. The dfs daemon was started on the slave 
nodes but it just doesn't seem to connect to the master.


I know that the slaves are doing *something* with the master, because if 
I start them before the namenode is running then I get lots of log 
messages about attempting to reconnect. Below is my site config and logs 
from the namenode and a zombie datanode.
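
A hedged aside, not from the thread: since the slaves clearly know where the
master is (they log reconnect attempts when it is down), a quick check from a
slave is whether head00 resolves consistently and whether the namenode port
from the config below is reachable:

telnet head00 54310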


 hadoop-site.xml (same across all nodes) 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/sda1/hadoop-datastore-0.15.3/hadoop-${user.name}</value>
  <description>...</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://head00:54310</value>
  <description>...</description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>head00:54311</value>
  <description>...</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>...</description>
</property>
</configuration>


 namenode log 
2008-03-10 19:32:53,186 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = head00/192.168.16.48
STARTUP_MSG:   args = []
/
2008-03-10 19:32:54,260 INFO org.apache.hadoop.dfs.NameNode: Namenode up 
at: head00/192.168.16.48:54310
2008-03-10 19:32:54,267 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-03-10 19:32:54,628 INFO org.apache.hadoop.dfs.StateChange: STATE* 
Network topology has 0 racks and 0 datanodes
2008-03-10 19:32:54,629 INFO org.apache.hadoop.dfs.StateChange: STATE* 
UnderReplicatedBlocks has 0 blocks
2008-03-10 19:32:55,421 INFO org.mortbay.util.Credential: Checking 
Resource aliases
2008-03-10 19:32:56,048 INFO org.mortbay.http.HttpServer: Version 
Jetty/5.1.4
2008-03-10 19:32:56,051 INFO org.mortbay.util.Container: Started 
HttpContext[/static,/static]
2008-03-10 19:32:56,052 INFO org.mortbay.util.Container: Started 
HttpContext[/logs,/logs]
2008-03-10 19:32:57,493 INFO org.mortbay.util.Container: Started 
[EMAIL PROTECTED]
2008-03-10 19:32:57,826 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-03-10 19:32:58,112 INFO org.mortbay.http.SocketListener: Started 
SocketListener on 0.0.0.0:50070
2008-03-10 19:32:58,112 INFO org.mortbay.util.Container: Started 
[EMAIL PROTECTED]
2008-03-10 19:32:58,112 INFO org.apache.hadoop.fs.FSNamesystem: 
Web-server up at: 50070
2008-03-10 19:32:58,116 INFO org.apache.hadoop.ipc.Server: IPC Server 
listener on 54310: starting
2008-03-10 19:32:58,139 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 0 on 54310: starting
2008-03-10 19:32:58,140 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 1 on 54310: starting

...
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 9 on 54310: starting
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 4 on 54310: starting
2008-03-10 19:33:02,684 INFO org.apache.hadoop.dfs.StateChange: BLOCK* 
NameSystem.registerDatanode: node registration from 192.168.16.48:50010 
storage DS-437400207-192.168.16.48-50010-1205199182672   * this is 
the namenode connecting to itself as a data node *
2008-03-10 19:33:02,693 INFO org.apache.hadoop.net.NetworkTopology: 
Adding a new node: /default-rack/192.168.16.48:50010
2008-03-10 19:38:01,248 INFO org.apache.hadoop.fs.FSNamesystem: Roll 
Edit Log from 192.168.16.48
2008-03-10 19:38:01,249 INFO org.apache.hadoop.fs.FSNamesystem: Number 
of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0 
SyncTimes(ms): 0
2008-03-10 19:43:02,227 INFO org.apache.hadoop.fs.FSNamesystem: Roll 
Edit Log from 192.168.16.48
2008-03-10 19:48:02,374 INFO org.apache.hadoop.fs.FSNamesystem: Roll 
Edit Log from 192.168.16.48


* datanode log *
2008-03-10 20:30:31,392 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = node05/192.168.16.55
STARTUP_MSG:   args = []
/
2008-03-10 20:30:31,786 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-03-10 20:30:32,000 INFO 

Re: How to compile fuse-dfs

2008-03-10 Thread Pete Wyckoff

Hi Xavier,

If you run ./bootstrap.sh does it not create a Makefile for you?  There is a
bug in the Makefile that hardcodes it to amd64. I will look at this.
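
A guess at the build sequence, based only on what this thread mentions
(bootstrap.sh generating the Makefile); treat it as a sketch rather than an
official recipe:

cd fuse-dfs
./bootstrap.sh    # should generate the Makefile
make              # adjust the hardcoded amd64 flags first if you are on another arch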

What kernel are you using and what HW?

--pete


On 3/10/08 2:23 PM, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:

 Hi everybody,
 
 I'm trying to compile fuse-dfs but I am having problems. I don't have a lot
 of experience with C++.
 I would like to know:
 Is there a clear readme file with instructions on how to compile and install
 fuse-dfs?
 Do I need to replace fuse_dfs.c with the one in
 fuse-dfs/src/fuse_dfs.c?
 Do I need to set up a different flag if I'm using an i386 or 86 machine?
 Which one, and where?
 Which makefile do I need to use to compile the code?
 
 
 
 Thanks 
 
 Xavier
 
 
 



Re: Does Hadoop Honor Reserved Space?

2008-03-10 Thread Pete Wyckoff

+1

(obviously :))


On 3/10/08 5:26 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

 I have left some comments behind on the jira.
 
 We could argue over what's the right thing to do (and we will on the
 Jira) - but the higher level problem is that this is another case where
 backwards compatibility with existing semantics of this option was not
 carried over. Neither was there any notification to admins about this
 change. The change notes just do not convey the import of this change to
 existing deployments (incidentally 1463 was classified as 'Bug Fix' -
 not that putting under 'Incompatible Fix' would have helped imho).
 
 Would request the board/committers to consider setting up something
 along the lines of:
 
 1. have something better than Change Notes to convey interface changes
 2. a field in the JIRA that marks it out as important from interface
 change point of view (with notes on what's changing). This could be used
 to auto-populate #1
 3. Some way of auto-subscribing to bugs that are causing interface
 changes (even an email filter on the jira mails would do).
 
 As the Hadoop user base keeps growing - and gets used for 'production' tasks
 - I think it's absolutely essential that users/admins can keep in tune
 with changes that affect their deployments. Otherwise, any organization
 other than Yahoo would have a tough time upgrading.
 
 (I am new to open-source - but surely this has been solved before?)
 
 Joydeep
 
 -Original Message-
 From: Hairong Kuang [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 10, 2008 5:17 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Does Hadoop Honor Reserved Space?
 
 I think you have a misunderstanding of the reserved parameter. As I
 commented on hadoop-1463, remember that dfs.du.reserve is the space for
 non-dfs usage, including the space for map/reduce, other application, fs
 meta-data etc. In your case since /usr already takes 45GB, it far
 exceeds
 the reserved limit 1G. You should set the reserved space to be 50G.
 
 Hairong
 
 
 On 3/10/08 4:54 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
 
 Filed https://issues.apache.org/jira/browse/HADOOP-2991
 
 -Original Message-
 From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 10, 2008 12:56 PM
 To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
 Cc: Pete Wyckoff
 Subject: RE: Does Hadoop Honor Reserved Space?
 
 folks - Jimmy is right - as we have unfortunately hit it as well:
 
 https://issues.apache.org/jira/browse/HADOOP-1463 caused a regression.
 we have left some comments on the bug - but can't reopen it.
 
 this is going to be affecting all 0.15 and 0.16 deployments!
 
 
 -Original Message-
 From: Hairong Kuang [mailto:[EMAIL PROTECTED]
 Sent: Thu 3/6/2008 2:01 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Does Hadoop Honor Reserved Space?
  
 In addition to the version, could you please send us a copy of the
 datanode
 report by running the command bin/hadoop dfsadmin -report?
 
 Thanks,
 Hairong
 
 
 On 3/6/08 11:56 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
 
 but intermediate data is stored in a different directory from
 dfs/data
 (something like mapred/local by default i think).
 
 what version are u running?
 
 
 -Original Message-
 From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
 Sent: Thu 3/6/2008 10:14 AM
 To: core-user@hadoop.apache.org
 Subject: RE: Does Hadoop Honor Reserved Space?
  
 I've run into a similar issue in the past. From what I understand,
 this
 parameter only controls the HDFS space usage. However, the
 intermediate data
 in
 the map reduce job is stored on the local file system (not HDFS) and
 is not
 subject to this configuration.
 
 In the past I have used mapred.local.dir.minspacekill and
 mapred.local.dir.minspacestart to control the amount of space that is
 allowable
 for use by this temporary data.
 
 Not sure if that is the best approach though, so I'd love to hear
 what
 other
 people have done. In your case, you have a map-red job that will
 consume too
 much space (without setting a limit, you didn't have enough disk
 capacity for
 the job), so looking at mapred.output.compress and
 mapred.compress.map.output
 might be useful to decrease the job's disk requirements.
 
 --Ash
 
 -Original Message-
 From: Jimmy Wan [mailto:[EMAIL PROTECTED]
 Sent: Thursday, March 06, 2008 9:56 AM
 To: core-user@hadoop.apache.org
 Subject: Does Hadoop Honor Reserved Space?
 
 I've got 2 datanodes setup with the following configuration
 parameter:
 <property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
 </property>
 
 Both are housed on 800GB volumes, so I thought this would keep about
 half
 the volume free for non-HDFS usage.
 
 After some long running jobs last night, both disk volumes were
 completely
 filled. The bulk of the data was in:
 

Re: zombie data nodes, not alive but not dead

2008-03-10 Thread Dave Coyle
On 2008-03-10 23:37:36 -0400, [EMAIL PROTECTED] wrote:
 I can leave the cluster running  for hours and this slave will never 
 register itself with the namenode. I've been messing with this problem 
 for three days now and I'm out of ideas. Any suggestions?

I had a similar-sounding problem with a 0.16.0 setup I had...
namenode thinks datanodes are dead, but the datanodes complain if
namenode is unreachable so there must be *some* connectivity.
Admittedly I haven't had the time yet to recreate what I did to see if
I had just mangled some config somewhere, but I was eventually able to
sort out my problem by...and yes, this sounds a bit wacky... running
a given datanode interactively, suspending it, then bringing it back
to the foreground.  E.g. (assuming your namenode is already running):

$ bin/hadoop datanode
ctrl-Z
$ fg

and the datanode then magically registered with the namenode.

Give it a shot... I'm curious to hear if it works for you, too.

-Coyle