Question about Hadoop filesystem

2009-06-04 Thread Harold Lim

How do I remove a datanode? Do I simply destroy my datanode, and will the namenode 
automatically detect it? Is there a more elegant way to do it?

Also, when I remove a datanode, does Hadoop automatically re-replicate the data 
right away?



Thanks,
Harold


  


Re: Question about Hadoop filesystem

2009-06-04 Thread Brian Bockelman

It's in the FAQ:

http://wiki.apache.org/hadoop/FAQ#17

Brian

On Jun 4, 2009, at 6:26 PM, Harold Lim wrote:



How do I remove a datanode? Do I simply destroy my datanode, and will the
namenode automatically detect it? Is there a more elegant way to do it?


Also, when I remove a datanode, does Hadoop automatically
re-replicate the data right away?




Thanks,
Harold







question about hadoop and amazon ec2 ?

2009-02-15 Thread buddha1021

hi:
What is the relationship between Hadoop and Amazon EC2?
Can Hadoop run directly on a common PC (not a server)?
Why do some people say Hadoop runs on Amazon EC2?
thanks!
-- 
View this message in context: 
http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: question about hadoop and amazon ec2 ?

2009-02-15 Thread nitesh bhatia
1. They are related in that one can use EC2 to serve the computation part
for Hadoop.
Refer: http://wiki.apache.org/hadoop/AmazonEC2

2. Yes.
Refer: 
http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

3. Because you can use EC2 to serve the computation part for Hadoop.

--nitesh

On Sun, Feb 15, 2009 at 2:18 PM, buddha1021 buddha1...@yahoo.cn wrote:

 hi:
 What is the relationship between Hadoop and Amazon EC2?
 Can Hadoop run directly on a common PC (not a server)?
 Why do some people say Hadoop runs on Amazon EC2?
 thanks!
 --
 View this message in context: 
 http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: Question about Hadoop 's Feature(s)

2008-09-30 Thread Steve Loughran

Jason Rutherglen wrote:

I implemented an RMI protocol using Hadoop IPC and implemented basic
HMAC signing.  It is, I believe, faster than public/private-key signing
because it uses a secret key and does not require public key
provisioning like PKI would.  Perhaps it would be a baseline way to
sign the data.


That should work for authenticating messages between (trusted) nodes. 
Presumably the ipc.key value could be set in the Conf and all would be well.


External job submitters shouldn't be given those keys; they'd need an 
HTTP(S) front end that could authenticate them however the organisation 
worked.


Yes, that would be simpler. I am not enough of a security expert to say 
if it will work, but the keys should be easier to work with. As long as 
the configuration files are kept secure, your cluster will be locked.


However, HDFS uses HTTP to serve blocks up - that needs to be locked down
too. Would the signing work there?


-steve


Re: Question about Hadoop 's Feature(s)

2008-09-30 Thread Jason Rutherglen
 However, HDFS uses HTTP to serve blocks up -that needs to be locked down
  too. Would the signing work there?

I am not familiar with HDFS over HTTP.  Could it simply sign the
stream and include the signature at the end of the HTTP message
returned?
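
To make the shared-secret idea concrete, a rough sketch of signing a stream with HMAC in plain Java is below (the helper and key handling are illustrative only, not Hadoop's actual IPC or HTTP code):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.io.InputStream;

public class StreamSigner {
    // Hypothetical helper: digest a stream with HMAC-SHA1 using a shared secret.
    // The receiver, holding the same secret, recomputes the HMAC and compares it
    // with the signature appended after the body.
    public static byte[] sign(InputStream in, byte[] sharedSecret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            mac.update(buf, 0, n);   // hash the bytes as they are streamed out
        }
        return mac.doFinal();        // e.g. sent as a trailer after the HTTP body
    }
}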

On Tue, Sep 30, 2008 at 8:56 AM, Steve Loughran [EMAIL PROTECTED] wrote:
 Jason Rutherglen wrote:

 I implemented an RMI protocol using Hadoop IPC and implemented basic
 HMAC signing.  It is I believe faster than public key private key
 because it uses a secret key and does not require public key
 provisioning like PKI would.  Perhaps it would be a baseline way to
 sign the data.

 That should work for authenticating messages between (trusted) nodes.
 Presumably the ipc.key value could be set in the Conf and all would be well.

 External job submitters shouldn't be given those keys; they'd need an
 HTTP(S) front end that could authenticate them however the organisation
 worked.

 Yes, that would be simpler. I am not enough of a security expert to say if
 it will work, but the keys should be easier to work with. As long as the
 configuration files are kept secure, your cluster will be locked.

 However, HDFS uses HTTP to serve blocks up -that needs to be locked down
  too. Would the signing work there?

 -steve



Re: Question about Hadoop 's Feature(s)

2008-09-25 Thread Steve Loughran

Owen O'Malley wrote:

On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:

We are developing a project and we intend to use Hadoop to handle 
the processing of vast amounts of data. But to convince our customers 
about using Hadoop in our project, we must show them the 
advantages (and maybe the disadvantages) of deploying the project 
with Hadoop compared to an Oracle database platform.


The primary advantage of Hadoop is scalability. On an equivalent 
hardware budget, Hadoop can handle much much larger databases. We had a 
process that was run once a week on Oracle that is now run once an hour 
on Hadoop. Additionally, Hadoop scales out much much farther. We can 
store petabytes of data in a single Hadoop cluster and have jobs that 
read and generate 100's of terabytes.


That said, what a database gives you -on the right hardware- is very 
fast responses, especially if the indices are set up right and the data 
denormalised when appropriate. There is also really good integration 
with tools and application servers, with things like Java EE designed to 
make running code against a database easy.


Not using Oracle means you don't have to work with an Oracle DBA, which, 
in my experience, can only be a good thing. DBAs and developers never 
seem to see eye-to-eye.





 Hadoop only has very primitive 
security at the moment, although I expect that to change in the next 6 
months.




Right now you need to trust everyone else on the network where you run 
hadoop to not be malicious; the filesystem and job tracker interfaces 
are insecure. The forthcoming 0.19 release will ask who you are, but the 
far end trusts you to be who you say you are. In that respect, it's as 
secure as NFS over UDP.


To secure Hadoop you'd probably need to
 -sign every IPC request, with a CPU time cost at both ends.
 -require some form of authentication for the HTTP-exported parts of 
the system, such as digest authentication, or issue lots of HTTPS 
private keys and use those instead, giving everyone a key management 
problem as well as extra communications overhead.


What would be easier is to lock down remote access to the filesystem/job 
submission so that only authenticated users are able to upload jobs 
and data. The cluster would continue to trust everything else on its 
network, but the system wouldn't trust people to submit work unless they 
could prove who they were.




Re: Question about Hadoop 's Feature(s)

2008-09-24 Thread Owen O'Malley

On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:

We are developing a project and we intend to use Hadoop to  
handle the processing of vast amounts of data. But to convince our  
customers about using Hadoop in our project, we must show  
them the advantages (and maybe the disadvantages) of deploying the  
project with Hadoop compared to an Oracle database platform.


The primary advantage of Hadoop is scalability. On an equivalent  
hardware budget, Hadoop can handle much much larger databases. We had  
a process that was run once a week on Oracle that is now run once an  
hour on Hadoop. Additionally, Hadoop scales out much much farther. We  
can store petabytes of data in a single Hadoop cluster and have jobs  
that read and generate 100's of terabytes.


The disadvantage of Hadoop is that it is still relatively young and  
growing fast, so there are growing pains. Hadoop has recently gotten  
higher level query languages like SQL (Pig, Hive, and Jaql), but still  
doesn't have any fancy report generators. Hadoop only has very  
primitive security at the moment, although I expect that to change in  
the next 6 months.


-- Owen


RE: Question about Hadoop 's Feature(s)

2008-09-24 Thread Trinh Tuan Cuong
Dear Mr/Mrs Owen O'Malley,

First, I would like to thank you very much for your reply; it was exactly
the answer I expected. From what I have read about Hadoop's query
languages, they are a combination of Pig (Pig Latin), Hive, HBase, Jaql and
more, so I can see that Hadoop has the advantage of an SQL-like query
language. The thing I was most curious about is Hadoop's security level,
which is hard to find in any of the documents I searched. Like many others,
we believe in the fast growth of Hadoop and intend
to use it in our serious projects. Once again, thanks for the reply; now
I can tell our clients clearly about Hadoop.

Best Regards.

Tuan Cuong, Trinh.
[EMAIL PROTECTED]
Luvina Software Company.
Website : www.luvina.net

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 11:27 PM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop 's Feature(s)

On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:

 We are developing a project and we intend to use Hadoop to  
 handle the processing of vast amounts of data. But to convince our  
 customers about using Hadoop in our project, we must show  
 them the advantages (and maybe the disadvantages) of deploying the  
 project with Hadoop compared to an Oracle database platform.

The primary advantage of Hadoop is scalability. On an equivalent  
hardware budget, Hadoop can handle much much larger databases. We had  
a process that was run once a week on Oracle that is now run once an  
hour on Hadoop. Additionally, Hadoop scales out much much farther. We  
can store petabytes of data in a single Hadoop cluster and have jobs  
that read and generate 100's of terabytes.

The disadvantage of Hadoop is that it is still relatively young and  
growing fast, so there are growing pains. Hadoop has recently gotten  
higher level query languages like SQL (Pig, Hive, and Jaql), but still  
doesn't have any fancy report generators. Hadoop only has very  
primitive security at the moment, although I expect that to change in  
the next 6 months.

-- Owen




Re: Question about Hadoop

2008-06-14 Thread Chanchal James
Thank you very much for explaining it to me, Ted. That's a great deal of
info! I guess that could be how Yahoo Webmap is designed.

And for anyone trying to figure out the massiveness of Hadoop computing,
http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
should give a good picture of a practical case. I was for a moment
flabbergasted, and instantly fell in love with Hadoop! ;)


On Sat, Jun 14, 2008 at 12:11 AM, Ted Dunning [EMAIL PROTECTED] wrote:

 Usually hadoop programs are not used interactively since what they excel at
 is batch operations on very large collections of data.

 It is quite reasonable to store resulting data in hadoop and access those
 results using hadoop.  The cleanest way to do that is to have a
 presentation
 layer web server that has all of the UI on it and use http to access the
 results file from hadoop via the namenodes data access URL.  This works
 well
 where the results are not particularly voluminous.

 For large quantities of data such as the output of a web-crawl, it is
 usually better to copy the output out of hadoop and into a clustered system
 that supports high speed querying of the data.  This clustered system might
 be as simple as a redundant memcache or mySql farm or as fancy as a sharded
 and replicated farm of text retrieval engines running under Solr.  What
 works for you will vary by what you need to do.

 You should keep in mind that hadoop was designed for very long MTBF (for a
 cluster), but not designed for zero downtime operation.  At the very least,
 you will occasionally want to upgrade the cluster software and that
 currently can't be done during normal operations.  Combining hadoop (for
 heavy duty computations) with a separate persistence layer (for high
 availability web service) is a good hybrid.

 On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James [EMAIL PROTECTED]
 wrote:

  Thank you all for the responses.
 
  So in order to run a web-based application, I just need to put the part
 of
  the application that needs to make use of distributed computation in
 HDFS,
  and have the other web site related files access it via Hadoop streaming
 ?
 
  Is that how Hadoop is used ?
 
  Sorry the question may sound too silly.
 
  Thank you.
 
 

 --
 ted



Re: Question about Hadoop

2008-06-13 Thread Ted Dunning
Usually hadoop programs are not used interactively since what they excel at
is batch operations on very large collections of data.

It is quite reasonable to store resulting data in hadoop and access those
results using hadoop.  The cleanest way to do that is to have a presentation
layer web server that has all of the UI on it and use HTTP to access the
results file from hadoop via the namenode's data access URL.  This works well
where the results are not particularly voluminous.

For large quantities of data such as the output of a web-crawl, it is
usually better to copy the output out of hadoop and into a clustered system
that supports high speed querying of the data.  This clustered system might
be as simple as a redundant memcache or mySql farm or as fancy as a sharded
and replicated farm of text retrieval engines running under Solr.  What
works for you will vary by what you need to do.

You should keep in mind that hadoop was designed for very long MTBF (for a
cluster), but not designed for zero downtime operation.  At the very least,
you will occasionally want to upgrade the cluster software and that
currently can't be done during normal operations.  Combining hadoop (for
heavy duty computations) with a separate persistence layer (for high
availability web service) is a good hybrid.
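
As a rough illustration of that hybrid, reading a job's result file from HDFS inside a presentation-layer service could look something like the sketch below (the path and configuration are made up for the example; serving the file over the namenode's HTTP interface would be an alternative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ResultReader {
    public static void main(String[] args) throws Exception {
        // fs.default.name in the configuration points the client at the cluster's namenode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path("/user/web/reports/part-00000");  // hypothetical result file
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(result)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);  // in practice, hand the lines to the web/UI layer
        }
        reader.close();
        fs.close();
    }
}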

On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James [EMAIL PROTECTED] wrote:

 Thank you all for the responses.

 So in order to run a web-based application, I just need to put the part of
 the application that needs to make use of distributed computation in HDFS,
 and have the other web site related files access it via Hadoop streaming ?

 Is that how Hadoop is used ?

 Sorry the question may sound too silly.

 Thank you.


 On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning [EMAIL PROTECTED]
 wrote:

  Once it is in HDFS, you already have backups (due to the replicated file
  system).
 
  Your problems with deleting the dfs data directory are likely
 configuration
  problems combined with versioning of the data store (done to avoid
  confusion, but usually causes confusion).  Once you get the configuration
  and operational issues sorted out, you shouldn't lose any data.
 
  On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James [EMAIL PROTECTED]
  wrote:
 
  
   If I keep all data in HDFS, is there anyway I can back it up regularly.
  
  
 




-- 
ted


Re: Question about Hadoop

2008-06-12 Thread lohit
Ideally what you would want is your data to be on HDFS, and to run your map/reduce 
jobs on that data. The Hadoop framework splits your data and feeds those splits 
to the map tasks. One problem with image files is that you will not 
be able to split them. Alternatively, people have done this: they wrap image 
files within XML and create huge files which contain multiple image files. 
Hadoop offers something called streaming, with which you will be able to split 
the files at XML boundaries and feed them to your map/reduce tasks. Streaming also 
enables you to use code in other languages, like Perl/PHP/C++. 
Check info about streaming here 
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
And information about parsing XML files in streaming in here 
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F

Thanks,
Lohit
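
As an illustration of the wrap-images-in-XML idea (not taken from any Hadoop example; the file and element names are made up, and base64 is just one way to embed binary data), a small pre-processing sketch in Java:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.util.Base64;

public class ImageBundler {
    // Bundle many small image files into one big XML file so that a later job
    // can split records at the <image>...</image> boundaries.
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("images.xml");   // output name is illustrative
        out.println("<images>");
        for (String name : args) {
            File f = new File(name);
            byte[] data = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            in.readFully(data);                             // read the whole image into memory
            in.close();
            out.println("<image name=\"" + f.getName() + "\">"
                    + Base64.getEncoder().encodeToString(data) + "</image>");
        }
        out.println("</images>");
        out.close();
    }
}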

- Original Message 
From: Chanchal James [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Thursday, June 12, 2008 9:42:46 AM
Subject: Question about Hadoop

Hi,



I have a question about Hadoop. I am a beginner and just testing Hadoop.
Would like to know how a PHP application would benefit from this, say an
application that needs to work on a large number of image files. Do I have to
store the application in HDFS always, or do I just copy it to HDFS when
needed, do the processing, and then copy it back to the local file system?
Is that the case with the data files too? Once I have Hadoop running, do I
keep all data and application files in HDFS always, and not use local file
system storage?

Thank you.



Re: Question about Hadoop

2008-06-12 Thread Ted Dunning
Once it is in HDFS, you already have backups (due to the replicated file
system).

Your problems with deleting the dfs data directory are likely configuration
problems combined with versioning of the data store (done to avoid
confusion, but usually causes confusion).  Once you get the configuration
and operational issues sorted out, you shouldn't lose any data.

On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James [EMAIL PROTECTED] wrote:


 If I keep all data in HDFS, is there anyway I can back it up regularly.




Re: Question about Hadoop

2008-06-12 Thread Chanchal James
Thank you all for the responses.

So in order to run a web-based application, I just need to put the part of
the application that needs to make use of distributed computation in HDFS,
and have the other web site related files access it via Hadoop streaming ?

Is that how Hadoop is used ?

Sorry the question may sound too silly.

Thank you.


On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning [EMAIL PROTECTED] wrote:

 Once it is in HDFS, you already have backups (due to the replicated file
 system).

 Your problems with deleting the dfs data directory are likely configuration
 problems combined with versioning of the data store (done to avoid
 confusion, but usually causes confusion).  Once you get the configuration
 and operational issues sorted out, you shouldn't lose any data.

 On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James [EMAIL PROTECTED]
 wrote:

 
  If I keep all data in HDFS, is there anyway I can back it up regularly.
 
 



question about hadoop 0.17 upgrade

2008-05-25 Thread 志远
 

After upgrading from 0.16.3 to 0.17, an error appears when starting the DFS and
the JobTracker. What can I do about it? Thanks!

I have used the "start-dfs.sh -upgrade" command to upgrade the filesystem.

Below is the error log:

 

2008-05-26 09:14:33,463 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = test180.sqa/192.168.207.180
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523; compiled by 'hadoopqa' on Thu May 15 07:22:55 UTC 2008
/
2008-05-26 09:14:33,567 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=9001
2008-05-26 09:14:33,610 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9001: starting
2008-05-26 09:14:33,664 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-05-26 09:14:33,733 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-05-26 09:14:33,962 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:33,998 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-05-26 09:14:34,000 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50030
2008-05-26 09:14:34,000 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:34,002 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2008-05-26 09:14:34,096 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: /home/hadoop/HadoopInstall/tmp/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /home/hadoop/HadoopInstall/tmp/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
    at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1519)
    at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1498)
    at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:383)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
    at 

Re: One Simple Question About Hadoop DFS

2008-03-23 Thread Amar Kamat
On Sun, 23 Mar 2008, Chaman Singh Verma wrote:

 Hello,

 I am exploring Hadoop and MapReduce and I have one very simple question.

 I have 500GB dataset on my local disk and I have written both Map-Reduce 
 functions. Now how should I start ?

 1.  I copy the data from local disk to DFS. I have configured DFS with 100 
 machines. I hope that it will split the file on 100 nodes ( With some 
 replications).

Yes. You need to copy the data from your local disk to the DFS. It will
split the files based on the dfs block size (dfs.block.size). The default
block size is 64MB, and hence there would be 8000 blocks (500GB / 64MB = 8000).
 2. For MapReduce should I specify 100 nodes for SetMaxMapTask()? If I specify
 less than 100, will the blocks migrate? If the blocks don't migrate, then
 why is this function provided to the users? Why is the number of tasks not
 taken from the startup script?

Again here the max number of maps is bounded by the dfs block size. Hence
in the default case you would have 8000 maps (unless you have your own
input format).
 3. If I specify more than 100, will load balancing be done automatically,
 or does the user have to specify that also?

In short, it's the dfs block size along with the input format that controls
the number of maps. The number of maps given to the framework is used only as
a hint; sometimes it doesn't matter what value is passed.
Amar
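
To make the hint/split distinction concrete, a minimal sketch with the old mapred API (the input path is made up for the example): setNumMapTasks() only suggests a map count, while the actual number of splits follows from the input size, the block size, and the input format.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SplitHint {
    public static void main(String[] args) {
        JobConf job = new JobConf(SplitHint.class);
        FileInputFormat.setInputPaths(job, new Path("/data/input"));  // hypothetical 500GB input
        // With the default 64MB block size, a 500GB input yields roughly 8000 blocks,
        // so the framework will create on the order of 8000 map tasks regardless of this hint.
        job.setNumMapTasks(100);
    }
}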
 Perhaps these are very simple questions, but although MapReduce
 simplifies lots of things (compared to MPI-based programming), beginners
 like me have a difficult time understanding the model.

 csv



