Question about Hadoop filesystem
How do I remove a datanode? Do I simply destroy my datanode and the namenode will automatically detect it, or is there a more elegant way to do it? Also, when I remove a datanode, does Hadoop automatically re-replicate the data right away? Thanks, Harold
Re: Question about Hadoop filesystem
It's in the FAQ: http://wiki.apache.org/hadoop/FAQ#17 Brian
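For what it's worth, the FAQ entry describes decommissioning, which is the graceful alternative to just killing the node. A minimal sketch of the usual steps (the exclude-file path and host name here are illustrative):

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/conf/excludes</value>
  </property>

Add the datanode's hostname to that excludes file on the namenode, then run:

  bin/hadoop dfsadmin -refreshNodes

The namenode marks the node as decommissioning and re-replicates its blocks to other nodes; once that finishes, the datanode can be shut down safely. If a datanode simply disappears instead, re-replication only begins after the namenode declares it dead, which takes on the order of ten minutes with default settings.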
question about hadoop and amazon ec2 ?
hi: What is the relationship between Hadoop and Amazon EC2? Can Hadoop run directly on a common PC (not a server)? Why do some people say Hadoop runs on Amazon EC2? thanks!
Re: question about hadoop and amazon ec2 ?
1. They are related in that one can use EC2 to provide the machines that serve as the computation part for Hadoop. Refer: http://wiki.apache.org/hadoop/AmazonEC2
2. Yes. Refer: http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
3. Because EC2 can supply the computation nodes for Hadoop, as in (1).
--nitesh
Nitesh Bhatia
Dhirubhai Ambani Institute of Information Communication Technology
Gandhinagar, Gujarat
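As a rough illustration of what "running Hadoop on EC2" means in practice, the contrib EC2 scripts described on that wiki page are driven roughly like this (the cluster name and size are made up; exact usage depends on the Hadoop version):

  bin/hadoop-ec2 launch-cluster my-hadoop-cluster 10
  bin/hadoop-ec2 login my-hadoop-cluster

EC2 only supplies the (rented, virtual) machines; Hadoop itself is ordinary software that can just as well run on a single common PC for learning and testing, as the single-node tutorial above shows.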
Re: Question about Hadoop 's Feature(s)
Jason Rutherglen wrote: I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. It is, I believe, faster than public key / private key because it uses a secret key and does not require public key provisioning like PKI would. Perhaps it would be a baseline way to sign the data. That should work for authenticating messages between (trusted) nodes. Presumably the ipc.key value could be set in the Conf and all would be well. External job submitters shouldn't be given those keys; they'd need an HTTP(S) front end that could authenticate them however the organisation worked. Yes, that would be simpler. I am not enough of a security expert to say if it will work, but the keys should be easier to work with. As long as the configuration files are kept secure, your cluster will be locked. However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there? -steve
Re: Question about Hadoop 's Feature(s)
However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?

I am not familiar with HDFS over HTTP. Could it simply sign the stream and include the signature at the end of the HTTP message returned?
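For reference, a minimal sketch of what HMAC signing looks like in plain Java, using only the standard javax.crypto API (this is illustrative, not Hadoop's actual IPC code; key distribution is left out):

  import javax.crypto.Mac;
  import javax.crypto.spec.SecretKeySpec;

  public class StreamSigner {
      // Compute an HMAC-SHA1 signature over a message with a shared secret key.
      // Both ends must hold the same secret; the receiver recomputes and compares.
      public static byte[] sign(byte[] sharedSecret, byte[] message) throws Exception {
          Mac mac = Mac.getInstance("HmacSHA1");
          mac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
          return mac.doFinal(message);
      }
  }

For a streamed HTTP response one would call mac.update(buffer, 0, bytesRead) as the blocks are written and append mac.doFinal() as a trailer, which is roughly what "sign the stream and include the signature at the end of the HTTP message" would amount to.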
Re: Question about Hadoop 's Feature(s)
Owen O'Malley wrote: The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate 100's of terabytes.

That said, what a database gives you - on the right hardware - is very fast responses, especially if the indices are set up right and the data denormalised when appropriate. There is also really good integration with tools and application servers, with things like Java EE designed to make running code against a database easy. Not using Oracle means you don't have to work with an Oracle DBA, which, in my experience, can only be a good thing. DBAs and developers never seem to see eye-to-eye.

Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months. Right now you need to trust everyone else on the network where you run hadoop to not be malicious; the filesystem and job tracker interfaces are insecure. The forthcoming 0.19 release will ask who you are, but the far end trusts you to be who you say you are. In that respect, it's as secure as NFS over UDP.

To secure Hadoop you'd probably need to
- sign every IPC request, with a CPU time cost at both ends.
- require some form of authentication for the HTTP exported parts of the system, such as digest authentication, or issue lots of HTTPS private keys and use that instead, giving everyone a key management problem as well as extra communications overhead.

What is easier would be to lock down remote access to the filesystem/job submission so that only authenticated users would be able to upload jobs and data. The cluster would continue to trust everything else on its network, but the system doesn't trust people to submit work unless they could prove who they were.
Re: Question about Hadoop 's Feature(s)
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote: We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about using Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate 100's of terabytes. The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher level query languages like SQL (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months. -- Owen
RE: Question about Hadoop 's Feature(s)
Dear Mr/Mrs Owen O'Malley, First I would like to thank you very much for your reply; it was pretty much the exact answer I expected. As I read about Hadoop's query languages, it is a combination of Pig (Pig Latin), Hive, HBase, Jaql and more, and I could see that Hadoop has the advantage of an SQL-like query language. The thing I was most curious about is Hadoop's security level, which is hard to find in any document I searched. Like many other organizations, we believe in the fast growth of Hadoop and intend to use it in our serious projects. Once again, thanks for the reply; now I can tell our clients clearly about Hadoop. Best Regards. Tuan Cuong, Trinh. [EMAIL PROTECTED] Luvina Software Company. Website : www.luvina.net
Re: Question about Hadoop
Thank you very much for explaining it to me, Ted. That's a great deal of info! I guess that could be how the Yahoo WebMap is designed. And for anyone trying to figure out the massiveness of Hadoop computing, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ should give a good picture of a practical case. I was for a moment flabbergasted, and instantly fell in love with Hadoop! ;)
Re: Question about Hadoop
Usually hadoop programs are not used interactively since what they excel at is batch operations on very large collections of data. It is quite reasonable to store resulting data in hadoop and access those results using hadoop. The cleanest way to do that is to have a presentation layer web server that has all of the UI on it and use http to access the results file from hadoop via the namenode's data access URL. This works well where the results are not particularly voluminous.

For large quantities of data such as the output of a web-crawl, it is usually better to copy the output out of hadoop and into a clustered system that supports high speed querying of the data. This clustered system might be as simple as a redundant memcache or mySql farm or as fancy as a sharded and replicated farm of text retrieval engines running under Solr. What works for you will vary by what you need to do.

You should keep in mind that hadoop was designed for very long MTBF (for a cluster), but not designed for zero downtime operation. At the very least, you will occasionally want to upgrade the cluster software and that currently can't be done during normal operations. Combining hadoop (for heavy duty computations) with a separate persistence layer (for high availability web service) is a good hybrid. -- ted
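To make the "presentation layer reads results back out of hadoop" idea concrete, here is a minimal sketch using the standard FileSystem API (the output path and configuration are illustrative; a real front end would stream to the HTTP client rather than to stdout):

  import java.io.InputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class ResultFetcher {
      public static void main(String[] args) throws Exception {
          // Picks up fs.default.name (e.g. hdfs://namenode:9000) from the config on the classpath.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Read one reduce output file produced by a job.
          InputStream in = fs.open(new Path("/results/part-00000"));
          try {
              IOUtils.copyBytes(in, System.out, 4096, false);
          } finally {
              in.close();
          }
      }
  }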
Re: Question about Hadoop
Ideally what you would want is your data to be on HDFS, so you can run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which contain multiple image files each. Hadoop offers something called streaming, using which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.

Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html and information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F

Thanks, Lohit

- Original Message - From: Chanchal James [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Thursday, June 12, 2008 9:42:46 AM Subject: Question about Hadoop

Hi, I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a php application would benefit from this, say an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data and application files in HDFS always, and not use local file system storage? Thank you.
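A hedged sketch of the streaming invocation Lohit describes, using the StreamXmlRecordReader mentioned in the linked FAQ entry (the jar path, mapper/reducer scripts and record tag are illustrative):

  bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
      -input /user/me/images-as-xml \
      -output /user/me/out \
      -mapper my_mapper.php \
      -reducer my_reducer.php \
      -inputreader "StreamXmlRecordReader,begin=<image>,end=</image>"

Each map task then receives whole <image>...</image> records rather than arbitrary byte-level splits.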
Re: Question about Hadoop
Once it is in HDFS, you already have backups (due to the replicated file system). Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causes confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data. On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James [EMAIL PROTECTED] wrote: If I keep all data in HDFS, is there any way I can back it up regularly?
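If a separate, regularly scheduled copy is wanted in addition to replication, distcp is the usual tool for bulk copies between clusters or directories; a rough sketch with placeholder addresses:

  bin/hadoop distcp hdfs://namenode:9000/data hdfs://backup-namenode:9000/backups/data-2008-06-12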
Re: Question about Hadoop
Thank you all for the responses. So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming? Is that how Hadoop is used? Sorry the question may sound too silly. Thank you.
question about hadoop 0.17 upgrade
I upgraded from 0.16.3 to 0.17, and errors appear when starting DFS and the JobTracker. What can I do about it? Thanks! I have used the "start-dfs.sh -upgrade" command to upgrade the filesystem. Below is the error log:

2008-05-26 09:14:33,463 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/
STARTUP_MSG: Starting JobTracker
STARTUP_MSG: host = test180.sqa/192.168.207.180
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.17.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523; compiled by 'hadoopqa' on Thu May 15 07:22:55 UTC 2008
/
2008-05-26 09:14:33,567 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=9001
2008-05-26 09:14:33,610 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9001: starting
2008-05-26 09:14:33,664 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-05-26 09:14:33,733 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-05-26 09:14:33,962 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:33,998 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-05-26 09:14:34,000 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50030
2008-05-26 09:14:34,000 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:34,002 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2008-05-26 09:14:34,096 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: /home/hadoop/HadoopInstall/tmp/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /home/hadoop/HadoopInstall/tmp/mapred/system. Name node is in safe mode. The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1519)
at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1498)
at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:383)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
at
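For what it's worth, that SafeModeException usually just means the namenode is still waiting for the datanodes to report their blocks after the restart/upgrade; the JobTracker keeps retrying the system-directory cleanup, so the error normally goes away once safe mode ends. The commands typically used to check on this (all present in 0.17; run from the Hadoop install directory):

  bin/hadoop dfsadmin -safemode get            # is the namenode still in safe mode?
  bin/hadoop dfsadmin -safemode wait           # block until it leaves safe mode
  bin/hadoop dfsadmin -upgradeProgress status  # has the distributed upgrade finished?

If the datanodes never report in, check their logs; the ratio of reported blocks in the message above suggests no datanodes had connected yet.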
Re: One Simple Question About Hadoop DFS
On Sun, 23 Mar 2008, Chaman Singh Verma wrote: Hello, I am exploring Hadoop and MapReduce and I have one very simple question. I have a 500GB dataset on my local disk and I have written both Map and Reduce functions. Now how should I start?

1. I copy the data from local disk to DFS. I have configured DFS with 100 machines. I hope that it will split the file across the 100 nodes (with some replication).

Yes. You need to copy the data from your local disk to the DFS. It will split the files based on the dfs block size (dfs.block.size). The default block size is 64MB and hence there would be 8000 blocks.

2. For MapReduce should I specify 100 nodes for SetMaxMapTask()? If I specify less than 100, will the blocks migrate? If the blocks don't migrate then why is this function provided to the users? Why is the number of tasks not taken from the startup script?

Again, here the max number of maps is bounded by the dfs block size. Hence in the default case you would have 8000 maps (unless you have your own input format).

3. If I specify more than 100, will load balancing be done automatically or does the user have to specify that also?

In short, it is the dfs block size along with the input format that controls the number of maps. The number of maps given to the framework is used as a hint. Sometimes it doesn't matter what value is passed. Amar

Perhaps these are very simple questions, but I think that MapReduce simplifies lots of things (compared to MPI-based programming), so beginners like me have a difficult time understanding the model. csv
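To make the arithmetic and the "hint" behaviour concrete, a small illustrative sketch using the old JobConf API (the class name is made up; the numbers are the defaults discussed above):

  import org.apache.hadoop.mapred.JobConf;

  public class MyJobDriver {
      public static void main(String[] args) {
          JobConf conf = new JobConf(MyJobDriver.class);
          // 500 GB of input at the default 64 MB dfs.block.size is roughly 8000 blocks,
          // so the default input format will create roughly 8000 map tasks.
          conf.setNumMapTasks(100);   // only a hint; the InputFormat's splits decide the real number
          conf.setNumReduceTasks(10); // the reduce count, by contrast, is used as given
      }
  }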