Run Hadoop code itself in Eclipse
Hi! I want to build and run the Hadoop code itself from Eclipse. I followed the instructions at http://wiki.apache.org/hadoop/EclipseEnvironment:

1) I checked out the Hadoop release from http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.21.0/ in Eclipse Europa.
2) I got the common, hdfs and mapreduce folders under the project's main directory.
3) Since I am behind a proxy, I added proxy settings to the build.xml files in these three folders and ran Ant builds in Eclipse, which succeeded.
4) Some additional steps were mentioned for running tests in Eclipse: selecting the build dir (I selected $Workspace/$project/build) and adding tools.jar.

When I try to Run As > Java Application, I get this error:

Illegal option: a
Usage: jar {ctxui}[vfm0Me] [jar-file] [manifest-file] [entry-point] [-C dir] files ...

I don't know how to proceed from here. Any help?

Thanks,
Arun
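For reference, the proxy settings I added to each build.xml look roughly like this (host and port are placeholders for our actual proxy), using Ant's built-in <setproxy> task:

<target name="proxy">
  <setproxy proxyhost="proxy.example.com" proxyport="8080"/>
</target>

I make the build targets depend on this target so the dependency downloads go through the proxy.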
How big data and/or how many machines do I need to take advantage of Hadoop?
Hadoop newbie here. I wrapped my company's entity extraction product in a Hadoop task and gave it a large file, on the order of 100MB. I have 4 VMs running on a 24-core CPU server; I made two of them slave nodes, one the namenode, and another the job tracker. It turned out that processing the same data takes longer using Hadoop than processing it serially. I am curious how I can experience the advantage of Hadoop. Is having many physical machines essential? Would I need to process terabytes of data? What would be the minimum setup where I can experience the advantage of Hadoop? T. Kuro Kurosaka
Re: How big data and/or how many machines do I need to take advantage of Hadoop?
Hi Kuro,

A 100MB file should take about 1 second to read, and MR jobs typically get scheduled on the order of seconds. So it's unlikely you'll see any benefit at this scale. You'll probably want to have a look at Amdahl's law: http://en.wikipedia.org/wiki/Amdahl%27s_law

Brian
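To put illustrative numbers on Brian's point (these are assumptions, not measurements from this cluster): suppose scheduling and startup cost a fixed 20 seconds and the actual processing takes 30 seconds serially, so the parallelizable fraction is p = 30/50 = 0.6. Amdahl's law, speedup = 1 / ((1 - p) + p/n), then gives 1 / (0.4 + 0.6/4), about 1.8x, on n = 4 slots, with a hard ceiling of 1/0.4 = 2.5x no matter how many nodes are added. A job with only seconds of real work is dominated by the fixed overhead.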
Re: Distributed cluster filesystem on EC2
Dmitry,

It sounds like an interesting idea, but I have not really heard of anyone doing it before. It would make a good feature to have tiered file systems all mapped into the same namespace, but that would be a lot of work and complexity. The quick solution would be to know what data you want to process beforehand and then run distcp to copy it from S3 into HDFS before launching the other map/reduce jobs. I don't think there is anything automatic out there.

--Bobby Evans

On 8/29/11 4:56 PM, Dmitry Pushkarev u...@stanford.edu wrote:

Dear hadoop users, sorry for the off-topic. We're slowly migrating our hadoop cluster to EC2, and one thing I'm trying to explore is whether we can use alternative scheduling systems like SGE with a shared FS for non-data-intensive tasks, since they are easier for lay users to work with. One problem for now is how to create a shared cluster filesystem similar to HDFS: distributed, high-performance, somewhat POSIX compliant (symlinks and permissions), using Amazon EC2 local non-persistent storage. The idea is to keep the original data on S3, then as needed fire up a bunch of nodes, start the shared filesystem, quickly copy data from S3 to that FS, run the analysis with SGE, save the results, and shut the filesystem down. I tried things like S3FS and similar native S3 implementations, but the speed is too bad. Currently I just have a FS on my master node that is shared via NFS to all the rest, but I pretty much saturate 1Gb/s of bandwidth as soon as I start more than 10 nodes. Thank you. I'd appreciate any suggestions and links to relevant resources! Dmitry
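The distcp staging step would look something like this (bucket name and paths are placeholders; this assumes the s3n scheme current in this era of Hadoop):

$ hadoop distcp s3n://my-bucket/input hdfs://namenode:8020/user/dmitry/input

with the AWS credentials supplied via fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in core-site.xml (embedding them in the URI also works, but leaks them into logs).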
Binary content
Does map-reduce work well with binary content in files? The binary content is basically some CAD files, and the map-reduce program needs to read these files using some proprietary tool, extract values, and do some processing. Wondering if there are others doing similar types of processing, best practices, etc.
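A common pattern for jobs like this (a sketch with illustrative class names, not something from this thread) is a non-splittable input format that delivers each binary file to a mapper in one piece, so the proprietary parser always sees a complete file:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split a CAD file across mappers
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the whole file into a single value; the mapper sees exactly one record.
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }
}

Each map call then receives one whole CAD file as a BytesWritable and can hand the bytes to the extraction tool. If there are many small files, packing them into SequenceFiles first is the usual way to avoid per-file overhead.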
Re: Distributed cluster filesystem on EC2
You might consider Apache Whirr (http://whirr.apache.org/) for bringing up Hadoop clusters on EC2.

Cheers,
Tom
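A minimal Whirr recipe for this looks something like the following (cluster name, node counts, and credential variables are placeholders; check the Whirr docs for the exact properties in your version):

whirr.cluster-name=dmitry-ec2-test
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

Launch with:

$ whirr launch-cluster --config hadoop.properties

and when finished:

$ whirr destroy-cluster --config hadoop.properties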
Re: Distributed cluster filesystem on EC2
Thank you for your suggestion. I have years of experience with HDFS, and if I could I'd gladly use it as the filesystem, but our developers require fast random access and symlinks, so I was wondering if there are other options. I'm not too sensitive to data locality, but we need to eliminate network bottlenecks in the FS. Has anyone tried filesystems like GlusterFS, GFS, and so on? Thanks!
Re: Starting datanode in secure mode
Thanks Ravi. This has brought my local hadoop cluster to life! The two things I was missing:

1) Have to use privileged ports:

<!-- secure setup requires privileged ports -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>

2) Implied by 1): sudo is required to launch the datanode.

Clearly, this is geared towards production systems. For development, having the ability to run with Kerberos but without the need for privileged resources would be desirable.

On Aug 30, 2011, at 9:00 PM, Ravi Prakash wrote:

In short: you MUST use privileged resources. In long: here's what I did to set up a secure single-node cluster. I'm sure there are other ways, but this is how I did it.

1. Install krb5-server.
2. Set up the Kerberos configuration (files attached): /var/kerberos/krb5kdc/kdc.conf and /etc/krb5.conf. See http://yahoo.github.com/hadoop-common/installing.html
3. To clean up everything: http://mailman.mit.edu/pipermail/kerberos/2003-June/003312.html
4. Create the Kerberos database: $ sudo kdb5_util create -s
5. Start Kerberos: $ sudo /etc/rc.d/init.d/kadmin start and $ sudo /etc/rc.d/init.d/krb5kdc start
6. Create the principal raviprak/localhost.localdomain@localdomain. See http://web.mit.edu/kerberos/krb5-1.5/krb5-1.5.4/doc/krb5-admin/Adding-or-Modifying-Principals.html
7. Create the keytab file using "xst -k /home/raviprak/raviprak.keytab raviprak/localhost.localdomain@localdomain"
8. Set up hdfs-site.xml and core-site.xml (files attached).
9. $ sudo hostname localhost.localdomain
10. $ hadoop-daemon.sh start namenode
11. $ sudo bash, then export HADOOP_SECURE_DN_USER=raviprak, then hadoop-daemon.sh start datanode

= CORE-SITE.XML =

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9001</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>raviprak/localhost.localdomain</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>raviprak/localhost.localdomain</value>
  </property>
  <property>
    <name>dfs.secondary.namenode.kerberos.principal</name>
    <value>raviprak/localhost.localdomain</value>
  </property>
</configuration>

= HDFS-SITE.XML =

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir.restore</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>10</value>
  </property>
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>/home/raviprak/raviprak.keytab</value>
  </property>
  <property>
    <name>dfs.secondary.namenode.keytab.file</name>
    <value>/home/raviprak/raviprak.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.keytab.file</name>
    <value>/home/raviprak/raviprak.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:1004</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:1006</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>raviprak/localhost.localdomain@localdomain</value>
  </property>
  <property>
    <name>dfs.secondary.namenode.kerberos.principal</name>
    <value>raviprak/localhost.localdomain@localdomain</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>raviprak/localhost.localdomain@localdomain</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.https.principal</name>
    <value>raviprak/localhost.localdomain@localdomain</value>
  </property>
  <property>
    <name>dfs.secondary.namenode.kerberos.https.principal</name>
    <value>raviprak/localhost.localdomain@localdomain</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.https.principal</name>
    <value>raviprak/localhost.localdomain@localdomain</value>
  </property>
</configuration>

On Tue, Aug 30, 2011 at 8:08 PM, Thomas Weise t...@yahoo-inc.com wrote: I'm configuring a local hadoop cluster in secure mode for development/experimental purposes on Ubuntu 11.04 with the hadoop-0.20.203.0
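A sanity check worth doing before starting the daemons (a suggestion, not one of Ravi's steps; it uses the principal and keytab from his configuration): confirm the keytab actually authenticates.

$ kinit -k -t /home/raviprak/raviprak.keytab raviprak/localhost.localdomain@localdomain
$ klist

If kinit fails here, the secure namenode and datanode will not come up either.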
Re: How big data and/or how many machines do I need to take advantage of Hadoop?
Brian, this particular task takes its time in computation, on the order of minutes.

T. Kuro Kurosaka
tutorial on Hadoop/Hbase utility classes
Here is a tutorial on some handy Hadoop classes, with sample source code: http://sujee.net/tech/articles/hadoop-useful-classes/ I would appreciate any feedback / suggestions. Thanks all, Sujee Maniyam http://sujee.net
Re: tutorial on Hadoop/Hbase utility classes
Thank you, Sujee. StringUtils is useful, but so is Guava.

Mark
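To illustrate Mark's point, a quick sketch of the Guava equivalents for common string chores (assumes Guava on the classpath; the class name is illustrative):

import com.google.common.base.Joiner;
import com.google.common.base.Splitter;

public class GuavaStrings {
  public static void main(String[] args) {
    // Join with a separator, skipping nulls: one line instead of a loop.
    String csv = Joiner.on(",").skipNulls().join("a", null, "b", "c"); // "a,b,c"
    System.out.println(csv);

    // Split, trimming whitespace and dropping empty pieces.
    for (String part : Splitter.on(',').trimResults().omitEmptyStrings().split(" a, ,b ,c ")) {
      System.out.println(part);
    }
  }
}

Hadoop's org.apache.hadoop.util.StringUtils covers some of the same ground (plus byte-count and elapsed-time formatting), so the two complement each other.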