Run Hadoop Code itself on eclipse

2011-08-31 Thread arun k
Hi!

I wanted to build and run the Hadoop code itself from Eclipse.
I followed the instructions in
http://wiki.apache.org/hadoop/EclipseEnvironment as follows:
 1) I checked out the Hadoop release from
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.21.0/ using
Eclipse Europa.
 2) I got the common, hdfs and mapreduce folders under the project's
main directory.
 3) Since I am behind a proxy, I added the proxy settings to the build.xml
files in these three folders and ran the Ant builds in Eclipse, which succeeded.
 4) Some additional steps were mentioned for running tests in Eclipse:
selecting the build dir (I selected $Workspace/$project/build) and adding
tools.jar.

 When I try to run it as a Java Application, I get the error:
Illegal option: a
Usage: jar {ctxui}[vfm0Me] [jar-file] [manifest-file] [entry-point] [-C dir]
files ...

I don't know how to proceed from here. Any help?

Thanks,
Arun


How big data and/or how many machines do I need to take advantage of Hadoop?

2011-08-31 Thread Teruhiko Kurosaka
Hadoop newbie here.

I wrapped my company's entity extraction product in a Hadoop task
and gave it a large file on the order of 100MB.
I have 4 VMs running on a 24-core CPU server, and made two of
them the slave nodes, one the namenode, and the other the job tracker.
It turned out that processing the same data takes longer
using Hadoop than processing it serially.

I am curious how I can experience the advantage of
Hadoop.  Is having many physical machines essential?
Would I need to process terabytes of data? What would be
the minimum setup where I can experience the advantage
of Hadoop?

T. Kuro Kurosaka



Re: How big data and/or how many machines do I need to take advantage of Hadoop?

2011-08-31 Thread Brian Bockelman
Hi Kuro,

A 100MB file should take 1 second to read; typically, MR jobs get scheduled on 
the order of seconds.  So, it's unlikely you'll see any benefit.

You'll probably want to have a look at Amdahl's law:

http://en.wikipedia.org/wiki/Amdahl%27s_law
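As a rough illustration (the numbers here are made up, not measurements from
your setup): suppose a job carries about 20 seconds of fixed, serial overhead
for scheduling and task startup, and the extraction itself is 60 seconds of
perfectly parallelizable work, so the parallel fraction is p = 60/80 = 0.75.
Amdahl's law,

    speedup(N) = 1 / ((1 - p) + p/N)

then caps the speedup at 1/(1 - p) = 4 no matter how many task slots N you
add, and with N = 2 the best case is 20 + 30 = 50 seconds against 80 seconds
serial, a speedup of only 1.6. When the parallelizable part is a single second
of reading, the overhead dominates and the Hadoop run comes out slower than
the serial one.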

Brian

On Aug 31, 2011, at 3:48 AM, Teruhiko Kurosaka wrote:

 Hadoop newbie here.
 
 I wrapped my company's entity extraction product in a Hadoop task
 and gave it a large file on the order of 100MB.
 I have 4 VMs running on a 24-core CPU server, and made two of
 them the slave nodes, one the namenode, and the other the job tracker.
 It turned out that processing the same data takes longer
 using Hadoop than processing it serially.
 
 I am curious how I can experience the advantage of
 Hadoop.  Is having many physical machines essential?
 Would I need to process terabytes of data? What would be
 the minimum setup where I can experience the advantage
 of Hadoop?
 
 T. Kuro Kurosaka





Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Robert Evans
Dmitry,

It sounds like an interesting idea, but I have not really heard of anyone doing 
it before.  It would make for a good feature to have tiered file systems all 
mapped into the same namespace, but that would be a lot of work and complexity.

The quick solution would be to know what data you want to process beforehand
and then run distcp to copy it from S3 into HDFS before launching the other 
map/reduce jobs.  I don't think there is anything automatic out there.
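
For example, a minimal sketch of that workflow (the bucket, paths, jar and
class names below are placeholders, and the s3n:// scheme assumes your AWS
keys are set via fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey in
core-site.xml):

  # copy the input from S3 into HDFS first (all names are placeholders)
  hadoop distcp s3n://my-bucket/input-data /user/dmitry/input

  # then run the actual jobs against the local HDFS copy
  hadoop jar my-analysis.jar com.example.MyJob /user/dmitry/input /user/dmitry/output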

--Bobby Evans

On 8/29/11 4:56 PM, Dmitry Pushkarev u...@stanford.edu wrote:

Dear hadoop users,

Sorry for the off-topic question. We're slowly migrating our Hadoop cluster to EC2,
and one thing that I'm trying to explore is whether we can use alternative
scheduling systems like SGE with a shared FS for non-data-intensive tasks,
since they are easier for lay users to work with.

One problem for now is how to create a shared cluster filesystem similar to
HDFS: distributed, high-performance, and somewhat POSIX-compliant (symlinks
and permissions), that will use Amazon EC2 local non-persistent storage.

The idea is to keep the original data on S3, then as needed fire up a bunch of
nodes, start the shared filesystem, and quickly copy data from S3 to that FS,
run the analysis with SGE, save the results and shut down that filesystem.
I tried things like S3FS and similar native S3 implementations but the speed is
too bad. Currently I just have a FS on my master node that is shared via NFS
to all the rest, but I pretty much saturate the 1GB bandwidth as soon as I start
more than 10 nodes.

Thank you. I'd appreciate any suggestions and links to relevant resources!


Dmitry



Binary content

2011-08-31 Thread Mohit Anchlia
Does MapReduce work well with binary content in files? The
binary content is basically some CAD files, and the MapReduce program needs
to read these files using a proprietary tool, extract values, and do
some processing. I'm wondering if there are others doing similar types of
processing. Best practices, etc.?


Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Tom White
You might consider Apache Whirr (http://whirr.apache.org/) for
bringing up Hadoop clusters on EC2.
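
A minimal sketch of what that looks like (the property names follow the Whirr
recipes of that era; the cluster name and AWS credentials are placeholders to
fill in):

  # hadoop.properties -- a minimal Whirr recipe (placeholder values)
  whirr.cluster-name=ec2-hadoop-test
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=YOUR_AWS_ACCESS_KEY
  whirr.credential=YOUR_AWS_SECRET_KEY

  # then, from the Whirr install directory:
  whirr launch-cluster --config hadoop.properties
  whirr destroy-cluster --config hadoop.properties   # tear the cluster down when finished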

Cheers,
Tom

On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans ev...@yahoo-inc.com wrote:
 Dmitry,

 It sounds like an interesting idea, but I have not really heard of anyone 
 doing it before.  It would make for a good feature to have tiered file 
 systems all mapped into the same namespace, but that would be a lot of work 
 and complexity.

 The quick solution would be to know what data you want to process beforehand
 and then run distcp to copy it from S3 into HDFS before launching the other 
 map/reduce jobs.  I don't think there is anything automatic out there.

 --Bobby Evans

 On 8/29/11 4:56 PM, Dmitry Pushkarev u...@stanford.edu wrote:

 Dear hadoop users,

 Sorry for the off-topic question. We're slowly migrating our Hadoop cluster to EC2,
 and one thing that I'm trying to explore is whether we can use alternative
 scheduling systems like SGE with a shared FS for non-data-intensive tasks,
 since they are easier for lay users to work with.

 One problem for now is how to create a shared cluster filesystem similar to
 HDFS: distributed, high-performance, and somewhat POSIX-compliant (symlinks
 and permissions), that will use Amazon EC2 local non-persistent storage.

 The idea is to keep the original data on S3, then as needed fire up a bunch of
 nodes, start the shared filesystem, and quickly copy data from S3 to that FS,
 run the analysis with SGE, save the results and shut down that filesystem.
 I tried things like S3FS and similar native S3 implementations but the speed is
 too bad. Currently I just have a FS on my master node that is shared via NFS
 to all the rest, but I pretty much saturate the 1GB bandwidth as soon as I start
 more than 10 nodes.

 Thank you. I'd appreciate any suggestions and links to relevant resources!


 Dmitry




Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Dmitry Pushkarev
Thank you for your suggestion. I have years of experience with HDFS, and if I
could I'd gladly use it as the filesystem; however, our developers require fast
random access and symlinks, so I was wondering if there are other options.
I'm not too sensitive to data locality, but we need to eliminate
network bottlenecks in the FS.  Has anyone tried filesystems like
GlusterFS, GFS, and so on?

Thanks!

On Wed, Aug 31, 2011 at 8:50 AM, Tom White t...@cloudera.com wrote:

 You might consider Apache Whirr (http://whirr.apache.org/) for
 bringing up Hadoop clusters on EC2.

 Cheers,
 Tom

 On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans ev...@yahoo-inc.com wrote:
  Dmitry,
 
  It sounds like an interesting idea, but I have not really heard of anyone
 doing it before.  It would make for a good feature to have tiered file
 systems all mapped into the same namespace, but that would be a lot of work
 and complexity.
 
  The quick solution would be to know what data you want to process beforehand
  and then run distcp to copy it from S3 into HDFS before launching the
  other map/reduce jobs.  I don't think there is anything automatic out there.
 
  --Bobby Evans
 
  On 8/29/11 4:56 PM, Dmitry Pushkarev u...@stanford.edu wrote:
 
  Dear hadoop users,
 
   Sorry for the off-topic question. We're slowly migrating our Hadoop cluster
  to EC2, and one thing that I'm trying to explore is whether we can use
  alternative scheduling systems like SGE with a shared FS for
  non-data-intensive tasks, since they are easier for lay users to work with.
  
   One problem for now is how to create a shared cluster filesystem similar to
   HDFS: distributed, high-performance, and somewhat POSIX-compliant (symlinks
   and permissions), that will use Amazon EC2 local non-persistent storage.
  
   The idea is to keep the original data on S3, then as needed fire up a bunch of
   nodes, start the shared filesystem, and quickly copy data from S3 to that FS,
   run the analysis with SGE, save the results and shut down that filesystem.
   I tried things like S3FS and similar native S3 implementations but the speed
   is too bad. Currently I just have a FS on my master node that is shared via
   NFS to all the rest, but I pretty much saturate the 1GB bandwidth as soon as
   I start more than 10 nodes.
  
   Thank you. I'd appreciate any suggestions and links to relevant resources!
 
 
  Dmitry
 
 



Re: Starting datanode in secure mode

2011-08-31 Thread Thomas Weise
Thanks Ravi. This has brought my local hadoop cluster to life!

The two things I was missing:

1) Have to use privileged ports

<!-- secure setup requires privileged ports -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>

2) Implied by 1): sudo is required to launch the datanode
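
A minimal sketch of that launch sequence (it mirrors Ravi's step 11 below; the
user name is a placeholder for whatever unprivileged account the datanode
should run as):

  sudo bash                              # root is needed to bind ports 1004/1006
  export HADOOP_SECURE_DN_USER=hadoop    # placeholder: the unprivileged user the DN drops to
  hadoop-daemon.sh start datanode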

Clearly, this is geared towards production systems. For development, having
the ability to run with Kerberos but without the need for privileged resources
would be desirable.


On Aug 30, 2011, at 9:00 PM, Ravi Prakash wrote:

 In short, you MUST use privileged resources.
 
 In long:
 
 Here's what I did to set up a secure single-node cluster. I'm sure there are
 other ways, but here's how I did it.
 
   1. Install krb5-server.
   2. Set up the Kerberos configuration (files attached):
      /var/kerberos/krb5kdc/kdc.conf and /etc/krb5.conf
      http://yahoo.github.com/hadoop-common/installing.html
   3. To clean up everything:
      http://mailman.mit.edu/pipermail/kerberos/2003-June/003312.html
   4. Create the Kerberos database: $ sudo kdb5_util create -s
   5. Start Kerberos: $ sudo /etc/rc.d/init.d/kadmin start and
      $ sudo /etc/rc.d/init.d/krb5kdc start
   6. Create the principal raviprak/localhost.localdomain@localdomain
      http://web.mit.edu/kerberos/krb5-1.5/krb5-1.5.4/doc/krb5-admin/Adding-or-Modifying-Principals.html
   7. Create the keytab file using "xst -k /home/raviprak/raviprak.keytab
      raviprak/localhost.localdomain@localdomain" (a sketch of the kadmin
      session follows these steps).
   8. Set up hdfs-site.xml and core-site.xml (files attached).
   9. sudo hostname localhost.localdomain
  10. hadoop-daemon.sh start namenode
  11. sudo bash, then export HADOOP_SECURE_DN_USER=raviprak, then
      hadoop-daemon.sh start datanode
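
 For steps 6 and 7, a minimal kadmin session looks roughly like this (prompts
 and output trimmed; addprinc will ask you to choose a password for the new
 principal):

   $ sudo kadmin.local
   kadmin.local:  addprinc raviprak/localhost.localdomain@localdomain
   kadmin.local:  xst -k /home/raviprak/raviprak.keytab raviprak/localhost.localdomain@localdomain
   kadmin.local:  quit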
 
 
 
 CORE-SITE.XML
 
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9001</value>
   </property>
   <property>
     <name>hadoop.security.authorization</name>
     <value>true</value>
   </property>
   <property>
     <name>hadoop.security.authentication</name>
     <value>kerberos</value>
   </property>
   <property>
     <name>dfs.namenode.kerberos.principal</name>
     <value>raviprak/localhost.localdomain</value>
   </property>
   <property>
     <name>dfs.datanode.kerberos.principal</name>
     <value>raviprak/localhost.localdomain</value>
   </property>
   <property>
     <name>dfs.secondary.namenode.kerberos.principal</name>
     <value>raviprak/localhost.localdomain</value>
   </property>
 </configuration>
 
 =
 
 
 
 HDFS-SITE.XML
 =
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
 
   <property>
     <name>dfs.name.dir.restore</name>
     <value>false</value>
   </property>
 
   <property>
     <name>dfs.namenode.checkpoint.period</name>
     <value>10</value>
   </property>
 
   <property>
     <name>dfs.namenode.keytab.file</name>
     <value>/home/raviprak/raviprak.keytab</value>
   </property>
 
   <property>
     <name>dfs.secondary.namenode.keytab.file</name>
     <value>/home/raviprak/raviprak.keytab</value>
   </property>
 
   <property>
     <name>dfs.datanode.keytab.file</name>
     <value>/home/raviprak/raviprak.keytab</value>
   </property>
 
   <property>
     <name>dfs.datanode.address</name>
     <value>0.0.0.0:1004</value>
   </property>
 
   <property>
     <name>dfs.datanode.http.address</name>
     <value>0.0.0.0:1006</value>
   </property>
 
   <property>
     <name>dfs.namenode.kerberos.principal</name>
     <value>raviprak/localhost.localdomain@localdomain</value>
   </property>
 
   <property>
     <name>dfs.secondary.namenode.kerberos.principal</name>
     <value>raviprak/localhost.localdomain@localdomain</value>
   </property>
 
   <property>
     <name>dfs.datanode.kerberos.principal</name>
     <value>raviprak/localhost.localdomain@localdomain</value>
   </property>
 
   <property>
     <name>dfs.namenode.kerberos.https.principal</name>
     <value>raviprak/localhost.localdomain@localdomain</value>
   </property>
 
   <property>
     <name>dfs.secondary.namenode.kerberos.https.principal</name>
     <value>raviprak/localhost.localdomain@localdomain</value>
   </property>
 
   <property>
     <name>dfs.datanode.kerberos.https.principal</name>
     <value>raviprak/localhost.localdomain@localdomain</value>
   </property>
 </configuration>
 =
 
 
 On Tue, Aug 30, 2011 at 8:08 PM, Thomas Weise t...@yahoo-inc.com wrote:
 
 I'm configuring a local hadoop cluster in secure mode for
 development/experimental purposes on Ubuntu 11.04 with the hadoop-0.20.203.0
 

Re: How big data and/or how many machines do I need to take advantage of Hadoop?

2011-08-31 Thread Teruhiko Kurosaka
Brian,
This particular task spends its time in computation, on the order of minutes.

T. Kuro Kurosaka
From: Brian Bockelman bbock...@cse.unl.edu
Reply-To: common-user@hadoop.apache.org
Date: Wed, 31 Aug 2011 08:04:07 -0400
To: common-user@hadoop.apache.org
Subject: Re: How big data and/or how many machines do I need to take advantage
of Hadoop?

Hi Kuro,

A 100MB file should take 1 second to read; typically, MR jobs get scheduled on 
the order of seconds.  So, it's unlikely you'll see any benefit.

You'll probably want to have a look at Amdahl's law:

http://en.wikipedia.org/wiki/Amdahl%27s_law

Brian

On Aug 31, 2011, at 3:48 AM, Teruhiko Kurosaka wrote:

Hadoop newbie here.

I wrapped my company's entity extraction product in a Hadoop task
and gave it a large file on the order of 100MB.
I have 4 VMs running on a 24-core CPU server, and made two of
them the slave nodes, one the namenode, and the other the job tracker.
It turned out that processing the same data takes longer
using Hadoop than processing it serially.

I am curious how I can experience the advantage of
Hadoop.  Is having many physical machines essential?
Would I need to process terabytes of data? What would be
the minimum setup where I can experience the advantage
of Hadoop?

T. Kuro Kurosaka



tutorial on Hadoop/Hbase utility classes

2011-08-31 Thread Sujee Maniyam
Here is a tutorial on some handy Hadoop classes - with sample source code.

http://sujee.net/tech/articles/hadoop-useful-classes/

Would appreciate any feedback / suggestions.

thanks  all
Sujee Maniyam
http://sujee.net


Re: tutorial on Hadoop/Hbase utility classes

2011-08-31 Thread Mark Kerzner
Thank you, Sujee.

StringUtils is useful, but so is Guava.

Mark

On Wed, Aug 31, 2011 at 6:57 PM, Sujee Maniyam su...@sujee.net wrote:

 Here is a tutorial on some handy Hadoop classes - with sample source code.

 http://sujee.net/tech/articles/hadoop-useful-classes/

 Would appreciate any feedback / suggestions.

 thanks  all
 Sujee Maniyam
 http://sujee.net