Re: Multiple cores vs multiple nodes

2012-07-02 Thread Michael Segel
Hi,
First, you have to explain what you mean by 'equivalent'.


The short answer is that it depends.
The longer answer is that you have to consider cost in your design. 

The whole issue of design is to maintain the correct ratio of cores to memory 
and cores to spindles while optimizing the box within the cost, space and 
hardware (box configurations) limitations. 

Note that you can sacrifice some of the ratio; however, you will leave some of 
the performance on the table. 


On Jul 1, 2012, at 6:13 AM, Safdar Kureishy wrote:

 Hi,
 
 I have a reasonably simple question that I thought I'd post to this list
 because I don't have enough experience with hardware to figure this out
 myself.
 
 Let's assume that I have 2 separate cluster setups for slave nodes. The
 master node is a separate machine *outside* these clusters:
 *Setup A*: 28 nodes, each with a 2-core CPU, 8 GB RAM and 1 SATA drive (1 TB)
 *Setup B*: 7 nodes, each with an 8-core CPU, 32 GB RAM and 4 SATA drives (1
 TB each)
 
 Note that I have maintained the same *core:memory:spindle* ratio above. In
 essence, setup B has the same overall processing + memory + spindle
 capacity, but achieved with 4 times fewer nodes.
 
 Ignoring the *cost* of each node above, and assuming 10Gb Ethernet
 connectivity and the same speed-per-core across nodes in both the scenarios
 above, are Setup A and Setup B equivalent to each other in the context of
 setting up a Hadoop cluster? Or will the relative performance be different?
 Excluding the network connectivity between the nodes, what would be some
 other criteria that might give one setup an edge over the other, for
 regular Hadoop jobs?
 
 Also, assuming the same type of Hadoop jobs on both clusters, how different
 would the load experienced by the master node be for each setup above?
 
 Thanks in advance,
 Safdar



Re: hadoop security API (repost)

2012-07-02 Thread Ivan Frain
Hi Tony,

I am currently working on this to access HDFS securely and programmatically.
What I have found so far may help, even if I am not 100% sure it is the
right way to proceed.

If you have already obtained a TGT from the kinit command, the hadoop library
will locate it automatically if the ticket cache is in the default
location. On Linux that is /tmp/krb5cc_<uid-number>.

For example, with my linux user 'hdfs', I get a TGT for the hadoop user 'ivan',
meaning you can impersonate ivan from the hdfs linux user:
--
hdfs@mitkdc:~$ klist
Ticket cache: FILE:/tmp/krb5cc_10003
Default principal: i...@hadoop.lan

Valid starting       Expires              Service principal
02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan
renew until 03/07/2012 13:59
---

Then you just have to set the right security options in your hadoop client
in java, and the identity will be i...@hadoop.lan in our example. In my
tests I only use HDFS; here is a snippet of code to access a
secure hdfs cluster assuming the previous TGT (ivan's identity):


 // imports used by this snippet (class locations as of the 0.23 line)
 import java.net.URI
 import org.apache.hadoop.fs.{CommonConfigurationKeysPublic, FileSystem}
 import org.apache.hadoop.hdfs.{DFSConfigKeys, HdfsConfiguration}
 import org.apache.hadoop.security.UserGroupInformation

 val conf: HdfsConfiguration = new HdfsConfiguration()
 conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")
 conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true")
 // serverPrincipal holds the NameNode's Kerberos principal
 conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

 UserGroupInformation.setConfiguration(conf)

 val fs = FileSystem.get(new URI(hdfsUri), conf)

This 'fs' is then a handle for accessing hdfs securely as user 'ivan', even though
ivan does not appear anywhere in the hadoop client code.

Anyway, I also see two other options (a rough sketch of the second follows below):
  * Setting the KRB5CCNAME environment variable to point to the right
ticket cache file
  * Specifying the keytab file you want to use via the
UserGroupInformation singleton API:
UserGroupInformation.loginUserFromKeytab(user, keytabFile)
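
To illustrate the keytab option, here is a minimal sketch along the same lines as
the snippet above (the principal name and keytab path are only placeholders):

 // log in from a keytab instead of relying on a kinit ticket cache
 UserGroupInformation.setConfiguration(conf)
 UserGroupInformation.loginUserFromKeytab("someuser@EXAMPLE.COM",
   "/etc/security/keytabs/someuser.keytab")
 val fs = FileSystem.get(new URI(hdfsUri), conf)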

If you want to understand the auth process and the different options to
log in, I guess you need to have a look at the UserGroupInformation.java
source code (release 0.23.1 link: http://bit.ly/NVzBKL). The private class
HadoopConfiguration (line 347) is of major interest in our case.

Another point is that I did not find any easy way to prompt the user for a
password at runtime using the current hadoop API. It appears to be somewhat
hardcoded in the UserGroupInformation singleton. I guess it would be nice
to have a new method that hands UserGroupInformation an already-authenticated
'Subject', which could override all the default configurations. If someone has
better ideas, it would be nice to discuss them as well.


BR,
Ivan

2012/7/1 Tony Dean tony.d...@sas.com

 Hi,

 The security documentation specifies how to test a secure cluster by using
 kinit and thus adding the Kerberos principal's TGT to the ticket cache, which
 the hadoop client code uses to acquire service tickets for use in the
 cluster.
 What if I created an application that used the hadoop API to communicate with
 hdfs and/or mapred protocols: is there a programmatic way to tell hadoop to
 use a particular Kerberos principal name with a keytab that contains its
 password key?  I didn't see a way to integrate with the JAAS KrbLoginModule.
 I was thinking that if I could inject a callbackHandler, I could pass the
 principal name and the KrbLoginModule already has options to specify
 keytab.
 Is this something that is possible?  Or is this just not the right way to
 do things?

 I read about impersonation where authentication is performed with a system
 user such
 as oozie and then it just impersonates other users so that permissions
 are based on
 the impersonated user instead of the system user.

 Please help me understand my options for executing hadoop tasks in a
 multi-tenant application.

 Thank you!





-- 
Ivan Frain
11, route de Grenade
31530 Saint-Paul-sur-Save
mobile: +33 (0)6 52 52 47 07


RE: hadoop security API (repost)

2012-07-02 Thread Tony Dean
Yes, but this will not work in a multi-tenant environment.  I need to be able 
to create a Kerberos TGT per execution thread.

I was hoping through JAAS that I could inject the name of the current principal 
and authenticate against it.  I'm sure there is a best practice for 
hadoop/hbase client API authentication, just not sure what it is.

Thank you for your comment.  The solution may well be associated with the 
UserGroupInformation class.  Hopefully, other ideas will come from this thread.

Thanks.

-Tony

-Original Message-
From: Ivan Frain [mailto:ivan.fr...@gmail.com] 
Sent: Monday, July 02, 2012 8:14 AM
To: common-user@hadoop.apache.org
Subject: Re: hadoop security API (repost)

Hi Tony,

I am currently working on this to access HDFS securely and programmaticaly.
What I have found so far may help even if I am not 100% sure this is the right 
way to proceed.

If you have already obtained a TGT from the kinit command, hadoop library will 
locate it automatically if the name of the ticket cache corresponds to 
default location. On Linux it is located /tmp/krb5cc_uid-number.

For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
meaning you can impersonate ivan from hdfs linux user:
--
hdfs@mitkdc:~$ klist
Ticket cache: FILE:/tmp/krb5cc_10003
Default principal: i...@hadoop.lan

Valid startingExpires   Service principal
02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew until 
03/07/2012 13:59
---

Then, you just have to set the right security options in your hadoop client in 
java and the identity will be i...@hadoop.lan for our example. In my tests, I 
only use HDFS and here a snippet of code to have access to a secure hdfs 
cluster assuming the previous TGT (ivan's impersonation):


 val conf: HdfsConfiguration = new HdfsConfiguration()
 conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")
 conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true")
 conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

 UserGroupInformation.setConfiguration(conf)

 val fs = FileSystem.get(new URI(hdfsUri), conf)


Using this 'fs' is a handler to access hdfs securely as user 'ivan' even if 
ivan does not appear in the hadoop client code.

Anyway, I also see two other options:
  * Setting the KRB5CCNAME environment variable to point to the right 
ticketCache file
  * Specifying the keytab file you want to use from the UserGroupInformation 
singleton API:
UserGroupInformation.loginUserFromKeytab(user, keytabFile)

If you want to understand the auth process and the different options to login, 
I guess you need to have a look to the UserGroupInformation.java source code 
(release 0.23.1 link: http://bit.ly/NVzBKL). The private class 
HadoopConfiguration line 347 is of major interest in our case.

Another point is that I did not find any easy way to prompt the user for a 
password at runtim using the actual hadoop API. It appears to be somehow 
hardcoded in the UserGroupInformation singleton. I guess it could be nice to 
have a new function to give to the UserGroupInformation an authenticated 
'Subject' which could override all default configurations. If someone have 
better ideas it could be nice to discuss on it as well.


BR,
Ivan

2012/7/1 Tony Dean tony.d...@sas.com

 Hi,

 The security documentation specifies how to test a secure cluster by 
 using kinit and thus adding the Kerberos principal TGT to the ticket 
 cache in which the hadoop client code uses to acquire service tickets 
 for use in the cluster.
 What if I created an application that used the hadoop API to 
 communicate with hdfs and/or mapred protocols, is there a programmatic 
 way to inform hadoop to use a particular Kerberos principal name with 
 a keytab that contains its password key?  I didn't see a way to 
 integrate with JAAS KrbLoginModule.
 I was thinking that if I could inject a callbackHandler, I could pass 
 the principal name and the KrbLoginModule already has options to 
 specify keytab.
 Is this something that is possible?  Or is this just not the right way 
 to do things?

 I read about impersonation where authentication is performed with a 
 system user such as oozie and then it just impersonates other users 
 so that permissions are based on the impersonated user instead of the 
 system user.

 Please help me understand my options for executing hadoop tasks in a 
 multi-tenant application.

 Thank you!





--
Ivan Frain
11, route de Grenade
31530 Saint-Paul-sur-Save
mobile: +33 (0)6 52 52 47 07



Re: hadoop security API (repost)

2012-07-02 Thread Alejandro Abdelnur
Tony,

If you are doing a server app that interacts with the cluster on
behalf of different users (like Oozie, as you mentioned in your
email), then you should use the proxyuser capabilities of Hadoop.

* Configure user MYSERVERUSER as a proxyuser in Hadoop's core-site.xml
(this requires 2 property settings, HOSTS and GROUPS).
* Run your server app as MYSERVERUSER and have a Kerberos principal
MYSERVERUSER/MYSERVERHOST
* Initialize your server app by loading the MYSERVERUSER/MYSERVERHOST keytab
* Use UGI.doAs() to create JobClient/FileSystem instances as the user
you want to act on behalf of (a rough sketch follows below)
* Keep in mind that all the users you act on behalf of must be
valid Unix users in the cluster
* If those users need direct access to the cluster, they'll also have to
be defined in the KDC user database.
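
As a minimal sketch of the doAs() part (user names, principal and paths below are
placeholders, and the client Configuration is assumed to already be set up for
Kerberos as discussed earlier in this thread):

 // core-site.xml on the cluster would carry, for example:
 //   hadoop.proxyuser.myserveruser.hosts  = host(s) the server app runs on
 //   hadoop.proxyuser.myserveruser.groups = group(s) of the proxied users
 import java.security.PrivilegedExceptionAction
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.FileSystem
 import org.apache.hadoop.security.UserGroupInformation

 // the server app authenticates once as its own principal
 UserGroupInformation.loginUserFromKeytab("myserveruser/myserverhost@EXAMPLE.COM",
   "/etc/security/keytabs/myserveruser.keytab")

 // then acts on behalf of an end user without holding that user's credentials
 val proxyUgi = UserGroupInformation.createProxyUser("enduser",
   UserGroupInformation.getLoginUser)
 val fs = proxyUgi.doAs(new PrivilegedExceptionAction[FileSystem] {
   override def run(): FileSystem = FileSystem.get(new Configuration())
 })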

Hope this helps.

Thx

On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
 Yes, but this will not work in a multi-tenant environment.  I need to be able 
 to create a Kerberos TGT per execution thread.

 I was hoping through JAAS that I could inject the name of the current 
 principal and authenticate against it.  I'm sure there is a best practice for 
 hadoop/hbase client API authentication, just not sure what it is.

 Thank you for your comment.  The solution may well be associated with the 
 UserGroupInformation class.  Hopefully, other ideas will come from this 
 thread.

 Thanks.

 -Tony

 -Original Message-
 From: Ivan Frain [mailto:ivan.fr...@gmail.com]
 Sent: Monday, July 02, 2012 8:14 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Hi Tony,

 I am currently working on this to access HDFS securely and programmaticaly.
 What I have found so far may help even if I am not 100% sure this is the 
 right way to proceed.

 If you have already obtained a TGT from the kinit command, hadoop library 
 will locate it automatically if the name of the ticket cache corresponds to 
 default location. On Linux it is located /tmp/krb5cc_uid-number.

 For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
 meaning you can impersonate ivan from hdfs linux user:
 --
 hdfs@mitkdc:~$ klist
 Ticket cache: FILE:/tmp/krb5cc_10003
 Default principal: i...@hadoop.lan

 Valid startingExpires   Service principal
 02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew until 
 03/07/2012 13:59
 ---

 Then, you just have to set the right security options in your hadoop client 
 in java and the identity will be i...@hadoop.lan for our example. In my 
 tests, I only use HDFS and here a snippet of code to have access to a secure 
 hdfs cluster assuming the previous TGT (ivan's impersonation):

 
  val conf: HdfsConfiguration = new HdfsConfiguration()
  conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")
  conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true")
  conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

  UserGroupInformation.setConfiguration(conf)

  val fs = FileSystem.get(new URI(hdfsUri), conf)
 

 Using this 'fs' is a handler to access hdfs securely as user 'ivan' even if 
 ivan does not appear in the hadoop client code.

 Anyway, I also see two other options:
   * Setting the KRB5CCNAME environment variable to point to the right 
 ticketCache file
   * Specifying the keytab file you want to use from the UserGroupInformation 
 singleton API:
 UserGroupInformation.loginUserFromKeytab(user, keytabFile)

 If you want to understand the auth process and the different options to 
 login, I guess you need to have a look to the UserGroupInformation.java 
 source code (release 0.23.1 link: http://bit.ly/NVzBKL). The private class 
 HadoopConfiguration line 347 is of major interest in our case.

 Another point is that I did not find any easy way to prompt the user for a 
 password at runtim using the actual hadoop API. It appears to be somehow 
 hardcoded in the UserGroupInformation singleton. I guess it could be nice to 
 have a new function to give to the UserGroupInformation an authenticated 
 'Subject' which could override all default configurations. If someone have 
 better ideas it could be nice to discuss on it as well.


 BR,
 Ivan

 2012/7/1 Tony Dean tony.d...@sas.com

 Hi,

 The security documentation specifies how to test a secure cluster by
 using kinit and thus adding the Kerberos principal TGT to the ticket
 cache in which the hadoop client code uses to acquire service tickets
 for use in the cluster.
 What if I created an application that used the hadoop API to
 communicate with hdfs and/or mapred protocols, is there a programmatic
 way to inform hadoop to use a particular Kerberos principal name with
 a keytab that contains its password 

RE: hadoop security API (repost)

2012-07-02 Thread Tony Dean
Alejandro,

Thanks for the reply.  My intent is to be able to scan/get/put hbase 
tables under a specified identity as well.  What options do I have to perform 
the same multi-tenant authorization for these operations?  I have posted this 
to the hbase users distribution list as well, but thought you might have insight.  
Since hbase security authentication is so dependent upon hadoop, it would be 
nice if your suggestion worked for hbase as well.

Getting back to your suggestion... when configuring 
hadoop.proxyuser.myserveruser.hosts, should the value be host1, where I'm making 
the ugi.doAs() privileged call, or host2, the hadoop namenode?

Also, as another option, is there not a way for an application to pass 
hadoop/hbase authentication the name of a Kerberos principal to use?  In this 
case there would be no proxy; it would just execute as the designated user.

Thanks.

-Tony

-Original Message-
From: Alejandro Abdelnur [mailto:t...@cloudera.com] 
Sent: Monday, July 02, 2012 11:40 AM
To: common-user@hadoop.apache.org
Subject: Re: hadoop security API (repost)

Tony,

If you are doing a server app that interacts with the cluster on behalf of 
different users (like Ooize, as you mentioned in your email), then you should 
use the proxyuser capabilities of Hadoop.

* Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this 
requires 2 properties settings, HOSTS and GROUPS).
* Run your server app as MYSERVERUSER and have a Kerberos principal 
MYSERVERUSER/MYSERVERHOST
* Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab
* Use the UGI.doAs() to create JobClient/Filesystem instances using the user 
you want to do something on behalf
* Keep in mind that all the users you need to do something on behalf should be 
valid Unix users in the cluster
* If those users need direct access to the cluster, they'll have to be also 
defined in in the KDC user database.

Hope this helps.

Thx

On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
 Yes, but this will not work in a multi-tenant environment.  I need to be able 
 to create a Kerberos TGT per execution thread.

 I was hoping through JAAS that I could inject the name of the current 
 principal and authenticate against it.  I'm sure there is a best practice for 
 hadoop/hbase client API authentication, just not sure what it is.

 Thank you for your comment.  The solution may well be associated with the 
 UserGroupInformation class.  Hopefully, other ideas will come from this 
 thread.

 Thanks.

 -Tony

 -Original Message-
 From: Ivan Frain [mailto:ivan.fr...@gmail.com]
 Sent: Monday, July 02, 2012 8:14 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Hi Tony,

 I am currently working on this to access HDFS securely and programmaticaly.
 What I have found so far may help even if I am not 100% sure this is the 
 right way to proceed.

 If you have already obtained a TGT from the kinit command, hadoop library 
 will locate it automatically if the name of the ticket cache corresponds to 
 default location. On Linux it is located /tmp/krb5cc_uid-number.

 For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
 meaning you can impersonate ivan from hdfs linux user:
 --
 hdfs@mitkdc:~$ klist
 Ticket cache: FILE:/tmp/krb5cc_10003
 Default principal: i...@hadoop.lan

 Valid startingExpires   Service principal
 02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew 
 until 03/07/2012 13:59
 ---

 Then, you just have to set the right security options in your hadoop client 
 in java and the identity will be i...@hadoop.lan for our example. In my 
 tests, I only use HDFS and here a snippet of code to have access to a secure 
 hdfs cluster assuming the previous TGT (ivan's impersonation):

 
  val conf: HdfsConfiguration = new HdfsConfiguration()
  
 conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")
 conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true")
 conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

  UserGroupInformation.setConfiguration(conf)

  val fs = FileSystem.get(new URI(hdfsUri), conf)
 

 Using this 'fs' is a handler to access hdfs securely as user 'ivan' even if 
 ivan does not appear in the hadoop client code.

 Anyway, I also see two other options:
   * Setting the KRB5CCNAME environment variable to point to the right 
 ticketCache file
   * Specifying the keytab file you want to use from the UserGroupInformation 
 singleton API:
 UserGroupInformation.loginUserFromKeytab(user, keytabFile)

 If you want to understand the auth process and the different options to 
 login, I guess you need to have a look to the UserGroupInformation.java 
 source code (release 0.23.1 link: 

Re: hadoop security API (repost)

2012-07-02 Thread Alejandro Abdelnur
On Mon, Jul 2, 2012 at 9:15 AM, Tony Dean tony.d...@sas.com wrote:
 Alejandro,

 Thanks for the reply.  My intent is to also be able to scan/get/put hbase 
 tables under a specified identity as well.  What options do I have to perform 
 the same multi-tenant  authorization for these operations?  I have posted 
 this to hbase users distribution list as well, but thought you might have 
 insight.  Since hbase security authentication is so dependent upon hadoop, it 
 would be nice if your suggestion worked for hbase as well.

 Getting back to your suggestion... when configuring 
 hadoop.proxyuser.myserveruser.hosts, host1 would be where I'm making the 
 ugi.doAs() privileged call and host2 is the hadoop namenode?


host1 in that case.

 Also, an another option, is there not a way for an application to pass 
 hadoop/hbase authentication the name of a Kerberos principal to use?  In this 
 case, no proxy, just execute as the designated user.

You could do that, but that means your app would have to have keytabs
for all the users it wants to act as. Proxyuser will be much easier to
manage. Maybe look at getting proxyuser support into hbase if it is not
there yet.


 Thanks.

 -Tony

 -Original Message-
 From: Alejandro Abdelnur [mailto:t...@cloudera.com]
 Sent: Monday, July 02, 2012 11:40 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Tony,

 If you are doing a server app that interacts with the cluster on behalf of 
 different users (like Ooize, as you mentioned in your email), then you should 
 use the proxyuser capabilities of Hadoop.

 * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this 
 requires 2 properties settings, HOSTS and GROUPS).
 * Run your server app as MYSERVERUSER and have a Kerberos principal 
 MYSERVERUSER/MYSERVERHOST
 * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab
 * Use the UGI.doAs() to create JobClient/Filesystem instances using the user 
 you want to do something on behalf
 * Keep in mind that all the users you need to do something on behalf should 
 be valid Unix users in the cluster
 * If those users need direct access to the cluster, they'll have to be also 
 defined in in the KDC user database.

 Hope this helps.

 Thx

 On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
 Yes, but this will not work in a multi-tenant environment.  I need to be 
 able to create a Kerberos TGT per execution thread.

 I was hoping through JAAS that I could inject the name of the current 
 principal and authenticate against it.  I'm sure there is a best practice 
 for hadoop/hbase client API authentication, just not sure what it is.

 Thank you for your comment.  The solution may well be associated with the 
 UserGroupInformation class.  Hopefully, other ideas will come from this 
 thread.

 Thanks.

 -Tony

 -Original Message-
 From: Ivan Frain [mailto:ivan.fr...@gmail.com]
 Sent: Monday, July 02, 2012 8:14 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Hi Tony,

 I am currently working on this to access HDFS securely and programmaticaly.
 What I have found so far may help even if I am not 100% sure this is the 
 right way to proceed.

 If you have already obtained a TGT from the kinit command, hadoop library 
 will locate it automatically if the name of the ticket cache corresponds 
 to default location. On Linux it is located /tmp/krb5cc_uid-number.

 For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
 meaning you can impersonate ivan from hdfs linux user:
 --
 hdfs@mitkdc:~$ klist
 Ticket cache: FILE:/tmp/krb5cc_10003
 Default principal: i...@hadoop.lan

 Valid startingExpires   Service principal
 02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew
 until 03/07/2012 13:59
 ---

 Then, you just have to set the right security options in your hadoop client 
 in java and the identity will be i...@hadoop.lan for our example. In my 
 tests, I only use HDFS and here a snippet of code to have access to a secure 
 hdfs cluster assuming the previous TGT (ivan's impersonation):

 
  val conf: HdfsConfiguration = new HdfsConfiguration()

  conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")
  conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true")
  conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

  UserGroupInformation.setConfiguration(conf)

  val fs = FileSystem.get(new URI(hdfsUri), conf)
 

 Using this 'fs' is a handler to access hdfs securely as user 'ivan' even if 
 ivan does not appear in the hadoop client code.

 Anyway, I also see two other options:
   * Setting the KRB5CCNAME environment variable to point to the right 
 ticketCache file
   * Specifying the keytab 

Re: Multiple cores vs multiple nodes

2012-07-02 Thread Matt Foley
This is actually a very complex question.  Without trying to answer
completely, the high points, as I see it, are:
a) [Most important] Different kinds of nodes require different Hadoop
configurations.  In particular, the number of simultaneous tasks per node
should presumably be set higher for a many-core node than for a few-core
node.
b) More nodes (potentially) give you more disk controllers, and more memory
bus bandwidth shared by the disk controllers and RAM and CPUs.
c) More nodes give you (potentially, in a flat network fabric) more network
bandwidth between cores.
d) You can't always assume the cores are equivalent.

Details:

a) If all other issues were indeed equal, you'd configure
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum four times larger on an 8-core
system than a 2-core system.  In the real world, you'll want to experiment
to optimize the settings for the actual hardware and actual job streams
you're running.
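
For illustration only, the corresponding knobs in mapred-site.xml on each slave
(the values here are just examples, not recommendations):

 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>8</value>   <!-- e.g. 8 on an 8-core node vs. 2 on a 2-core node -->
 </property>
 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>4</value>
 </property>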

b) If you're running modern server hardware, you've got DMA disk
controllers and a multi-GByte/sec memory bus, as well as bus controllers
that do a great job of multiplexing all the demands that share the FSB.
 However, as the disk count goes up and the budget goes down, you need to
look at whether you're going to saturate either the controller(s) or the
bus, given the local i/o access patterns of your particular workload.

c) Similarly, given the NIC cards in your servers and your rack/switch
topology, you need to ask whether your network i/o access patterns,
especially during shuffle/sort, will risk saturating your network bandwidth.

d) Make sure that the particular CPUs you are comparing actually have
comparable cores, because there's a world of difference between the
different cores included in dozens of different CPUs available!

Hope this helps.  Cheers,
--Matt

On Sun, Jul 1, 2012 at 4:13 AM, Safdar Kureishy
safdar.kurei...@gmail.comwrote:

 Hi,

 I have a reasonably simple question that I thought I'd post to this list
 because I don't have enough experience with hardware to figure this out
 myself.

 Let's assume that I have 2 separate cluster setups for slave nodes. The
 master node is a separate machine *outside* these clusters:
 *Setup A*: 28 nodes, each with a 2-core CPU, 8 GB RAM and 1 SATA drive (1 TB)
 *Setup B*: 7 nodes, each with an 8-core CPU, 32 GB RAM and 4 SATA drives (1
 TB each)

 Note that I have maintained the same *core:memory:spindle* ratio above. In
 essence, setup B has the same overall processing + memory + spindle
 capacity, but achieved with 4 times fewer nodes.

 Ignoring the *cost* of each node above, and assuming 10Gb Ethernet
 connectivity and the same speed-per-core across nodes in both the scenarios
 above, are Setup A and Setup B equivalent to each other in the context of
 setting up a Hadoop cluster? Or will the relative performance be different?
 Excluding the network connectivity between the nodes, what would be some
 other criteria that might give one setup an edge over the other, for
 regular Hadoop jobs?

 Also, assuming the same type of Hadoop jobs on both clusters, how different
 would the load experienced by the master node be for each setup above?

 Thanks in advance,
 Safdar



Re: hadoop security API (repost)

2012-07-02 Thread Andrew Purtell
 You could do that, but that means your app will have to have keytabs
 for all the users want to act as. Proxyuser will be much easier to
 manage. Maybe getting proxyuser support in hbase if it is not there
 yet

I don't think proxy auth is what the OP is after. Do I have that
right? It implies the presence of a node somewhere to act as the proxy.
For HBase, there is https://issues.apache.org/jira/browse/HBASE-5050,
which would enable proxyuser support via the REST gateway as simple
follow-on work.

On Mon, Jul 2, 2012 at 9:21 AM, Alejandro Abdelnur t...@cloudera.com wrote:
 On Mon, Jul 2, 2012 at 9:15 AM, Tony Dean tony.d...@sas.com wrote:
 Alejandro,

 Thanks for the reply.  My intent is to also be able to scan/get/put hbase 
 tables under a specified identity as well.  What options do I have to 
 perform the same multi-tenant  authorization for these operations?  I have 
 posted this to hbase users distribution list as well, but thought you might 
 have insight.  Since hbase security authentication is so dependent upon 
 hadoop, it would be nice if your suggestion worked for hbase as well.

 Getting back to your suggestion... when configuring 
 hadoop.proxyuser.myserveruser.hosts, host1 would be where I'm making the 
 ugi.doAs() privileged call and host2 is the hadoop namenode?


 host1 in that case.

 Also, an another option, is there not a way for an application to pass 
 hadoop/hbase authentication the name of a Kerberos principal to use?  In 
 this case, no proxy, just execute as the designated user.

 You could do that, but that means your app will have to have keytabs
 for all the users want to act as. Proxyuser will be much easier to
 manage. Maybe getting proxyuser support in hbase if it is not there
 yet


 Thanks.

 -Tony

 -Original Message-
 From: Alejandro Abdelnur [mailto:t...@cloudera.com]
 Sent: Monday, July 02, 2012 11:40 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Tony,

 If you are doing a server app that interacts with the cluster on behalf of 
 different users (like Ooize, as you mentioned in your email), then you 
 should use the proxyuser capabilities of Hadoop.

 * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this 
 requires 2 properties settings, HOSTS and GROUPS).
 * Run your server app as MYSERVERUSER and have a Kerberos principal 
 MYSERVERUSER/MYSERVERHOST
 * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab
 * Use the UGI.doAs() to create JobClient/Filesystem instances using the user 
 you want to do something on behalf
 * Keep in mind that all the users you need to do something on behalf should 
 be valid Unix users in the cluster
 * If those users need direct access to the cluster, they'll have to be also 
 defined in in the KDC user database.

 Hope this helps.

 Thx

 On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
 Yes, but this will not work in a multi-tenant environment.  I need to be 
 able to create a Kerberos TGT per execution thread.

 I was hoping through JAAS that I could inject the name of the current 
 principal and authenticate against it.  I'm sure there is a best practice 
 for hadoop/hbase client API authentication, just not sure what it is.

 Thank you for your comment.  The solution may well be associated with the 
 UserGroupInformation class.  Hopefully, other ideas will come from this 
 thread.

 Thanks.

 -Tony

 -Original Message-
 From: Ivan Frain [mailto:ivan.fr...@gmail.com]
 Sent: Monday, July 02, 2012 8:14 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Hi Tony,

 I am currently working on this to access HDFS securely and programmaticaly.
 What I have found so far may help even if I am not 100% sure this is the 
 right way to proceed.

 If you have already obtained a TGT from the kinit command, hadoop library 
 will locate it automatically if the name of the ticket cache corresponds 
 to default location. On Linux it is located /tmp/krb5cc_uid-number.

 For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
 meaning you can impersonate ivan from hdfs linux user:
 --
 hdfs@mitkdc:~$ klist
 Ticket cache: FILE:/tmp/krb5cc_10003
 Default principal: i...@hadoop.lan

 Valid startingExpires   Service principal
 02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew
 until 03/07/2012 13:59
 ---

 Then, you just have to set the right security options in your hadoop client 
 in java and the identity will be i...@hadoop.lan for our example. In my 
 tests, I only use HDFS and here a snippet of code to have access to a 
 secure hdfs cluster assuming the previous TGT (ivan's impersonation):

 
  val conf: HdfsConfiguration = new HdfsConfiguration()

  conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")

 

Dealing with changing file format

2012-07-02 Thread Mohit Anchlia
I am wondering what the right way is to design the reading of input and
output where the file format may change over time. For instance, we might
start with field1,field2,field3 but at some point add a new field4 to
the input. What's the best way to deal with such scenarios? Keep a
timestamped catalog of changes?


Re: Dealing with changing file format

2012-07-02 Thread Robert Evans
There are several different ways.  One of the ways is to use something
like HCatalog to track the format and location of the dataset.  This may
be overkill for your problem, but it will grow with you.  Another is to
store the schema with the data when it is written out.  Your code may need
to dynamically adjust to when the field is there and when it is not.
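
As a trivial illustration of that dynamic adjustment (the comma-delimited format
and field names are only assumptions):

 // tolerate both the old 3-field and the new 4-field record layout
 def parseRecord(line: String): (String, String, String, Option[String]) = {
   val cols = line.split(",", -1)
   val field4 = if (cols.length > 3) Some(cols(3)) else None
   (cols(0), cols(1), cols(2), field4)
 }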

--Bobby Evans

On 7/2/12 4:09 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am wondering what's the right way to go about designing reading input
and
output where file format may change over period. For instance we might
start with field1,field2,field3 but at some point we add new field4 in
the input. What's the best way to deal with such scenarios? Keep a catalog
of changes that timestamped?



Re: force minimum number of nodes to run a job?

2012-07-02 Thread Harsh J
If you're talking in per-machine-slot terms, it is possible to do if
you use the Capacity Scheduler and set a memory requirement worth
4 slots for your job. This way the CS will reserve 4 slots for running a
single task (on a single TaskTracker).

If you are instead asking for a way to not run the tasks one by one, but
rather run them all in parallel (across machines) or otherwise not run
at all, that's not directly possible via the MR framework, but
you may make your tasks wait on your own conditions and only let them all
begin once every one has entered the running state. Using ZooKeeper should
let you do this. Alternatively, consider the YARN framework, which gives you
more granular control over the flow and execution of tasks if you need that.
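
As a rough sketch of the Capacity Scheduler route (the numbers are only
illustrative; the cluster-wide slot size is whatever the admin has set in
mapred.cluster.map.memory.mb):

 import org.apache.hadoop.mapred.JobConf

 val jobConf = new JobConf()
 // with a 2048 MB slot size, asking for 8192 MB per map task makes the
 // Capacity Scheduler reserve 4 map slots on one TaskTracker per task
 jobConf.set("mapred.job.map.memory.mb", "8192")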

On Tue, Jul 3, 2012 at 5:48 AM, Yang tedd...@gmail.com wrote:
 let's say my job can run on 4 mapper slots, but if there is only 1 slot
 available, I don't want the tasks to run one by one; they should wait until
 at least 4 slots are available.

 is it possible to force hadoop to do this?

 thanks!
 yang



-- 
Harsh J


Re: Dealing with changing file format

2012-07-02 Thread Harsh J
In addition to what Robert says, using a schema-based approach such as
Apache Avro can also help here. The schemas in Avro can evolve over
time if done right, while not breaking old readers.
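
As a sketch of the kind of evolution meant here (record and field names are
made up to match the earlier example): the new field carries a default, so
data written with the old 3-field schema can still be read with the new one.

 import org.apache.avro.Schema

 val evolvedSchema = new Schema.Parser().parse("""
   {"type": "record", "name": "Record", "fields": [
     {"name": "field1", "type": "string"},
     {"name": "field2", "type": "string"},
     {"name": "field3", "type": "string"},
     {"name": "field4", "type": ["null", "string"], "default": null}
   ]}
 """)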

On Tue, Jul 3, 2012 at 2:47 AM, Robert Evans ev...@yahoo-inc.com wrote:
 There are several different ways.  One of the ways is to use something
 like Hcatalog to track the format and location of the dataset.  This may
 be overkill for your problem, but it will grow with you.  Another is to
 store the scheme with the data when it is written out.  Your code may need
 to the dynamically adjust to when the field is there and when it is not.

 --Bobby Evans

 On 7/2/12 4:09 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am wondering what's the right way to go about designing reading input
and
output where file format may change over period. For instance we might
start with field1,field2,field3 but at some point we add new field4 in
the input. What's the best way to deal with such scenarios? Keep a catalog
of changes that timestamped?




-- 
Harsh J