Re: Spark config option 'expression language' feedback request

2015-03-31 Thread Reynold Xin
Reviving this to see if others would like to chime in about this
expression language for config options.


On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson dale...@hotmail.com
wrote:

 Mridul, I may have added some confusion by giving examples from completely
 different areas. For example, the number of cores available for tasking on
 each worker machine is a resource-manager-level configuration variable.
 In standalone mode (i.e. using Spark's home-grown resource manager) the
 configuration variable SPARK_WORKER_CORES is an item that Spark admins can
 set (and we can use expressions for). The equivalent variable for YARN
 (yarn.nodemanager.resource.cpu-vcores) is only used by YARN's node-manager
 setup and is set by YARN administrators, outside the control of Spark
 (and most users). If you are not a cluster administrator then both
 variables are irrelevant to you. The same goes for SPARK_WORKER_MEMORY.

 As for spark.executor.memory: since there is no way to know the attributes
 of a machine before a task is allocated to it, we cannot use any of the
 JVMInfo functions. For options like that, the expression parser can easily
 be limited to supporting the different byte units of scale (KB/MB/GB etc.)
 and references to other configuration variables only.
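 To illustrate the idea (a rough sketch I am making up here, not the actual
 parser in PR#4937, and all names are hypothetical), a restricted evaluator
 for such options could look something like:

     // Rough sketch only: accepts "<number><unit>" byte sizes and
     // "<other.config.key> * <factor>" references, with no machine introspection.
     object RestrictedConfExpr {
       private val unitBytes =
         Map("kb" -> 1024L, "mb" -> 1024L * 1024, "gb" -> 1024L * 1024 * 1024)

       // `settings` holds already-resolved options, e.g. "spark.driver.memory" -> bytes
       def evalBytes(expr: String, settings: Map[String, Long]): Long = {
         val sized  = """(\d+(?:\.\d+)?)\s*(kb|mb|gb)""".r
         val scaled = """([a-z0-9.]+)\s*\*\s*(\d+(?:\.\d+)?)""".r
         expr.trim.toLowerCase match {
           case sized(num, unit)    => (num.toDouble * unitBytes(unit)).toLong
           case scaled(key, factor) => (settings(key) * factor.toDouble).toLong
           case plain               => plain.toLong  // bare byte count
         }
       }
     }

     // e.g. evalBytes("spark.driver.memory * 0.8",
     //                Map("spark.driver.memory" -> (4L << 30)))  // 80% of 4 GiB
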
 Regards, Dale.




  Date: Fri, 13 Mar 2015 17:30:51 -0700
  Subject: Re: Spark config option 'expression language' feedback request
  From: mri...@gmail.com
  To: dale...@hotmail.com
  CC: dev@spark.apache.org
 
  Let me try to rephrase my query.
  How can a user specify, for example, what the executor memory or the
  number of cores should be?
 
  I don't want a situation where some variables can be specified using
  one set of idioms (from this PR, for example) and another set cannot
  be.
 
 
  Regards,
  Mridul
 
 
 
 
  On Fri, Mar 13, 2015 at 4:06 PM, Dale Richardson dale...@hotmail.com
 wrote:
  
  
  
   Thanks for your questions, Mridul.
   I assume you are referring to how the functionality to query system
   state works under YARN and Mesos?
   The APIs used are the standard JVM APIs, so the functionality will work
   without change. There is no real use case for using 'physicalMemoryBytes'
   in these cases, though, as the JVM size has already been limited by the
   resource manager.
   Regards, Dale.
   Date: Fri, 13 Mar 2015 08:20:33 -0700
   Subject: Re: Spark config option 'expression language' feedback
 request
   From: mri...@gmail.com
   To: dale...@hotmail.com
   CC: dev@spark.apache.org
  
   I am curious how you are going to support these on Mesos and YARN.
   Any configuration change like this should be applicable to all of them,
   not just local and standalone modes.
  
   Regards
   Mridul
  
   On Friday, March 13, 2015, Dale Richardson dale...@hotmail.com
 wrote:
  
   
   
   
   
   
   
   
   
   
   
   
PR#4937 (https://github.com/apache/spark/pull/4937) is a feature to allow
Spark configuration options (whether given on the command line, in an
environment variable, or in a configuration file) to be specified via a
simple expression language.
   
   
Such a feature has the following end-user benefits:
- Allows time intervals or byte quantities to be specified in appropriate,
easy-to-follow units, e.g. 1 week rather than 604800 seconds.

- Allows a configuration option to be scaled in relation to a system
attribute, e.g.:

SPARK_WORKER_CORES = numCores - 1

SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB

- Gives the ability to scale multiple configuration options together, e.g.:

spark.driver.memory = 0.75 * physicalMemoryBytes

spark.driver.maxResultSize = spark.driver.memory * 0.8
   
   
The following functions are currently supported by this PR:

NumCores:             Number of cores assigned to the JVM (usually equal to
                      the physical machine's cores)
PhysicalMemoryBytes:  Memory size of the hosting machine
JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
JVMMaxMemoryBytes:    Maximum number of bytes of memory available to the JVM
JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
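
For reference, these map onto standard JVM calls roughly as follows (a
sketch, not the exact code in the PR; the physical-memory figure relies on
the com.sun.management extension of OperatingSystemMXBean):

    import java.lang.management.ManagementFactory

    val runtime = Runtime.getRuntime
    val numCores            = runtime.availableProcessors()           // NumCores
    val jvmTotalMemoryBytes = runtime.totalMemory()                    // JVMTotalMemoryBytes
    val jvmMaxMemoryBytes   = runtime.maxMemory()                      // JVMMaxMemoryBytes
    val jvmFreeMemoryBytes  = jvmMaxMemoryBytes - jvmTotalMemoryBytes  // JVMFreeMemoryBytes

    // PhysicalMemoryBytes: only exposed via the com.sun.management extension
    val physicalMemoryBytes = ManagementFactory.getOperatingSystemMXBean match {
      case os: com.sun.management.OperatingSystemMXBean => os.getTotalPhysicalMemorySize
      case _                                            => -1L  // not available on this JVM
    }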
   
   
I was wondering if anybody on the mailing list has any further ideas on
other functions that could be useful when specifying Spark configuration
options?
Regards, Dale.
   
  
  
 
 




Re: Spark config option 'expression language' feedback request

2015-03-31 Thread Mike Hynes
Hi,
This is just a thought from my experience setting up Spark to run on a
Linux cluster. I found it a bit unusual that some parameters could be
specified as command-line args to spark-submit, others as env variables,
and some in a configuration file. What I ended up doing was writing my own
bash script that exported all the variables, plus other scripts to call
spark-submit with the arguments I wanted.

I think the expression language idea would be doable with an entirely
env-variable-based approach, or as command-line parameters. That way there
is only one configuration mechanism, which is easily scriptable, and you
are still able to express relations like:
spark.driver.maxResultSize = spark.driver.memory * 0.8
in your config as
export SPARK_DRIVER_MAXRESULTSIZE=$(bc -l <<< "0.8 * $SPARK_DRIVER_MEMORY")

It may not look as nice, but it does allow everything to be in one place,
and to have separate config files for certain jobs. Admittedly, if you want
something like 0.8 * 2G, you first have to write a bash function to expand
the G/M/k suffixes, but that's not too painful.

Re: should we add a start-masters.sh script in sbin?

2015-03-31 Thread Ted Yu
Sounds good to me.

On Tue, Mar 31, 2015 at 6:12 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:

 Hey,

 The start-slaves.sh script is able to read the slaves file and start slave
 nodes on multiple boxes.
 However, in standalone mode, if I want to use multiple masters I'll have to
 start a master on each individual box, and also need to provide the list of
 masters' hostname+port to each worker (start-slaves.sh only takes one master
 ip+port for now).
 I wonder whether we should create a new script called start-masters.sh that
 reads the conf/masters file. The start-slaves.sh script may also need to
 change a little so that the master list can be passed to the worker nodes.

 Thanks
