Re: Spark config option 'expression language' feedback request
Reviving this to see if others would like to chime in about this expression language for config options.

On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson dale...@hotmail.com wrote:

Mridul, I may have added some confusion by giving examples in completely different areas. For example, the number of cores available for tasking on each worker machine is a resource-controller-level configuration variable. In standalone mode (i.e. using Spark's home-grown resource manager) the configuration variable SPARK_WORKER_CORES is an item that Spark admins can set (and we can use expressions for). The equivalent variable for YARN (yarn.nodemanager.resource.cpu-vcores) is only used by YARN's node manager setup and is set by YARN administrators, outside of the control of Spark (and most users). If you are not a cluster administrator then both variables are irrelevant to you. The same goes for SPARK_WORKER_MEMORY.

As for spark.executor.memory: as there is no way to know the attributes of a machine before a task is allocated to it, we cannot use any of the JVMInfo functions. For options like that, the expression parser can easily be limited to supporting different byte units of scale (kb/mb/gb etc.) and other configuration variables only.

Regards, Dale.

Date: Fri, 13 Mar 2015 17:30:51 -0700
Subject: Re: Spark config option 'expression language' feedback request
From: mri...@gmail.com
To: dale...@hotmail.com
CC: dev@spark.apache.org

Let me try to rephrase my query. How can a user specify, for example, what the executor memory or the number of cores should be? I don't want a situation where some variables can be specified using one set of idioms (from this PR, for example) and another set cannot be.

Regards, Mridul

On Fri, Mar 13, 2015 at 4:06 PM, Dale Richardson dale...@hotmail.com wrote:

Thanks for your questions Mridul. I assume you are referring to how the functionality to query system state works in YARN and Mesos? The APIs used are the standard JVM APIs, so the functionality will work without change. There is no real use case for using 'physicalMemoryBytes' in these cases though, as the JVM size has already been limited by the resource manager.

Regards, Dale.

Date: Fri, 13 Mar 2015 08:20:33 -0700
Subject: Re: Spark config option 'expression language' feedback request
From: mri...@gmail.com
To: dale...@hotmail.com
CC: dev@spark.apache.org

I am curious how you are going to support these over Mesos and YARN. Any configuration change like this should be applicable to all of them, not just local and standalone modes.

Regards, Mridul

On Friday, March 13, 2015, Dale Richardson dale...@hotmail.com wrote:

PR#4937 (https://github.com/apache/spark/pull/4937) is a feature to allow Spark configuration options (whether on the command line, in an environment variable, or in a configuration file) to be specified via a simple expression language. Such a feature has the following end-user benefits:

- Allows for flexibility in specifying time intervals or byte quantities in appropriate and easy-to-follow units, e.g. 1 week rather than 604800 seconds.
- Allows for the scaling of a configuration option in relation to system attributes, e.g.:
    SPARK_WORKER_CORES = numCores - 1
    SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
- Gives the ability to scale multiple configuration options together, e.g.:
    spark.driver.memory = 0.75 * physicalMemoryBytes
    spark.driver.maxResultSize = spark.driver.memory * 0.8

The following functions are currently supported by this PR:

- NumCores: number of cores assigned to the JVM (usually == physical machine cores)
- PhysicalMemoryBytes: memory size of the hosting machine
- JVMTotalMemoryBytes: current bytes of memory allocated to the JVM
- JVMMaxMemoryBytes: maximum number of bytes of memory available to the JVM
- JVMFreeMemoryBytes: maxMemoryBytes - totalMemoryBytes

I was wondering if anybody on the mailing list has any further ideas on other functions that could be useful to have when specifying Spark configuration options?

Regards, Dale.
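For concreteness, here is a sketch of where the proposed expressions would land in the two standard config locations. This illustrates the syntax proposed in PR#4937, which was under discussion and never a released Spark feature; treat the exact spellings as assumptions.

    # conf/spark-env.sh (standalone mode) -- hypothetical expression support.
    # Leave one core free for the OS and keep 1.5 GB of non-JVM headroom.
    SPARK_WORKER_CORES="numCores - 1"
    SPARK_WORKER_MEMORY="physicalMemoryBytes - 1.5 GB"

    # conf/spark-defaults.conf -- hypothetical expression support.
    # Derive one option from another so the two scale together.
    spark.driver.memory          0.75 * physicalMemoryBytes
    spark.driver.maxResultSize   spark.driver.memory * 0.8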
Re: Spark config option 'expression language' feedback request
Hi,

This is just a thought from my experience setting up Spark to run on a Linux cluster. I found it a bit unusual that some parameters could be specified as command-line args to spark-submit, others as env variables, and some in a configuration file. What I ended up doing was writing my own bash script that exported all the variables, and other scripts to call spark-submit with the arguments I wanted.

I think the expression language idea would be doable using an entirely env-variable-based approach, or as command-line parameters. That way there is only one configuration, which is easily scriptable, and you are still able to express relations like:

    spark.driver.maxResultSize = spark.driver.memory * 0.8

in your config as:

    export SPARK_DRIVER_MAXRESULTSIZE=$(echo "0.8 * $SPARK_DRIVER_MEMORY" | bc -l)

It may not look as nice, but it does allow for everything to be in one place, and for separate config files for certain jobs. Admittedly, if you want something like 0.8 * 2G, you first have to write a bash function to expand all the G/M/k suffixes, but that's not too painful.

On Mar 31, 2015 2:39 AM, Reynold Xin r...@databricks.com wrote: Reviving this to see if others would like to chime in about this expression language for config options.
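The suffix-expansion helper mentioned above might look something like this minimal sketch (the name to_bytes and the set of supported suffixes are illustrative assumptions, not something from the thread):

    # Expand a size like "2G", "512M" or "64k" into a plain byte count
    # so bc can do arithmetic on it.
    to_bytes() {
      local v=$1
      case "$v" in
        *[Kk]) echo $(( ${v%?} * 1024 )) ;;
        *[Mm]) echo $(( ${v%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${v%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$v" ;;
      esac
    }

    # Usage: cap the result size at 80% of a 2G driver heap.
    # (bc may emit a fractional byte count; real use would round it.)
    SPARK_DRIVER_MEMORY=2G
    export SPARK_DRIVER_MAXRESULTSIZE=$(echo "0.8 * $(to_bytes "$SPARK_DRIVER_MEMORY")" | bc -l)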
Re: should we add a start-masters.sh script in sbin?
Sounds good to me.

On Tue, Mar 31, 2015 at 6:12 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote:

Hey, the start-slaves.sh script is able to read from the slaves file and start slave nodes on multiple boxes. However, in standalone mode, if I want to use multiple masters I'll have to start a master on each individual box, and also need to provide the list of masters' hostname+port to each worker (start-slaves.sh only takes one master ip+port for now). I wonder: should we create a new script called start-masters.sh that reads a conf/masters file? The start-slaves.sh script may also need to change a little bit so that the master list can be passed to the worker nodes. Thanks
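A sketch of what such a script could look like, modeled on how start-slaves.sh fans out over a host list (the conf/masters file and the ssh fan-out here are assumptions about how this would be wired up, not an existing Spark script):

    #!/usr/bin/env bash
    # Hypothetical sbin/start-masters.sh: start a standalone Master on
    # every host listed in conf/masters (one hostname per line).
    # Assumes Spark is installed at the same path on every master host.
    sbin="$(cd "$(dirname "$0")" && pwd)"
    masters_file="$sbin/../conf/masters"

    while read -r host; do
      # Skip blank lines and comment lines.
      case "$host" in ""|"#"*) continue ;; esac
      ssh "$host" "$sbin/start-master.sh"
    done < "$masters_file"

Workers could then be handed the whole list as a single URL of the form spark://host1:7077,host2:7077, which standalone mode already accepts when multiple masters are run for recovery.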