Great replies, everyone. Confining myself to the current subject, Spark and its use of CPU allocation, we have these spark-submit parameters:

Local mode, two cores:

    ${SPARK_HOME}/bin/spark-submit \
        --num-executors 1 \
        --master local[2] \

The k in --master local[k] on my box comes from:

    cat /proc/cpuinfo|grep processor
    processor : 0
    processor : 1
    processor : 2
    processor : 3
    processor : 4
    processor : 5
    processor : 6
    processor : 7
    processor : 8
    processor : 9
    processor : 10
    processor : 11

So there are 12 logical processors, numbered 0-11. And there are 12 "core id" entries:

    cat /proc/cpuinfo|grep 'core id'
    core id : 0
    core id : 1
    core id : 2
    core id : 8
    core id : 9
    core id : 10
    core id : 0
    core id : 1
    core id : 2
    core id : 8
    core id : 9
    core id : 10

That is 6 distinct core ids (0, 1, 2, 8, 9, 10), each appearing twice, which matches the 6 cores with 2 logical processors each that cpuinfo reports further down.

So in spark-submit I can put the maximum number of cores:

    ${SPARK_HOME}/bin/spark-submit \
        --num-executors 1 \
        --master local[12] \

Actually, this is what the Spark doc <http://spark.apache.org/docs/latest/submitting-applications.html> says:

*Run application locally on 8 cores*

    ./bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master local[8] \

That settles our usage.

Now, I mentioned the licensing charges earlier. If I run any SAP product, they are going to charge us by the cores on this host for their software:

    ./cpuinfo
    License hostid: 00e04c69159a 0050b60fd1e7
    *Detected 12 logical processor(s), 6 core(s), in 1 chip(s)*

They charge by core(s), so we will have to pay for 6 cores, not 12 logical processors. I am sure that if they knew they could charge for 12 cores, they would have done it by now :)

Cheers

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 17 June 2016 at 12:01, Robin East <robin.e...@xense.co.uk> wrote:

> Agreed it’s a worthwhile discussion (and interesting IMO)
>
> This is a section from your original post:
>
> It is about the terminology or interpretation of that in Spark doc.
>>>>>
>>>>> This is my understanding of cores and threads.
>>>>>
>>>>> Cores are physical cores. Threads are virtual cores.
>>>>>
>>>>
> At least as far as Spark doc is concerned Threads are not synonymous with
> virtual cores; they are closely related concepts of course. So any time we
> want to have a discussion about architecture, performance, tuning,
> configuration etc we do need to be clear about the concepts and how they
> are defined.
>
> Granted CPU hardware implementation can also refer to ‘threads’. In fact
> Oracle/Sun seem unclear as to what they mean by thread - in various
> documents they define threads as:
>
> A software entity that can be executed on hardware (e.g. Oracle SPARC
> Architecture 2011)
>
> At other times as:
>
> A thread is a hardware strand. Each thread, or strand, enjoys a unique set
> of resources in support of its … (e.g. OpenSPARC T1 Microarchitecture
> Specification)
>
> So unless the documentation you are writing is very specific to your
> environment, and the idea that a thread is a logical processor is generally
> accepted, I would not be inclined to treat threads as if they are logical
> processors.
>
> On 16 Jun 2016, at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Thanks all.
>
> I think we are diverging but IMO it is a worthwhile discussion
>
> Actually, threads are a hardware implementation - hence the whole notion
> of “multi-threaded cores”. What happens is that the cores often have
> duplicate registers, etc. for holding execution state. While it is
> correct that only a single process is executing at a time, a single core
> will have execution states of multiple processes preserved in these
> registers. In addition, it is the core (not the OS) that determines when
> the thread is executed. The approach often varies according to the CPU
> manufacturer, but the most simple approach is when one thread of execution
> executes a multi-cycle operation (e.g.
> a fetch from main memory, etc.), the core simply stops processing that
> thread, saves the execution state to a set of registers, loads
> instructions from the other set of registers and goes on. On the Oracle
> SPARC chips, it will actually check the next thread to see if the reason
> it was ‘parked’ has completed and, if not, skip it for the subsequent
> thread. The OS is only aware of what are cores and what are logical
> processors - and dispatches accordingly. *Execution is up to the cores*.
>
> Cheers
>
> Dr Mich Talebzadeh
>
> On 16 June 2016 at 13:02, Robin East <robin.e...@xense.co.uk> wrote:
>
>> Mich
>>
>> "A core may have one or more threads"
>> It would be more accurate to say that a core could *run* one or more
>> threads scheduled for execution. Threads are a software/OS concept that
>> represents executable code that is scheduled to run by the OS; a CPU, core
>> or virtual core/virtual processor executes that code. Threads are not CPUs
>> or cores whether physical or logical - any Spark documentation that implies
>> this is mistaken. I’ve looked at the documentation you mention and I don’t
>> read it to mean that threads are logical processors.
>>
>> To go back to your original question, if you set local[6] and you have 12
>> logical processors then you are likely to have half your CPU resources
>> unused by Spark.
>>
>> On 15 Jun 2016, at 23:08, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> I think it is slightly more than that.
>>
>> These days software is licensed by core (generally speaking). That is
>> the physical processor. *A core may have one or more threads - or
>> logical processors*. Virtualization adds some fun to the mix.
>> Generally what they present is ‘virtual processors’.
>> What that equates to depends on the virtualization layer itself. In some
>> simpler VMs it is virtual=logical. In others, virtual=logical but they
>> are constrained to be from the same cores - e.g. if you get 6 virtual
>> processors, it really is 3 full cores with 2 threads each. The rationale
>> is the way OS dispatching works on ‘logical’ processors vs. cores and
>> POSIX threaded applications.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 13 June 2016 at 18:17, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> I don't know what documentation you were referring to, but this is
>>> clearly an erroneous statement: "Threads are virtual cores." At best it is
>>> terminology abuse by a hardware manufacturer. Regardless, Spark can't get
>>> too concerned about how any particular hardware vendor wants to refer to
>>> the specific components of their CPU architecture. For us, a core is a
>>> logical execution unit, something on which a thread of execution can run.
>>> That can map in different ways to different physical or virtual hardware.
>>>
>>> On Mon, Jun 13, 2016 at 12:02 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> It is not the issue of testing anything. I was referring to
>>>> documentation that clearly uses the term "threads". As I said and showed
>>>> before, one line is using the term "thread" and the next one "logical
>>>> cores".
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On 12 June 2016 at 23:57, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
>>>>
>>>>> Spark is a software product. In software a "core" is something that a
>>>>> process can run on. So it's a "virtual core". (Do not call these
>>>>> "threads". A "thread" is not something a process can run on.)
>>>>>
>>>>> local[*] uses java.lang.Runtime.availableProcessors()
>>>>> <https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2608>.
>>>>> Since Java is software, this also returns the number of virtual cores.
>>>>> (You can test this easily.)
>>>>>
>>>>> On Sun, Jun 12, 2016 at 9:23 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I was writing some docs on Spark P&T and came across this.
>>>>>>
>>>>>> It is about the terminology or interpretation of that in Spark doc.
>>>>>>
>>>>>> This is my understanding of cores and threads.
>>>>>>
>>>>>> Cores are physical cores. Threads are virtual cores. A core with 2
>>>>>> threads is called hyper-threading technology, so 2 threads per core
>>>>>> makes the core work on two loads at the same time. In other words,
>>>>>> every thread takes care of one load.
>>>>>>
>>>>>> A core has its own memory. So if you have a dual core with hyper
>>>>>> threading, each core works with 2 loads at the same time because of
>>>>>> the 2 threads per core, but these 2 threads will share memory in that
>>>>>> core.
>>>>>>
>>>>>> Some vendors, as I am sure most of you are aware, charge licensing per core.
>>>>>>
>>>>>> For example, on the same host where I have Spark, I have a SAP product
>>>>>> that checks the licensing and shuts the application down if the license
>>>>>> does not agree with the cores specified.
>>>>>>
>>>>>> This is what it says:
>>>>>>
>>>>>> ./cpuinfo
>>>>>> License hostid: 00e04c69159a 0050b60fd1e7
>>>>>> Detected 12 logical processor(s), 6 core(s), in 1 chip(s)
>>>>>>
>>>>>> So here I have 12 logical processors and 6 cores and 1 chip. I call
>>>>>> logical processors threads, so I have 12 threads?
>>>>>>
>>>>>> Now if I go and start the worker process with
>>>>>> ${SPARK_HOME}/sbin/start-slaves.sh, I see this in the GUI page
>>>>>>
>>>>>> <image.png>
>>>>>>
>>>>>> It says 12 cores, but I gather it is threads?
>>>>>>
>>>>>> The Spark document
>>>>>> <http://spark.apache.org/docs/latest/submitting-applications.html>
>>>>>> states, and I quote
>>>>>>
>>>>>> <image.png>
>>>>>>
>>>>>> OK, the line local[k] adds .. *set this to the number of cores on
>>>>>> your machine*
>>>>>>
>>>>>> But I know that it means threads. Because if I went and set that to
>>>>>> 6, it would be only 6 threads as opposed to 12 threads.
>>>>>>
>>>>>> The next line, local[*], seems to indicate it correctly, as it refers
>>>>>> to "logical cores", which in my understanding are threads.
>>>>>>
>>>>>> I trust that I am not nitpicking here!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Dr Mich Talebzadeh
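
For anyone who wants to repeat the counting done in this thread with grep, here is a minimal Python sketch. It is not from any of the posters: the names count_cpus, local_master and SAMPLE_CPUINFO are made up for illustration, the sample text only mirrors the processor and core id values quoted in the thread, and it assumes a single-chip host (as the cpuinfo output reports), so that the distinct core ids can be read as physical cores.

```python
import re

# Sample text mirroring the grep output quoted in this thread: 12 logical
# processors, with core ids 0, 1, 2, 8, 9, 10 each listed twice.
# On a live Linux host you could read the text of /proc/cpuinfo instead.
SAMPLE_CPUINFO = "\n".join(
    f"processor : {p}\ncore id : {c}"
    for p, c in enumerate([0, 1, 2, 8, 9, 10] * 2)
)


def count_cpus(cpuinfo_text):
    """Return (logical_processors, physical_cores) from cpuinfo-style text.

    Assumption: a single chip, so distinct 'core id' values are taken to be
    physical cores. A multi-socket host would also need 'physical id'.
    """
    processors = re.findall(r"^processor\s*:\s*(\d+)", cpuinfo_text, re.MULTILINE)
    core_ids = re.findall(r"^core id\s*:\s*(\d+)", cpuinfo_text, re.MULTILINE)
    return len(processors), len(set(core_ids))


def local_master(k):
    """Build the value passed to spark-submit --master for local mode."""
    return f"local[{k}]"


logical, cores = count_cpus(SAMPLE_CPUINFO)
print(f"logical processors: {logical}, cores: {cores}")
print(f"master using every logical processor: {local_master(logical)}")
```

With the values from this thread, the first number is what local[k] would be sized against (12) and the second is what a per-core licence would be counted against (6).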