On the contrary, it is a common occurrence in a Spark Jobserver style of
application with multiple users.
On Thu, Dec 20, 2018 at 6:09 PM Jiaan Geng wrote:
> This scene is rare.
> When you provide a web server for Spark, maybe you need it.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1
That will kill an entire Spark application, not a batch Job.
On Wed, Dec 5, 2018 at 3:07 PM Priya Matpadi wrote:
> if you are deploying your spark application on YARN cluster,
> 1. ssh into master node
> 2. List the currently running applications and retrieve the application_id
> yarn applica
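For reference, those steps look roughly like the following on a YARN cluster (the application id below is a placeholder; yours will differ):

```shell
# List running YARN applications and note the id of the Spark application
yarn application -list -appStates RUNNING

# Kill the whole Spark application (note: this kills the application,
# not an individual Spark job within it)
yarn application -kill application_1234567890123_0001
```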
It is intentionally not accessible in your code since Utils is internal
Spark code, not part of the public API. Changing Spark to make that private
code public would be inviting trouble, or at least future headaches. If you
don't already know how to build and maintain your own custom fork of Spark
What do you mean? Spark Jobs don't have names.
On Thu, Sep 20, 2018 at 9:40 PM Priya Ch
wrote:
> Hello All,
>
> I am trying to extend SparkListener and post job ends trying to retrieve
> job name to check the status of either success/failure and write to log
> file.
>
> I couldn't find a way whe
ng before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
>
> Matei
>
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra
> wrote:
> >
> > If we're going to do tha
some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
>
>
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra
> wrote:
>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 A
We could also deprecate Py2 already in the 2.4.0 release.
On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson wrote:
> In case this didn't make it onto this thread:
>
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove
> it entirely on a later 3.x release.
>
> On Sat, Sep 15
It's been done many times before by many organizations. Use Spark Job
Server or Livy or create your own implementation of a similar long-running
Spark Application. Creating a new Application for every Job is not the way
to achieve low-latency performance.
On Tue, Jul 10, 2018 at 4:18 AM wrote:
>
Essentially correct. The latency to start a Spark Job is nowhere close to
2-4 seconds under typical conditions. Creating a new Spark Application
every time instead of running multiple Jobs in one Application is not going
to lead to acceptable interactive or real-time performance, nor is that an
exe
The latency to start a Spark Job is nowhere close to 2-4 seconds under
typical conditions. You appear to be creating a new Spark Application
every time instead of running multiple Jobs in one Application.
On Fri, Jul 6, 2018 at 3:12 AM Tien Dat wrote:
> Dear Timothy,
>
> It works like a charm now
Horizontal scaling is scaling across multiple, distributed computers (or at
least OS instances). Local mode is, therefore, by definition not
horizontally scalable since it just uses a configurable number of local
threads. If the question actually asked "which cluster manager...?", then I
have a sma
spark.driver.maxResultSize
http://spark.apache.org/docs/latest/configuration.html
On Sat, Apr 28, 2018 at 8:41 AM, klrmowse wrote:
> i am currently trying to find a workaround for the Spark application i am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to
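For anyone landing here later, the limit mentioned above can be raised or disabled at submit time; the value below is illustrative, not a recommendation:

```
# spark-defaults.conf (or pass via --conf on spark-submit)
# default is 1g; 0 means unlimited (at the risk of OOM on the driver)
spark.driver.maxResultSize  4g
```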
Even more to the point:
http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-2-12-support-td23833.html
tldr; It's an item of discussion, but there is no imminent release of Spark
that will use Scala 2.12.
On Sat, Apr 21, 2018 at 2:44 AM, purijatin wrote:
> I see a discussion post on
Keep forgetting to reply to user list...
On Sun, Apr 15, 2018 at 1:58 PM, Mark Hamstra
wrote:
> Sure, data locality all the way at the basic storage layer is the easy way
> to avoid paying the costs of remote I/O. My point, though, is that that
> kind of storage locality isn't n
This is not a Databricks forum.
On Mon, Nov 13, 2017 at 3:18 PM, Benjamin Kim wrote:
> I have a question about this. The documentation compares the concept
> similar to BigQuery. Does this mean that we will no longer need to deal
> with instances and just pay for execution duration and amount of
with sbt. Why ?
>
>
> On Sun, Oct 15, 2017 at 11:54 PM, Mark Hamstra
> wrote:
>
>> I am building Spark using build.sbt.
>>
>>
>> Which just gets me back to my original question: Why? This is not the
>> correct way to build Spark with sbt.
>>
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. When running the above code without
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at
least one of them you will likely see that many Stages are marked as
"skipped", whi
n the meantime, submitting another Spark Application (*Application* # B)
> with the scheduler.mode as FAIR and dynamicallocation=true but it got only
> one executor. "
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Thu, Jul 20, 2017 at 4:56 PM, Mark Hamstr
First, Executors are not allocated to Jobs, but rather to Applications. If
you run multiple Jobs within a single Application, then each of the Tasks
associated with Stages of those Jobs has the potential to run on any of the
Application's Executors. Second, once a Task starts running on an Executor
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is
used repeatedly in the DAGScheduler.
On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> Hi all,
>
> I read same code of spark about stage.
>
> The constructor of stage keep the first job ID the stage was
You can't do that. SparkContext and SparkSession can exist only on the
Driver.
On Sun, May 28, 2017 at 6:56 AM, Abdulfattah Safa
wrote:
> How can I use SparkContext (to create Spark Session or Cassandra Sessions)
> in executors?
> If I pass it as parameter to the foreach or foreachpartition, the
I heard that once we reach release candidates it's not a question of time
or a target date, but only whether blockers are resolved and the code is
ready to release.
On Tue, May 23, 2017 at 11:07 AM, kant kodali wrote:
> Heard its end of this month (May)
>
> On Tue, May 23, 2017 at 9:41 AM, mojha
> two replies are not even in the same "email conversation".
>
I don't know the mechanics of why posts do or don't show up via Nabble, but
Nabble is neither the canonical archive nor the system of record for Apache
mailing lists.
> On Thu, May 4, 2017 at 8:11 PM, Mark
ding which
> lib to use.
>
> On 9 May 2017 at 14:30, Mark Hamstra wrote:
>
>> This looks more like a matter for Databricks support than spark-user.
>>
>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>>
This looks more like a matter for Databricks support than spark-user.
On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com
wrote:
> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>
>
>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.v
The check goal of the scalastyle plugin runs during the "verify" phase,
which is between "package" and "install"; so running just "package" will
not run scalastyle:check.
On Thu, May 4, 2017 at 7:45 AM, yiskylee wrote:
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>
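To illustrate the lifecycle point (flags as in the quoted command; adjust profiles to your own build):

```shell
# "package" stops before the verify phase, so scalastyle:check never runs
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests package

# "verify" (or "install") includes the verify phase and runs scalastyle:check
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests verify
```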
spark.local.dir
http://spark.apache.org/docs/latest/configuration.html
On Fri, Apr 28, 2017 at 8:51 AM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:
> Yes I am using HDFS. Just trying to understand a couple of points.
>
> There would be two kind of encryption which would be required.
>
> 1
evant/useful in this context?
>
> On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra
> wrote:
>
>> grrr... s/your/you're/
>>
>> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra
>> wrote:
>>
>> Your mixing up different levels of scheduling. Spark's
grrr... s/your/you're/
On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra
wrote:
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jo
Your mixing up different levels of scheduling. Spark's fair scheduler pools
are about scheduling Jobs, not Applications; whereas YARN queues with Spark
are about scheduling Applications, not Jobs.
On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas
wrote:
> I'm having trouble understanding the differe
When the RDD using them goes out of scope.
On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar
wrote:
> Thanks Mark! follow up question, do you know when shuffle files are
> usually un-referenced?
>
> On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra
> wrote:
>
>> Shuffl
Shuffle files are cleaned when they are no longer referenced. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala
On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <
ashan...@netflix.com.invalid> wrote:
> Hi!
>
> In spark on yarn, when are
foreachPartition is not a transformation; it is an action. If you want to
transform an RDD using an iterator in each partition, then use
mapPartitions.
On Tue, Feb 28, 2017 at 8:17 PM, jeremycod wrote:
> Hi,
>
> I'm trying to transform one RDD two times. I'm using foreachParition and
> embedded
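The transformation-versus-action distinction can be sketched without Spark at all. A plain-Python analogy (the function names mirror, but are not, the Spark API):

```python
partitions = [[1, 2], [3, 4]]  # stand-in for an RDD's partitions

def map_partitions(parts, f):
    # returns new data, like the mapPartitions transformation
    return [list(f(iter(p))) for p in parts]

def foreach_partition(parts, f):
    # runs f purely for side effects and returns nothing, like the action
    for p in parts:
        f(iter(p))

doubled = map_partitions(partitions, lambda it: (x * 2 for x in it))
# doubled == [[2, 4], [6, 8]]

seen = []
result = foreach_partition(partitions, lambda it: seen.extend(it))
# result is None; the only outcome is the side effect on `seen`
```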
First, the word you are looking for is "straggler", not "strangler" -- very
different words. Second, "idempotent" doesn't mean "only happens once", but
rather "if it does happen more than once, the effect is no different than
if it only happened once".
It is possible to insert a nearly limitless v
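A tiny illustration of that definition of idempotence (plain Python, nothing Spark-specific):

```python
s = set()
lst = []

def add_to_set(x):
    s.add(x)       # idempotent: repeats change nothing after the first call

def append_to_list(x):
    lst.append(x)  # not idempotent: every repeat changes the result

for _ in range(3):
    add_to_set(42)
    append_to_list(42)

# s is {42} whether the call happened once or three times; lst is [42, 42, 42]
```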
If you update the data, then you don't have the same DataFrame anymore. If
you don't do like Assaf did, caching and forcing evaluation of the
DataFrame before using that DataFrame concurrently, then you'll still get
consistent and correct results, but not necessarily efficient results. If
the fully
yes
On Fri, Feb 3, 2017 at 10:08 PM, kant kodali wrote:
> can I use Spark Standalone with HDFS but no YARN?
>
> Thanks!
>
More than one Spark Context in a single Application is not supported.
On Sun, Jan 29, 2017 at 9:08 PM, wrote:
> Hi,
>
>
>
> I have a requirement in which, my application creates one Spark context in
> Distributed mode whereas another Spark context in local mode.
>
> When I am creating this, my c
Try selecting a particular Job instead of looking at the summary page for
all Jobs.
On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> Hi Jacek,
>
> I tried accessing Spark web UI on both Firefox and Google Chrome browsers
> with ad blocker enabled. I do
I wouldn't say that Executors are dumb, but there are some pretty clear
divisions of concepts and responsibilities across the different pieces of
the Spark architecture. A Job is a concept that is completely unknown to an
Executor, which deals instead with just the Tasks that it is given. So you
a
See "API compatibility" in http://spark.apache.org/versioning-policy.html
While code that is annotated as Experimental is still a good faith effort
to provide a stable and useful API, the fact is that we're not yet
confident enough that we've got the public API in exactly the form that we
want to
> Wed
> Dec 28 20:01:10 UTC 2016
> 2.2.0-SNAPSHOT/
> <https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.2.0-SNAPSHOT/>
> Wed
> Dec 28 19:12:38 UTC 2016
>
> What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers wrote:
> seems like the artifacts are on maven central but the website is not yet
> updated.
>
> strangely the tag v2.1.0 is not yet available on github. i assume its
> equal to v2
Using a single SparkContext for an extended period of time is how
long-running Spark Applications such as the Spark Job Server work (
https://github.com/spark-jobserver/spark-jobserver). It's an established
pattern.
On Thu, Oct 27, 2016 at 11:46 AM, Gervásio Santos wrote:
> Hi guys!
>
> I'm dev
There is no need to do that if 1) the stage that you are concerned with
either made use of or produced MapOutputs/shuffle files; 2) reuse of those
shuffle files (which may very well be in the OS buffer cache of the worker
nodes) is sufficient for your needs; 3) the relevant Stage objects haven't
go
Yes and no. Something that you need to be aware of is that a Job as such
exists in the DAGScheduler as part of the Application running on the
Driver. When talking about stopping or killing a Job, however, what people
often mean is not just stopping the DAGScheduler from telling the Executors
to r
>
> The best network results are achieved when Spark nodes share the same
> hosts as Hadoop or they happen to be on the same subnet.
>
That's only true for those portions of a Spark execution pipeline that are
actually reading from HDFS. If you're re-using an RDD for which the needed
shuffle file
It sounds like you should be writing an application and not trying to force
the spark-shell to do more than what it was intended for.
On Tue, Sep 13, 2016 at 11:53 AM, Kevin Burton wrote:
> I sort of agree but the problem is that some of this should be code.
>
> Some of our ES indexes have 100-2
And, no, Spark's scheduler will not preempt already running Tasks. In
fact, just killing running Tasks for any reason is trickier than we'd like
it to be, so it isn't done by default:
https://issues.apache.org/jira/browse/SPARK-17064
On Fri, Sep 2, 2016 at 11:34 AM, Mark Hamstra
ht?
>
> Thank you
> ----------
> *From:* Mark Hamstra
> *Sent:* Thursday, September 1, 2016 8:44:10 PM
>
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> Spark's FairSchedulingAlgorithm is not round robin:
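A rough sketch of the ordering it actually uses, simplified from Spark's FairSchedulingAlgorithm comparator (the dict fields here are illustrative, not the real API): pools below their minShare come first; among needy pools, the one furthest below its minShare wins; otherwise running tasks are weighed against the pool's weight.

```python
def fair_precedes(a, b):
    """True if schedulable `a` should be offered resources before `b`."""
    a_needy = a["runningTasks"] < a["minShare"]
    b_needy = b["runningTasks"] < b["minShare"]

    def min_share_ratio(s):
        return s["runningTasks"] / max(s["minShare"], 1.0)

    def weight_ratio(s):
        return s["runningTasks"] / s["weight"]

    if a_needy and not b_needy:
        return True          # pools below their minShare always go first
    if b_needy and not a_needy:
        return False
    if a_needy and b_needy:
        ra, rb = min_share_ratio(a), min_share_ratio(b)
    else:
        ra, rb = weight_ratio(a), weight_ratio(b)
    if ra != rb:
        return ra < rb       # less loaded (relative to share/weight) goes first
    return a["name"] < b["name"]   # deterministic tie-break
```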
scheduled in round robin way,
> am I right?
>
> --
> *From:* Mark Hamstra
> *Sent:* Thursday, September 1, 2016 8:19:44 PM
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> The default pool (``) can be configured like any
> ot
mean, round robin for the jobs that belong to the default pool.
>
> Cheers,
> --
> *From:* Mark Hamstra
> *Sent:* Thursday, September 1, 2016 7:24:54 PM
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark schedul
Just because you've flipped spark.scheduler.mode to FAIR, that doesn't mean
that Spark can magically configure and start multiple scheduling pools for
you, nor can it know to which pools you want jobs assigned. Without doing
any setup of additional scheduling pools or assigning of jobs to pools,
y
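For anyone looking for the missing setup, a minimal sketch (pool name and values are illustrative): define pools in conf/fairscheduler.xml, then assign each submitting thread to a pool with sc.setLocalProperty("spark.scheduler.pool", "interactive") before running its jobs.

```
<!-- conf/fairscheduler.xml -->
<allocations>
  <pool name="interactive">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
```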
vailability. It is used in Spark
>>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>>> a distributed locking system.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn *
s/paying a role/playing a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra
wrote:
> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you are
> sending out from the driver is actually paying a
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually paying a role more like code (or at least
parameters.) What is sent from the driver to an Executer is then used
(t
imer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damag
What are you expecting to find? There currently are no releases beyond
Spark 2.0.0.
On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote:
> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the
Don't use Spark 2.0.0-preview. That was a preview release with known
issues, and was intended to be used only for early, pre-release testing
purpose. Spark 2.0.0 is now released, and you should be using that.
On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca
wrote:
> and, of course I am using
>
>
Nothing has changed in that regard, nor is there likely to be "progress",
since more sophisticated or capable resource scheduling at the Application
level is really beyond the design goals for standalone mode. If you want
more in the way of multi-Application resource scheduling, then you should
be
kedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 19:07, Mark Hamstra wrot
‘logical’ processors vs. cores and POSIX threaded
>> applications.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/
d
>> applications.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABU
I don't know what documentation you were referring to, but this is clearly
an erroneous statement: "Threads are virtual cores." At best it is
terminology abuse by a hardware manufacturer. Regardless, Spark can't get
too concerned about how any particular hardware vendor wants to refer to
the spec
box then it's fine but when you
> have large number of people on this site complaining about OOM and shuffle
> error all the time you need to start providing some transparency to
> address that.
>
> Thanks
>
>
> On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra
> wrote:
>
You appear to be misunderstanding the nature of a Stage. Individual
transformation steps such as `map` do not define the boundaries of Stages.
Rather, a sequence of transformations in which there is only a
NarrowDependency between each of the transformations will be pipelined into
a single Stage.
To be fair, the Stratosphere project from which Flink springs was started
as a collaborative university research project in Germany about the same
time that Spark was first released as Open Source, so they are near
contemporaries rather than Flink having been started only well after Spark
was an es
That's also available in standalone.
On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov
wrote:
> Spark on Yarn supports dynamic resource allocation
>
> So, you can run several spark-shells / spark-submits / spark-jobserver /
> zeppelin on one cluster without defining upfront how many executor
https://spark.apache.org/docs/latest/cluster-overview.html
On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar
wrote:
> On Spark GUI I can see the list of Workers.
>
> I always understood that workers are used by executors.
>
> What is the relationship between workers and executors please. Is it one
>
Why would the Executors shutdown when the Job is terminated? Executors are
bound to Applications, not Jobs. Furthermore,
unless spark.job.interruptOnCancel is set to true, canceling the Job at the
Application and DAGScheduler level won't actually interrupt the Tasks
running on the Executors. If
file/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 March 2016 at 00:22, Mark Hamstra wrote:
>
>> Yes and no. The i
Yes and no. The idea of n-tier architecture is about 20 years older than
Spark and doesn't really apply to Spark as n-tier was original conceived.
If the n-tier model helps you make sense of some things related to Spark,
then use it; but don't get hung up on trying to force a Spark architecture
in
You seem to be confusing the concepts of Job and Application. A Spark
Application has a SparkContext. A Spark Application is capable of running
multiple Jobs, each with its own ID, visible in the webUI.
On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt wrote:
> Am 24.03.2016 um 10:34 schrieb Simon
On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu wrote:
> bq. when I get the last RDD
> If I read Todd's first email correctly, the computation has been done.
> I could be wrong.
>
> On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra
> wrote:
>
>> Neither of you is making a
Neither of you is making any sense to me. If you just have an RDD for
which you have specified a series of transformations but you haven't run
any actions, then neither checkpointing nor saving makes sense -- you
haven't computed anything yet, you've only written out the recipe for how
the computa
You're not getting what Ted is telling you. Your `dict` is an RDD[String]
-- i.e. it is a collection of a single value type, String. But
`collectAsMap` is only defined for PairRDDs that have key-value pairs for
their data elements. Both a key and a value are needed to collect into a
Map[K, V].
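The same constraint exists in plain Python's dict constructor, which may make the point concrete:

```python
# dict(), like collectAsMap, needs an iterable of (key, value) pairs
pairs = [("a", 1), ("b", 2)]
as_map = dict(pairs)       # fine: both a key and a value per element

# a collection of bare values has no keys to build a map from
try:
    dict([1, 2, 3])
    built = True
except TypeError:
    built = False          # fails, just as collectAsMap on an RDD[String] would
```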
g may not be 100% accurate and bug
free.
On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph
wrote:
> Okay, so out of 164 stages, 163 are skipped. And how are 41405 tasks
> skipped if the total is only 19788?
>
> On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra
> wrote:
>
>>
It's not just if the RDD is explicitly cached, but also if the map outputs
for stages have been materialized into shuffle files and are still
accessible through the map output tracker. Because of that, explicitly
caching RDD actions often gains you little or nothing, since even without a
call to c
uction. i
> would love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra
> wrote:
>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs. This would require a cluster
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs. This would require a cluster with 1000
> cores.
This doesn't make sense. A Spark Job is a driver/DAGScheduler concept
without any 1:1 correspondence between Worker cores and Jobs.
One issue is that RAID levels providing data replication are not necessary
since HDFS already replicates blocks on multiple nodes.
On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote:
> Parallel disk IO? But the effect should be less noticeable compared to
> Hadoop which reads/writes a lot. Much
There's probably nothing wrong other than a glitch in the reporting of
Executor state transitions to the UI -- one of those low-priority items
I've been meaning to look at for awhile
On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal wrote:
> Maybe check the worker logs to see what's going wrong w
standalone deployment (it is slightly mentioned in SPARK-9882, but it seems
> to be abandoned). Do you know if there is such an activity?
>
> --
> Be well!
> Jean Morozov
>
> On Sun, Feb 21, 2016 at 4:32 AM, Mark Hamstra
> wrote:
>
>> It's 2 -- and it's
It's 2 -- and it's pretty hard to point to a line of code, a method, or
even a class since the scheduling of Tasks involves a pretty complex
interaction of several Spark components -- mostly the DAGScheduler,
TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as
well as the Sche
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_72)
Type in expressions to have them evaluated.
Type :he
https://github.com/apache/spark/pull/10608
On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote:
> I'm not an authoritative source but I think it is indeed the plan to
> move the default build to 2.11.
>
> See this discussion for more detail
>
> http://apache-spark-developers-list.1001551.n3.na
What do you think is preventing you from optimizing your own RDD-level
transformations and actions? AFAIK, nothing that has been added in
Catalyst precludes you from doing that. The fact of the matter is, though,
that there is less type and semantic information available to Spark from
the raw RDD
ess JobServer can fundamentally solve my problem,
> so that jobs can be submitted at different time and still share RDDs.
>
> Best Regards,
> Jia
>
>
> On Jan 17, 2016, at 3:44 PM, Mark Hamstra wrote:
>
> There is a 1-to-1 relationship between Spark Applications and
> SparkC
utor. After the application
> runs to completion. The executor process will be killed.
> But I hope that all applications submitted can run in the same executor,
> can JobServer do that? If so, it’s really good news!
>
> Best Regards,
> Jia
>
> On Jan 17, 2016, at 3:09 PM, Mark
a Zou wrote:
> Hi, Mark, sorry, I mean SparkContext.
> I mean to change Spark into running all submitted jobs (SparkContexts) in
> one executor JVM.
>
> Best Regards,
> Jia
>
> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra
> wrote:
>
>> -dev
>>
>&
ice surprise to me
>
> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra
> wrote:
>
>> Same SparkContext means same pool of Workers. It's up to the Scheduler,
>> not the SparkContext, whether the exact same Workers or Executors will be
>> used to calculate simultaneou
-dev
What do you mean by JobContext? That is a Hadoop mapreduce concept, not
Spark.
On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou wrote:
> Dear all,
>
> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>
> Best Regards,
> Jia
>
Same SparkContext means same pool of Workers. It's up to the Scheduler,
not the SparkContext, whether the exact same Workers or Executors will be
used to calculate simultaneous actions against the same RDD. It is likely
that many of the same Workers and Executors will be used as the Scheduler
tri
It's not a bug, but a larger heap is required with the new
UnifiedMemoryManager:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172
On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Hi All
I can override the root pool in configuration file, Thanks Mark.
>
> On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra
> wrote:
>
>> Just configure with
>> FAIR in fairscheduler.xml (or
>> in spark.scheduler.allocation.file if you have over-riden the default name
>>
ant is the default pool is fair
> scheduling. But seems if I want to use fair scheduling now, I have to set
> spark.scheduler.pool explicitly.
>
> On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra
> wrote:
>
>> I don't understand. If you're using fair scheduling an
I don't understand. If you're using fair scheduling and don't set a pool,
the default pool will be used.
On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote:
>
> It seems currently spark.scheduler.pool must be set as localProperties
> (associate with thread). Any reason why spark.scheduler.pool ca
g-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Wed, Dec 16, 2015 at 10:55 AM, Mark Hamstra
> wrote:
> > It can be used, and is used in user code, but it isn't always as
>
It can be used, and is used in user code, but it isn't always as
straightforward as you might think. This is mostly because a Job often
isn't a Job -- or rather it is more than one Job. There are several RDD
transformations that aren't lazy, so they end up launching "hidden" Jobs
that you may not
No, publishing a spark assembly jar is not fine. See the doc attached to
https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a
likely goal of Spark 2.0 will be the elimination of assemblies.
On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com
wrote:
> Using maven to download the
Where it could start to make some sense is if you wanted a single
application to be able to work with more than one Spark cluster -- but
that's a pretty weird or unusual thing to do, and I'm pretty sure it
wouldn't work correctly at present.
On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust
wrote