Hi,
Just updating on my findings for future reference.
The problem was that after refactoring my code I ended up with a Scala
object which held a SparkContext as a member, e.g.:
import org.apache.spark.SparkContext

object A {
  val sc: SparkContext = new SparkContext()
  def mapFunction(x: String): String = { x /* ... */ }
}
and when I called rdd.map(A.mapFunction) it
How about using the `@transient` annotation?
// maropu
On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> Just updating on my findings for future reference.
> The problem was that after refactoring my code I ended up with a scala
> object which held
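A minimal sketch of that suggestion (the map body and types are placeholders,
not taken from the original code): marking the field @transient (and lazy)
keeps the non-serializable SparkContext out of anything Spark tries to ship to
executors, though the field is then only meant to be used on the driver.

import org.apache.spark.SparkContext

object A {
  // @transient lazy keeps sc out of serialized closures that capture A;
  // use it on the driver only.
  @transient lazy val sc: SparkContext = new SparkContext()

  // The map function itself references nothing non-serializable.
  def mapFunction(x: String): String = x.toUpperCase
}

// rdd.map(A.mapFunction) should now serialize without dragging sc along.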
Mind sharing code? I think only shuffle failures lead to stage failures and
re-tries.
Jacek
On 19 Jun 2016 4:35 p.m., "Ted Yu" wrote:
> You can utilize a counter in external storage (NoSQL e.g.)
> When the counter reaches 2, stop throwing exception so that the task
>
With the RDD API, you could optimize shuffling by making sure that bytes
are shuffled instead of objects, using an appropriate ser/de mechanism
before and after the shuffle. For example:
Before parallelize, transform to bytes using a dedicated serializer,
parallelize, and immediately after
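A rough sketch of that idea, assuming plain Java serialization stands in for
the "dedicated serializer" (the helper names are made up for illustration):
records become byte arrays on the driver, the byte arrays are what gets
parallelized and shuffled, and they are decoded immediately afterwards.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.SparkContext

def serialize(o: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}

def deserialize[T](bytes: Array[Byte]): T = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}

def example(sc: SparkContext, records: Seq[String]): Unit = {
  val asBytes = records.map(serialize)                 // to bytes before parallelize
  val rdd     = sc.parallelize(asBytes)                // only byte arrays get shuffled
  val decoded = rdd.map(b => deserialize[String](b))   // decode immediately after
  decoded.take(5).foreach(println)
}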
Hi, Joseph,
This is a known issue but not a bug.
This issue does not occur when you use an interactive SparkR session, but it
does occur when you execute an R file.
The reason is that when you execute an R file, the R backend
launches before the R interpreter, so there is no
Hi,
I have been told Spark in local mode is simplest for testing. The Spark
documentation covers little about local mode except the cores used in
--master local[k].
Where are the driver program, executor and resources? Do I need to start
worker threads, and how many apps can I run safely without
Hi,
In local mode, Spark runs in a single JVM that has a master and one
executor with `k` threads.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94
// maropu
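For reference, a minimal local-mode setup with k = 4 worker threads (the app
name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// One JVM: the scheduler and a single executor run in-process,
// with 4 threads doing the work ("local[*]" would use all available cores).
val conf = new SparkConf().setMaster("local[4]").setAppName("local-mode-test")
val sc   = new SparkContext(conf)

sc.parallelize(1 to 100).map(_ * 2).count()   // runs entirely inside this JVM
sc.stop()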
On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar
Thank you.
What are the main differences between local mode and standalone mode? I
understand local mode does not support a cluster. Is that the only difference?
On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro
wrote:
Hi,
In a local mode, spark runs in a single
Hi,
How can I get a score for each row from classification algorithms, and how
can I plot the feature importances of variables, like in scikit-learn?
Thanks.
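For what it's worth, a hedged sketch with Spark ML, assuming a tree-ensemble
classifier such as RandomForestClassifier (trainingDf/testDf and the column
names are placeholders): the transform output carries a probability column as
a per-row score, and featureImportances is the analogue of scikit-learn's
feature_importances_, which you can collect and plot with any external tool.

import org.apache.spark.ml.classification.RandomForestClassifier

// trainingDf is assumed to have a "features" Vector column and a "label" column.
val rf    = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
val model = rf.fit(trainingDf)

// Per-row class probabilities (a "score" for each row).
model.transform(testDf).select("probability", "prediction").show(5)

// Feature importances; collect and plot them outside Spark.
println(model.featureImportances)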
Hi,
To start, when you store the data in an ORC file, can you verify that the
data is there?
For example, register it as a temp table:
processDF.registerTempTable("tmp")
sqlContext.sql("select count(1) from tmp").show
Also, what do you mean by an index file in ORC?
HTH
Dr Mich Talebzadeh
There are many technical differences under the hood, but they are used in
almost the same way.
Yes, in standalone mode, Spark runs as a cluster: see
http://spark.apache.org/docs/1.6.1/cluster-overview.html
// maropu
On Sun, Jun 19, 2016 at 6:14 PM, Ashok Kumar
On Sun, Jun 19, 2016 at 12:30 PM, Mich Talebzadeh
wrote:
> Spark Local - Spark runs on the local host. This is the simplest set up and
> best suited for learners who want to understand different concepts of Spark
> and those performing unit testing.
There are also the
Good points, but I am an experimentalist.
In local mode I have this:
--master local
This will start with one thread, equivalent to --master local[1]. You can
also start with more than one thread by specifying the number of threads *k*
in --master local[k]. You can also start
You can utilize a counter in external storage (e.g. a NoSQL store).
When the counter reaches 2, stop throwing the exception so that the task passes.
FYI
On Sun, Jun 19, 2016 at 3:22 AM, Jacek Laskowski wrote:
> Hi,
>
> Thanks Burak for the idea, but it *only* fails the tasks that
>
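A rough sketch of that external-counter idea, using a file on a filesystem
visible to all executors as a stand-in for the NoSQL store (the path and
helper name are made up): the map function keeps throwing until the recorded
attempt count for the stage/partition reaches 2, after which the task passes.

import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.spark.TaskContext

def failTwiceThenPass[T](record: T): T = {
  val ctx  = TaskContext.get()
  val path = Paths.get(s"/shared/attempts-stage-${ctx.stageId()}-part-${ctx.partitionId()}")
  val attempts =
    if (Files.exists(path)) new String(Files.readAllBytes(path)).trim.toInt else 0
  Files.write(path, (attempts + 1).toString.getBytes,
    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  if (attempts < 2) throw new RuntimeException(s"deliberate failure #${attempts + 1}")
  record
}

// rdd.map(failTwiceThenPass).count()  // fails twice, then succeeds on retry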
Spark works in different modes: local (where neither Spark nor anything else
manages resources), standalone (where Spark itself manages resources),
plus others (see below).
These are from my notes, excluding mesos that I have not used
- Spark Local - Spark runs on the local host. This is the
Have you looked at http://spark.apache.org/docs/latest/ec2-scripts.html ?
There is a description of setting AWS_SECRET_ACCESS_KEY there.
On Sun, Jun 19, 2016 at 4:46 AM, Mohamed Taher AlRefaie
wrote:
> Hello all:
>
> I have an application that requires accessing DynamoDB tables. Each
Hi,
Thanks Burak for the idea, but it *only* fails the tasks, which
eventually fails the entire job; it does not fail a particular stage (just
once or twice) before the entire job fails. The idea is to see the
attempts in the web UI, as there's special handling for cases where a
stage failed once or twice before
Hi,
Thanks for that input. I tried doing that, but apparently that's not working
either. I thought I was having problems with my Spark installation, so I ran a
simple word count and that works, so I am not really sure what the problem
is now.
Is my translation of the Scala code correct? I don't
Hello all:
I have an application that requires accessing DynamoDB tables. Each worker
establishes a connection with the database on its own.
I have added both `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to both the
master's and the workers' `spark-env.sh` files. I have also run the file using
`sh` to
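One hedged alternative to relying on spark-env.sh being sourced on every node
is to push the two variables to the executors from the driver via SparkConf
(here the values are read from the driver's own environment; adjust as needed):

import org.apache.spark.{SparkConf, SparkContext}

// setExecutorEnv makes the variables visible to executor processes,
// regardless of how each worker's spark-env.sh is set up.
val conf = new SparkConf()
  .setAppName("dynamodb-example")
  .setExecutorEnv("AWS_ACCESS_KEY_ID", sys.env("AWS_ACCESS_KEY_ID"))
  .setExecutorEnv("AWS_SECRET_ACCESS_KEY", sys.env("AWS_SECRET_ACCESS_KEY"))
val sc = new SparkContext(conf)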
I think it is good practice not to hold on to the SparkContext in mapFunction.
On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro
wrote:
> How about using `transient` annotations?
>
> // maropu
>
> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv <
>
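A small sketch of that advice (names are illustrative): keep the SparkContext
at the call site and let the object that supplies the map function hold
nothing but pure logic, so closures never capture anything non-serializable.

import org.apache.spark.SparkContext

object MapFunctions {
  // Pure logic only; no SparkContext member to get captured.
  def mapFunction(x: String): String = x.trim.toLowerCase
}

def run(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(Seq("Alpha", "Beta", "Gamma"))
  rdd.map(MapFunctions.mapFunction).foreach(println)
}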
Please help
From: amit assudani
Date: Thursday, June 16, 2016 at 6:11 PM
To: "user@spark.apache.org"
Subject: Update Batch DF with Streaming
Hi All,
Can I update batch DataFrames loaded in memory with streaming data?
For example,
I have
Thank you all, sirs.
Appreciated your clarification, Mich.
On Sunday, 19 June 2016, 19:31, Mich Talebzadeh
wrote:
Thanks Jonathan for your points
I am aware that yarn-client and yarn-cluster are both deprecated (they still
work in 1.6.1), hence the new
Mich, what Jacek is saying is not that you implied that YARN relies on two
masters. He's just clarifying that yarn-client and yarn-cluster modes are
really both using the same (type of) master (simply "yarn"). In fact, if
you specify "--master yarn-client" or "--master yarn-cluster", spark-submit
Thanks Jonathan for your points
I am aware that yarn-client and yarn-cluster are both deprecated (they still
work in 1.6.1), hence the new nomenclature.
Bear in mind this is what I stated in my notes:
"YARN Cluster Mode, the Spark driver runs inside an application master
process which is
I am trying to join a DataFrame (say 100 records) with an ORC file with 500
million records through Spark (this can increase to 4-5 billion records, 25
bytes each).
I used the Spark hiveContext API.
*ORC File Creation Code*
//fsdtRdd is JavaRDD, fsdtSchema is StructType schema
DataFrame fsdtDf =