Re: Switching broadcast mechanism from torrent

2016-06-19 Thread Daniel Haviv
Hi, Just updating on my findings for future reference. The problem was that after refactoring my code I ended up with a Scala object which held a SparkContext as a member, e.g.: object A { val sc: SparkContext = new SparkContext; def mapFunction {} } and when I called rdd.map(A.mapFunction) it
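A minimal sketch of the problematic shape described above (the Int RDD and the SparkConf are illustrative; the real code and the eventual error are not shown in the message):

    import org.apache.spark.{SparkConf, SparkContext}

    // The object holds a SparkContext, so referring to A.mapFunction from a task
    // closure can pull the whole object, including the non-serializable
    // SparkContext, into the serialized task.
    object A {
      val sc: SparkContext = new SparkContext(new SparkConf().setAppName("app"))
      def mapFunction(i: Int): Int = i + 1
    }

    // e.g. A.sc.parallelize(1 to 10).map(A.mapFunction).collect()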

Re: Switching broadcast mechanism from torrent

2016-06-19 Thread Takeshi Yamamuro
How about using the `@transient` annotation? // maropu On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > Just updating on my findings for future reference. > The problem was that after refactoring my code I ended up with a Scala > object which held
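A minimal sketch of that suggestion, reusing the names from the previous message (whether this is sufficient depends on how the object ends up in the closure; not holding the SparkContext in map functions at all, as suggested later in the thread, avoids the issue entirely):

    import org.apache.spark.{SparkConf, SparkContext}

    object A {
      // @transient asks serialization to skip this field, so closures that
      // capture A no longer drag the SparkContext along with them.
      @transient lazy val sc: SparkContext =
        new SparkContext(new SparkConf().setAppName("app"))

      def mapFunction(i: Int): Int = i + 1
    }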

Re: How to cause a stage to fail (using spark-shell)?

2016-06-19 Thread Jacek Laskowski
Mind sharing code? I think only shuffle failures lead to stage failures and re-tries. Jacek On 19 Jun 2016 4:35 p.m., "Ted Yu" wrote: > You can utilize a counter in external storage (NoSQL e.g.) > When the counter reaches 2, stop throwing exception so that the task >

Are ser/de optimizations relevant with the Dataset API and Encoders?

2016-06-19 Thread Amit Sela
With the RDD API, you could optimize shuffles by making sure that bytes are shuffled instead of objects, using the appropriate ser/de mechanism before and after the shuffle. For example: before parallelize, transform to bytes using a dedicated serializer; parallelize; and immediately after
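A hedged sketch of the RDD-level pattern described here (the Record type, the Java-serialization helpers and the input rdd are all illustrative placeholders):

    import java.io._
    import org.apache.spark.rdd.RDD

    case class Record(id: Long, payload: String)  // placeholder record type

    def toBytes(r: Record): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(r); oos.close()
      bos.toByteArray
    }

    def fromBytes(b: Array[Byte]): Record =
      new ObjectInputStream(new ByteArrayInputStream(b)).readObject().asInstanceOf[Record]

    // Serialize before the shuffle so only raw bytes cross the wire,
    // then deserialize on the reduce side.
    def shuffleAsBytes(rdd: RDD[(Int, Record)]): RDD[(Int, Iterable[Record])] =
      rdd.mapValues(toBytes).groupByKey().mapValues(_.map(fromBytes))

With the Dataset API, Encoders already keep rows in Spark's internal binary format, which is what the question is getting at.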

Re: sparkR.init() can not load sparkPackages.

2016-06-19 Thread Sun Rui
Hi, Joseph, This is a known issue but not a bug. The issue does not occur when you use an interactive SparkR session, but it does occur when you execute an R file. The reason is that when you execute an R file, the R backend launches before the R interpreter, so there is no

Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executor and resources? Do I need to start worker threads, and how many apps can I use safely without

Re: Running Spark in local mode

2016-06-19 Thread Takeshi Yamamuro
Hi, In local mode, Spark runs in a single JVM that has a master and one executor with `k` threads. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94 // maropu On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar
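A minimal sketch of starting in local mode from an application (names are illustrative): with --master local[4] the driver, master and single executor all live in this one JVM with 4 worker threads, so no separate worker daemons need to be started.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("local-test").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Runs on the 4 local threads of the single executor.
    println(sc.parallelize(1 to 100, 4).sum())
    sc.stop()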

Re: Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Thank you. What are the main differences between local mode and standalone mode? I understand local mode does not support a cluster. Is that the only difference? On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro wrote: Hi, In local mode, Spark runs in a single

plot important variables in pyspark

2016-06-19 Thread pseudo oduesp
Hi, how can I get a score for each row from classification algorithms, and how can I plot feature importances like in scikit-learn? Thanks.

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-19 Thread Mich Talebzadeh
Hi, To start, when you store the data in the ORC file, can you verify that the data is there? For example, register it as a temp table: processDF.registerTempTable("tmp"); sql("select count(1) from tmp").show Also, what do you mean by an index file in ORC? HTH Dr Mich Talebzadeh LinkedIn *

Re: Running Spark in local mode

2016-06-19 Thread Takeshi Yamamuro
There are many technical differences internally, though usage is almost the same. Yes, in standalone mode, Spark runs as a cluster: see http://spark.apache.org/docs/1.6.1/cluster-overview.html // maropu On Sun, Jun 19, 2016 at 6:14 PM, Ashok Kumar

Re: Running Spark in local mode

2016-06-19 Thread Jacek Laskowski
On Sun, Jun 19, 2016 at 12:30 PM, Mich Talebzadeh wrote: > Spark Local - Spark runs on the local host. This is the simplest set up and > best suited for learners who want to understand different concepts of Spark > and those performing unit testing. There are also the

Re: Running Spark in local mode

2016-06-19 Thread Mich Talebzadeh
Good points, but I am an experimentalist. In local mode I have this: with --master local, Spark will start with one thread, equivalent to --master local[1]. You can also start with more than one thread by specifying the number of threads *k* in --master local[k]. You can also start

Re: How to cause a stage to fail (using spark-shell)?

2016-06-19 Thread Ted Yu
You can utilize a counter in external storage (e.g. NoSQL). When the counter reaches 2, stop throwing the exception so that the task passes. FYI On Sun, Jun 19, 2016 at 3:22 AM, Jacek Laskowski wrote: > Hi, > > Thanks Burak for the idea, but it *only* fails the tasks that >
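A self-contained spark-shell sketch that approximates the same idea without external storage (an assumption on my part, not what Ted literally describes): use TaskContext.attemptNumber() as the per-task counter, so the first two attempts fail and the retry passes. Note this still produces task retries; a stage retry generally needs a shuffle fetch failure, as noted earlier in the thread.

    import org.apache.spark.TaskContext

    val rdd = sc.parallelize(1 to 10, 2).map { i =>
      val ctx = TaskContext.get()
      // Fail each task's first two attempts, then let the third attempt pass.
      if (ctx.attemptNumber() < 2) {
        throw new RuntimeException(
          s"boom in partition ${ctx.partitionId()}, attempt ${ctx.attemptNumber()}")
      }
      i
    }
    rdd.count()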

Re: Running Spark in local mode

2016-06-19 Thread Mich Talebzadeh
Spark works in different modes: local (neither Spark nor anything else manages resources) and standalone (Spark itself manages resources), plus others (see below). These are from my notes, excluding Mesos, which I have not used. - Spark Local - Spark runs on the local host. This is the

Re: Accessing system environment on Spark Worker

2016-06-19 Thread Ted Yu
Have you looked at http://spark.apache.org/docs/latest/ec2-scripts.html ? There is a description of setting AWS_SECRET_ACCESS_KEY. On Sun, Jun 19, 2016 at 4:46 AM, Mohamed Taher AlRefaie wrote: > Hello all: > > I have an application that requires accessing DynamoDB tables. Each

Re: How to cause a stage to fail (using spark-shell)?

2016-06-19 Thread Jacek Laskowski
Hi, Thanks Burak for the idea, but it *only* fails the tasks, which eventually fails the entire job, not a particular stage (just once or twice) before the entire job is failed. The idea is to see the attempts in the web UI, as there's special handling for cases where a stage failed once or twice before

Re: Running Java-Based Implementation of StreamingKmeans

2016-06-19 Thread Biplob Biswas
Hi, Thanks for that input. I tried doing that, but apparently that's not working either. I thought I was having problems with my Spark installation, so I ran a simple word count and that works, so I am not really sure what the problem is now. Is my translation of the Scala code correct? I don't
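For comparison with the Java translation being attempted, here is roughly the shape of the Scala StreamingKMeans example from the Spark docs (paths, k, dimensions and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Training data: one dense vector per line, e.g. "[1.0,2.0]"
    val trainingData = ssc.textFileStream("/path/to/training").map(Vectors.parse)
    // Test data in LabeledPoint text format, e.g. "(1.0,[2.0,3.0])"
    val testData = ssc.textFileStream("/path/to/test").map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()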

Accessing system environment on Spark Worker

2016-06-19 Thread Mohamed Taher AlRefaie
Hello all: I have an application that requires accessing DynamoDB tables. Each worker establishes a connection with the database on its own. I have added both `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to both the master's and the workers' `spark-env.sh` files. I have also run the file using `sh` to
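A hedged sketch of an alternative (the key names come from the message; everything else is illustrative): instead of relying only on spark-env.sh, the driver can push environment variables to the executor JVMs via SparkConf.setExecutorEnv, which populates spark.executorEnv.* for each executor.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dynamodb-app")
      // These become environment variables inside each executor JVM.
      .setExecutorEnv("AWS_ACCESS_KEY_ID", sys.env("AWS_ACCESS_KEY_ID"))
      .setExecutorEnv("AWS_SECRET_ACCESS_KEY", sys.env("AWS_SECRET_ACCESS_KEY"))

    val sc = new SparkContext(conf)

    // Quick check from inside a task that the variables are visible there.
    sc.parallelize(Seq(1)).foreach { _ =>
      println(if (sys.env.contains("AWS_ACCESS_KEY_ID")) "key visible" else "key missing")
    }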

Re: Switching broadcast mechanism from torrent

2016-06-19 Thread Ted Yu
I think it is good practice not to hold on to the SparkContext in mapFunction. On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro wrote: > How about using the `@transient` annotation? > > // maropu > > On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv < >

Re: Update Batch DF with Streaming

2016-06-19 Thread Amit Assudani
Please help. From: amit assudani Date: Thursday, June 16, 2016 at 6:11 PM To: "user@spark.apache.org" Subject: Update Batch DF with Streaming Hi All, Can I update batch data frames loaded in memory with streaming data? For example, I have

Re: Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Thank you all, sirs. Appreciated, Mich, your clarification. On Sunday, 19 June 2016, 19:31, Mich Talebzadeh wrote: Thanks Jonathan for your points. I am aware of the fact that yarn-client and yarn-cluster are both deprecated (they still work in 1.6.1), hence the new

Re: Running Spark in local mode

2016-06-19 Thread Jonathan Kelly
Mich, what Jacek is saying is not that you implied that YARN relies on two masters. He's just clarifying that yarn-client and yarn-cluster modes are really both using the same (type of) master (simply "yarn"). In fact, if you specify "--master yarn-client" or "--master yarn-cluster", spark-submit

Re: Running Spark in local mode

2016-06-19 Thread Mich Talebzadeh
Thanks Jonathan for your points. I am aware of the fact that yarn-client and yarn-cluster are both deprecated (they still work in 1.6.1), hence the new nomenclature. Bear in mind this is what I stated in my notes: "In YARN cluster mode, the Spark driver runs inside an application master process which is

Spark - “min key = null, max key = null” while reading ORC file

2016-06-19 Thread Mohanraj Ragupathiraj
I am trying to join a DataFrame (say 100 records) with an ORC file with 500 million records through Spark (this can increase to 4-5 billion records, 25 bytes each). I used the Spark hiveContext API. *ORC File Creation Code* //fsdtRdd is JavaRDD, fsdtSchema is StructType schema DataFrame fsdtDf =
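The message is cut off here; as a hedged Scala sketch of the overall shape being described (the original uses the Java API, and fsdtRdd, the schema, paths and smallDf are placeholders), writing the large side as ORC with hiveContext and joining it against the small DataFrame might look like:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val hiveContext = new HiveContext(sc)

    // Placeholder schema for the ~500M-record RDD of Rows.
    val fsdtSchema = StructType(Seq(
      StructField("key", StringType, nullable = false),
      StructField("value", StringType, nullable = true)))
    val fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema)  // fsdtRdd: RDD[Row]

    // Write the big side out as ORC.
    fsdtDf.write.format("orc").save("/path/to/fsdt_orc")

    // Read it back and join with the small (~100 row) DataFrame,
    // broadcasting the small side to avoid shuffling the 500M rows.
    val bigDf  = hiveContext.read.format("orc").load("/path/to/fsdt_orc")
    val joined = bigDf.join(broadcast(smallDf), Seq("key"))
    joined.show()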