Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Gourav Sengupta
Hi, Michael's answer will solve the problem in case you are using only a SQL-based solution. Otherwise please refer to the wonderful details mentioned here: https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0 released, Spark 2.1.0 is available in AWS. (note that there is an issue
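
A minimal sketch of the in-application scheduling the linked page covers: several jobs submitted from separate threads of one SparkContext, each assigned to a scheduler pool. The pool names and the FAIR setting are illustrative, not from the thread.

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiJobSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("multi-job-demo")
          .set("spark.scheduler.mode", "FAIR")   // default is FIFO
        val sc = new SparkContext(conf)

        // Jobs submitted from separate threads can run concurrently,
        // each in its own (hypothetical) pool.
        val t1 = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "poolA")
            sc.parallelize(1 to 1000000).map(_ * 2).count()
          }
        })
        val t2 = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "poolB")
            sc.parallelize(1 to 1000000).filter(_ % 3 == 0).count()
          }
        })
        t1.start(); t2.start()
        t1.join(); t2.join()
        sc.stop()
      }
    }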

[Spark 2.0.0] java.util.concurrent.TimeoutException while writing to mongodb from Spark

2017-02-07 Thread Palash Gupta
Hi All, I'm writing a data frame to MongoDB using Stratio/Spark-MongoDB. Initially it was working fine, but when the data volume is high it starts giving me the subject error; details are as follows. Could anybody help me out or suggest what solution I should apply, or how can I

Re: does persistence required for single action ?

2017-02-07 Thread Jörn Franke
Depends on the use case, but a persist before checkpointing can make sense after some of the map steps. > On 8 Feb 2017, at 03:09, Shushant Arora wrote: > > Hi > > I have a workflow like below: > > rdd1 = sc.textFile(input); > rdd2 = rdd1.filter(filterfunc1); >

Re: Dynamic resource allocation to Spark on Mesos

2017-02-07 Thread Sun Rui
Yi Jan, We have been using Spark on Mesos with dynamic allocation enabled, which works and improves the overall cluster utilization. In terms of job, do you mean jobs inside a Spark application or jobs among different applications? Maybe you can read
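
A minimal sketch of the configuration usually involved (values are illustrative; on Mesos the external shuffle service also has to be running on every agent node):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")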

does persistence required for single action ?

2017-02-07 Thread Shushant Arora
Hi I have a workflow like below: rdd1 = sc.textFile(input); rdd2 = rdd1.filter(filterfunc1); rdd3 = rdd1.filter(filterfunc2); rdd4 = rdd2.map(maptrans1); rdd5 = rdd3.map(maptrans2); rdd6 = rdd4.union(rdd5); rdd6.foreach(some transformation); [image: Inline image 1] 1. Do I need to
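
A sketch of the workflow above with rdd1 persisted; the filter/map functions are placeholders, but the point is that rdd1 feeds two branches, so persisting it keeps the input from being scanned once per branch even though there is only one action at the end.

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    def runWorkflow(sc: SparkContext, input: String): Unit = {
      val rdd1 = sc.textFile(input).persist(StorageLevel.MEMORY_AND_DISK)
      val rdd2 = rdd1.filter(_.contains("a"))   // stands in for filterfunc1
      val rdd3 = rdd1.filter(_.contains("b"))   // stands in for filterfunc2
      val rdd4 = rdd2.map(_.toUpperCase)        // stands in for maptrans1
      val rdd5 = rdd3.map(_.toLowerCase)        // stands in for maptrans2
      val rdd6 = rdd4.union(rdd5)
      rdd6.foreach(line => println(line))       // the single action
      rdd1.unpersist()
    }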

Re: Resource Leak in Spark Streaming

2017-02-07 Thread Nipun Arora
I located the issue: having the following in the pool object seems to be necessary to make it serializable: private transient ConcurrentLinkedQueue<...> pool; However this means open connections cannot be re-used in subsequent micro-batches, as transient objects are
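
A common alternative (a sketch, not Nipun's code) is to keep the pool in a per-JVM singleton object, so nothing is serialized with the task closure and connections survive across micro-batches. The Connection trait below is hypothetical, standing in for whatever the real pool holds.

    import java.util.concurrent.ConcurrentLinkedQueue

    trait Connection { def send(s: String): Unit; def close(): Unit }

    object ConnectionPool {
      // Lives once per executor JVM; never shipped inside a closure.
      private val pool = new ConcurrentLinkedQueue[Connection]()

      def borrow(): Connection = {
        val c = pool.poll()
        if (c != null) c else createConnection()   // create on demand
      }

      def giveBack(c: Connection): Unit = pool.offer(c)

      private def createConnection(): Connection = new Connection {
        def send(s: String): Unit = ()   // placeholder implementation
        def close(): Unit = ()
      }
    }

    // Usage sketch inside the streaming job:
    //   dstream.foreachRDD { rdd =>
    //     rdd.foreachPartition { iter =>
    //       val c = ConnectionPool.borrow()
    //       iter.foreach(r => c.send(r.toString))
    //       ConnectionPool.giveBack(c)
    //     }
    //   }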

Re: Resource Leak in Spark Streaming

2017-02-07 Thread Nipun Arora
Ryan, apologies for coming back so late. I created a github repo to resolve this problem. On trying your solution of making the pool a singleton, I get a null pointer exception in the worker. Do you have any other suggestions, or a simpler mechanism for handling this? I have put all the current

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Amit Sela
I'm updating the Broadcast between batches, but I've ended up doing it in a listener, thanks! On Wed, Feb 8, 2017 at 12:31 AM Tathagata Das wrote: > broadcasts are not saved in checkpoints. so you have to save it externally > yourself, and recover it before

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Everett Anderson
On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust wrote: > I think the fastest way is likely to use a combination of conditionals > (when / otherwise), first (ignoring nulls), while grouping by the id. > This should get the answer with only a single shuffle. > > Here is an

Re: Exception in spark streaming + kafka direct app

2017-02-07 Thread Srikanth
This is running in YARN cluster mode. It was restarted automatically and continued fine. I was trying to see what went wrong. AFAIK there were no task failures. Nothing in executor logs. The log I gave is from the driver. After some digging, I did see that there was a rebalance in the kafka logs around this

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Michael Segel
Why couldn’t you use the Spark Thrift Server? On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca wrote: Answer for Gourav Sengupta: I want to use the same Spark application because I want it to work as a FIFO scheduler. My problem is that I have

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Tathagata Das
Broadcasts are not saved in checkpoints, so you have to save it externally yourself, and recover it before restarting the stream from checkpoints. On Tue, Feb 7, 2017 at 3:55 PM, Amit Sela wrote: > I know this approach; the only thing is, it relies on the transformation being

Re: Exception in spark streaming + kafka direct app

2017-02-07 Thread Tathagata Das
Does restarting after a few minutes solve the problem? It could be a transient issue that lasts long enough for spark task-level retries to all fail. On Tue, Feb 7, 2017 at 4:34 PM, Srikanth wrote: > Hello, > > I had a spark streaming app that reads from kafka running for a

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Michael Armbrust
I think the fastest way is likely to use a combination of conditionals (when / otherwise), first (ignoring nulls), while grouping by the id. This should get the answer with only a single shuffle. Here is an example
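
A reconstruction of that kind of query (not Michael's exact code) against the table from the original post, assuming at most three priorities per id; the second Amy row is made-up sample data standing in for the truncated example.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, first, when}

    val spark = SparkSession.builder().appName("unexplode").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "Fred", 8, "value1", 1), (1, "Fred", 8, "value8", 2), (1, "Fred", 8, "value5", 3),
      (2, "Amy",  9, "value3", 1), (2, "Amy",  9, "value5", 2)   // second Amy row is invented
    ).toDF("id", "name", "extra", "data", "priority")

    // One conditional column per priority, collapsed with first(..., ignoreNulls)
    // while grouping by the id columns -- a single shuffle.
    val denormalized = df
      .groupBy("id", "name", "extra")
      .agg(
        first(when(col("priority") === 1, col("data")), ignoreNulls = true).as("data1"),
        first(when(col("priority") === 2, col("data")), ignoreNulls = true).as("data2"),
        first(when(col("priority") === 3, col("data")), ignoreNulls = true).as("data3"))

    denormalized.show()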

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Jacek Laskowski
Hi Everett, That's pretty much what I'd do. Can't think of a way to beat your solution. Why do you "feel vaguely uneasy about it"? I'd also check out the execution plan (with explain) to see how it's gonna work at runtime. I may have seen groupBy + join be better than window (there were more

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Everett Anderson
On Tue, Feb 7, 2017 at 12:50 PM, Jacek Laskowski wrote: > Hi, > > Could groupBy and withColumn or UDAF work perhaps? I think window could > help here too. > This seems to work, but I do feel vaguely uneasy about it. :) // First add a 'rank' column which is priority order just
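
A reconstruction (not Everett's actual code) of the rank-then-collapse idea: number rows per id by priority with a window, then pivot on the rank. The sample data and the rank bound of 3 are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, first, row_number}

    val spark = SparkSession.builder().appName("unexplode-window").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "Fred", 8, "value1", 1), (1, "Fred", 8, "value8", 2), (1, "Fred", 8, "value5", 3),
      (2, "Amy",  9, "value3", 1)
    ).toDF("id", "name", "extra", "data", "priority")

    // Rank each row within its id by priority order.
    val ranked = df.withColumn(
      "rank", row_number().over(Window.partitionBy("id").orderBy(col("priority"))))

    // Collapse to one row per id, with one column per rank value.
    val collapsed = ranked
      .groupBy("id", "name", "extra")
      .pivot("rank", Seq(1, 2, 3))          // bound the pivot to ranks 1..3
      .agg(first(col("data")))

    collapsed.show()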

Exception in spark streaming + kafka direct app

2017-02-07 Thread Srikanth
Hello, I had a spark streaming app that reads from kafka running for a few hours, after which it failed with error 17/02/07 20:04:10 ERROR JobScheduler: Error generating jobs for time 148649785 ms java.lang.IllegalStateException: No current assignment for partition mt_event-5 at

Re: submit a spark code on google cloud

2017-02-07 Thread Dinko Srkoč
Getting to the Spark web UI when Spark is running on Dataproc is not that straightforward. Connecting to that web interface is a two step process: 1. create an SSH tunnel 2. configure the browser to use a SOCKS proxy to connect The above steps are described here:

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Amit Sela
I know this approach; the only thing is, it relies on the transformation being an RDD transformation as well, and so could be applied via foreachRDD, using the rdd context to avoid a stale context after recovery/resume. My question is how to avoid a stale context in a DStream-only transformation such

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Jacek Laskowski
Hi, Could groupBy and withColumn or UDAF work perhaps? I think window could help here too. Jacek On 7 Feb 2017 8:02 p.m., "Everett Anderson" wrote: > Hi, > > I'm trying to un-explode or denormalize a table like > > +---++-+--++ > |id

Re: submit a spark code on google cloud

2017-02-07 Thread Jacek Laskowski
Hi, I know nothing about Spark in GCP, so answering this for pure Spark. Can you use the web UI and the Executors tab, or a SparkListener? Jacek On 7 Feb 2017 5:33 p.m., "Anahita Talebi" wrote: Hello Friends, I am trying to run a spark code on multiple machines. To this
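
A minimal SparkListener sketch (mine, not from the thread) that reports which host each executor is added on and where each task ran; it can be registered with sc.addSparkListener or via the spark.extraListeners configuration property.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerTaskEnd}

    class NodeReportingListener extends SparkListener {
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
        println(s"Executor ${e.executorId} added on host ${e.executorInfo.executorHost}")

      override def onTaskEnd(t: SparkListenerTaskEnd): Unit =
        println(s"Task ${t.taskInfo.taskId} ran on ${t.taskInfo.host}")
    }

    // Registration, e.g. right after creating the SparkContext:
    //   sc.addSparkListener(new NodeReportingListener())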

Re: How to get a spark sql statement implement duration ?

2017-02-07 Thread Jacek Laskowski
On 7 Feb 2017 4:17 a.m., "Mars Xu" wrote: Hello All, some Spark SQL statements will produce one or more jobs. I have 2 questions: 1. How is cc.sql(“sql statement”) divided into one or more jobs? It's an implementation detail. You can have zero or more jobs

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-07 Thread Jacek Laskowski
Hi, Have you considered foreach sink? Jacek On 6 Feb 2017 8:39 p.m., "Egor Pahomov" wrote: > Hi, I'm thinking of using Structured Streaming instead of old streaming, > but I need to be able to save results to Hive table. Documentation for file > sink
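
A rough sketch of the foreach sink idea: a ForeachWriter receives every output row, and what it does with it (JDBC to HiveServer2, appending files under the table location, ...) is left to the implementation. The class below is a skeleton, not a working Hive writer.

    import org.apache.spark.sql.{ForeachWriter, Row}

    class HiveForeachWriter extends ForeachWriter[Row] {
      override def open(partitionId: Long, version: Long): Boolean = {
        // open a connection/transaction for this partition + epoch;
        // return false to skip it (e.g. if this epoch was already written)
        true
      }
      override def process(row: Row): Unit = {
        // buffer or write the row to the target table here
      }
      override def close(errorCause: Throwable): Unit = {
        // commit and close; errorCause is non-null if the epoch failed
      }
    }

    // Usage sketch:
    //   streamingDF.writeStream.foreach(new HiveForeachWriter).start()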

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Cosmin Posteuca
Answer for Gourav Sengupta: I want to use the same Spark application because I want it to work as a FIFO scheduler. My problem is that I have many jobs (not so big), and if I run an application for every job my cluster will split resources like a FAIR scheduler (it's what I observe, maybe I'm wrong) and exist

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Shixiong(Ryan) Zhu
It's documented here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints On Tue, Feb 7, 2017 at 8:12 AM, Amit Sela wrote: > Hi all, > > I was wondering if anyone ever used a broadcast variable within > an
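
The pattern described there is a lazily instantiated singleton, so the broadcast is re-created on the new SparkContext after recovery rather than restored from the checkpoint (it isn't saved there). A sketch modeled on the guide's example; the blacklist contents are just an example.

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object WordBlacklist {
      @volatile private var instance: Broadcast[Seq[String]] = null

      def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Seq("a", "b", "c"))
            }
          }
        }
        instance
      }
    }

    // In a DStream transformation, fetch it through the RDD's context:
    //   dstream.transform { rdd =>
    //     val blacklist = WordBlacklist.getInstance(rdd.sparkContext)
    //     rdd.filter(w => !blacklist.value.contains(w))
    //   }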

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Cosmin Posteuca
Response for Vincent: Thanks for the answer! Yes, I need a business solution; that's the reason why I can't use the Spark jobserver or Livy solutions. I will look at your github to see how to build such a system. But I don't understand: why doesn't Spark have a solution for this kind of problem? And

Re: NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-07 Thread Jacek Laskowski
Hi, I may have seen this issue already... What's the cluster manager? How do you spark-submit? Jacek On 7 Feb 2017 7:44 p.m., "dgoldenberg" wrote: Hi, Any reason why we might be getting this error? The code seems to work fine in the non-distributed mode but the

Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Everett Anderson
Hi, I'm trying to un-explode or denormalize a table like

+---+----+-----+------+--------+
|id |name|extra|data  |priority|
+---+----+-----+------+--------+
|1  |Fred|8    |value1|1       |
|1  |Fred|8    |value8|2       |
|1  |Fred|8    |value5|3       |
|2  |Amy |9    |value3|1       |
|2  |Amy

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread Shixiong(Ryan) Zhu
You can create lazily instantiated singleton instances. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints for examples of accumulators and broadcast variables. You can use the same approach to create your cached RDD. On Tue,

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
and my cached RDD is not small. If it were, maybe I could materialize and broadcast it. Thanks On Tue, Feb 7, 2017 at 10:28 AM, shyla deshpande wrote: > I have a situation similar to the following and I get SPARK-13758 >

NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-07 Thread dgoldenberg
Hi, Any reason why we might be getting this error? The code seems to work fine in the non-distributed mode but the same code when run from a Spark job is not able to get to Elastic. Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11 Elastic version: 2.3.1 I've verified the Elastic hosts and

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Michael Gummelt
> Looking into Mesos attributes this seems the perfect fit for it. Is that correct? Yes. On Tue, Feb 7, 2017 at 3:43 AM, Muhammad Asif Abbasi wrote: > YARN provides the concept of node labels. You should explore the > "spark.yarn.executor.nodeLabelConfiguration"

Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
I have a situation similar to the following and I get SPARK-13758 . I understand why I get this error, but I want to know what should be the approach in dealing with these situations. Thanks > var cached =

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Gourav Sengupta
Hi, May I ask the reason for using the same spark application? Is it because of the time it takes in order to start a spark context? On another note you may want to look at the number of contributors in a github repo before choosing a solution. Regards, Gourav On Tue, Feb 7, 2017 at 5:26 PM,

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread vincent gromakowski
Spark jobserver or Livy server are the best options for a pure technical API. If you want to publish a business API you will probably have to build your own app, like the one I wrote a year ago: https://github.com/elppc/akka-spark-experiments It combines Akka actors and a shared Spark context to serve

Re: About saving a model file

2017-02-07 Thread durgaswaroop
Did you find any solution? Are you using a Spark cluster?

No topicDistributions(..) method in ml.clustering.LocalLDAModel

2017-02-07 Thread sachintyagi22
Hi, I was using ml.clustering.LDA for topic modelling (with online optimizer) and it returns ml.clustering.LocalLDAModel. However, using this model there doesn't seem to be any way to get the topic distribution over documents. While older mllib API (mllib.clustering.LocalLDAModel ) does have the
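
One possible workaround, assuming (as I recall the spark.ml API) that LDAModel.transform adds a per-document topic-distribution vector column named by the topicDistributionCol param ("topicDistribution" by default); `docs` is a DataFrame with a "features" vector column.

    import org.apache.spark.ml.clustering.LDA
    import org.apache.spark.sql.DataFrame

    def topicDistributions(docs: DataFrame): DataFrame = {
      val lda = new LDA()
        .setK(10)
        .setMaxIter(20)
        .setOptimizer("online")
      val model = lda.fit(docs)
      // transform should append the per-document topic distribution column
      model.transform(docs).select("topicDistribution")
    }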

submit a spark code on google cloud

2017-02-07 Thread Anahita Talebi
Hello Friends, I am trying to run a spark code on multiple machines. To this aim, I submit a spark code on submit job on google cloud platform. https://cloud.google.com/dataproc/docs/guides/submit-job I have created a cluster with 6 nodes. Does anyone know how I can realize which nodes are

Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Amit Sela
Hi all, I was wondering if anyone ever used a broadcast variable within an updateStateByKey op? Using it is straight-forward, but I was wondering how it'll work after resuming from checkpoint (using the rdd.context() trick is not possible here)? Thanks, Amit

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread ayan guha
I think you are looking for Livy or Spark jobserver. On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca wrote: > I want to run different jobs on demand with the same Spark context, but I > don't know exactly how I can do this. > > I tried to get the current context, but it seems it

[Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Cosmin Posteuca
I want to run different jobs on demand with the same Spark context, but I don't know exactly how I can do this. I tried to get the current context, but it seems it creates a new Spark context (with new executors). I call spark-submit to add new jobs. I run the code on Amazon EMR (3 instances, 4 cores & 16 GB RAM /

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Muhammad Asif Abbasi
YARN provides the concept of node labels. You should explore the "spark.yarn.executor.nodeLabelExpression" property. Cheers, Asif Abbasi On Tue, 7 Feb 2017 at 10:21, Alvaro Brandon wrote: > Hello all: > > I have the following scenario. > - I have a cluster of 50
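
A sketch (not from the thread) of the corresponding configuration; the label name "highmem" is made up, and the nodes must already carry that label on the YARN side.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.executor.nodeLabelExpression", "highmem")
      .set("spark.yarn.am.nodeLabelExpression", "highmem")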

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Alvaro Brandon
I want to scale up or down the number of machines used, depending on the SLA of a job. For example if I have a low priority job I will give it 10 machines, while a high priority will be given 50. Also I want to choose subsets depending on the hardware. For example "Launch this job only on machines

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Jörn Franke
If you want to run them always on the same machines, use YARN node labels. If it is any 10 machines, then use the capacity or fair scheduler. What is the use case for running it always on the same 10 machines? If it is for licensing reasons then I would ask your vendor if this is a suitable means to

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Alvaro Brandon
Hello Pavel: Thanks for the pointers. For the standalone cluster manager: I understand that I just have to start several masters with a subset of slaves attached. Then each master will listen on a different host/port pair, allowing me to spark-submit to any of these pairs depending on the

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Pavel Plotnikov
Hi, Alvaro. You can create different clusters using the standalone cluster manager, and then manage subsets of machines by submitting applications to different masters. Or you can use Mesos attributes to mark a subset of workers and specify it in spark.mesos.constraints On Tue, Feb 7, 2017 at 1:21
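
A sketch (not from the thread) of the Mesos-attributes route; the attribute "rack:us-east-1" is made up and has to be advertised by the agents themselves.

    import org.apache.spark.SparkConf

    // Only Mesos agents whose attributes satisfy this constraint will be
    // offered to the application.
    val conf = new SparkConf()
      .set("spark.mesos.constraints", "rack:us-east-1")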

Launching an Spark application in a subset of machines

2017-02-07 Thread Alvaro Brandon
Hello all: I have the following scenario. - I have a cluster of 50 machines with Hadoop and Spark installed on them. - I want to launch one Spark application through spark-submit. However I want this application to run on only a subset of these machines, disregarding data locality. (e.g. 10

Re: spark architecture question -- Please Read

2017-02-07 Thread Alex
Hi All, So will there be any performance difference if we recode the entire logic to Spark SQL, instead of running the Hive Java native UDFs in spark-shell using a Hive context? Or does Spark convert the Hive Java UDF to Spark SQL code anyway, so we don't need to rewrite the entire logic in Spark SQL?