Re: Spark Streaming Suggestion

2015-09-16 Thread srungarapu vamsi
@David, I am going through the articles you have shared. Will message you if I need any help. Thanks. @Ayan, yes, it looks like I can get everything done with Spark Streaming. In fact we already have Storm in the architecture sanitizing the data and dumping it into Cassandra. Now, I got some new

Idle time between jobs

2015-09-16 Thread patcharee
Hi, I am using Spark 1.5. I have a Spark application which is divided into several jobs. I noticed from the Event Timeline in the Spark History UI that there was idle time between jobs. See below: job 1 was submitted at 11:20:49 and finished at 11:20:52, but job 2 was submitted "16s" later (at

Re: Managing scheduling delay in Spark Streaming

2015-09-16 Thread Dibyendu Bhattacharya
Hi Michal, If you use https://github.com/dibbhatt/kafka-spark-consumer , it comes with its own built-in back-pressure mechanism. By default this is disabled; you need to enable it to use this feature with this consumer. It controls the rate based on the scheduling delay at runtime. Regards,

Re: How to avoid shuffle errors for a large join ?

2015-09-16 Thread Reynold Xin
Only SQL and DataFrame for now. We are thinking about how to apply that to a more general distributed collection based API, but it's not in 1.5. On Sat, Sep 5, 2015 at 11:56 AM, Gurvinder Singh wrote: > On 09/05/2015 11:22 AM, Reynold Xin wrote: > > Try increase

RE: Getting parent RDD

2015-09-16 Thread Samya MAITI
Hi Akhil, I suppose this will give me the transformed msg & not the original msg. I need the data corresponding to msgStream & not wordCountPair. As per my understanding, we need to keep a copy of the incoming stream (not sure how), so as to refer to it in the catch block. Regards, Sam From: Akhil

Re: Getting parent RDD

2015-09-16 Thread Akhil Das
If I understand it correctly, then in your transform function make it return DStream[(OldMsg, TransformedMsg)] and then in the try...catch you can access ._1 for the old msg and ._2 for the transformed msg. Thanks Best Regards On Wed, Sep 16, 2015 at 2:46 PM, Samya MAITI wrote:
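
A minimal Scala sketch of the pattern Akhil describes (carrying the original message alongside its transformed form); the socket source, the parse function and all names here are stand-ins, not the poster's actual code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object KeepOriginalMessage {
      // Stand-in for whatever transformation is applied to the raw message.
      def parse(msg: String): Array[String] = msg.split(" ")

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("KeepOriginalMessage"), Seconds(10))
        val msgStream = ssc.socketTextStream("localhost", 9999) // stand-in for the Kafka direct stream

        // Keep the original message alongside the transformed one.
        val pairStream = msgStream.map(msg => (msg, parse(msg)))

        pairStream.foreachRDD { rdd =>
          rdd.foreach { case (original, transformed) =>
            try {
              println(transformed.mkString(",")) // stand-in for the real processing
            } catch {
              case e: Exception => println(s"failed on original message: $original")
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }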

Re: Managing scheduling delay in Spark Streaming

2015-09-16 Thread Akhil Das
I had a workaround for exactly the same scenario http://apache-spark-developers-list.1001551.n3.nabble.com/SparkStreaming-Workaround-for-BlockNotFound-Exceptions-td12096.html Apart from that, if you are using this consumer https://github.com/dibbhatt/kafka-spark-consumer it also has a built-in

Re: Getting parent RDD

2015-09-16 Thread Akhil Das
How many RDDs do you have in that stream? If it's a single RDD then you could do a .foreach and log the message, something like: val ssc = ... val msgStream = ... // SparkKafkaDirectAPI val wordCountPair = TransformStream.transform(msgStream) wordCountPair.foreach( msg => try{

Re: Difference between sparkDriver and "executor ID driver"

2015-09-16 Thread Hemant Bhanawat
1. When you call new SparkContext(), the Spark driver is started, which internally creates an Akka ActorSystem that registers on this port. 2. Since you are running in local mode, the starting of an executor is short-circuited and an Executor object is created in the same process (see LocalEndpoint). This

Spark streaming on spark-standalone/ yarn inside Spring XD

2015-09-16 Thread Vignesh Radhakrishnan
Hi, I am trying to run a Spark processor on Spring XD for a streaming operation. The Spark processor module on Spring XD works when Spark is pointing to local. The processor fails to run when we point Spark to Spark standalone (running on the same machine) or yarn-client. Is it possible to run

How to update python code in memory

2015-09-16 Thread Margus Roo
Hi, for example I submitted Python code to the cluster: bin/spark-submit --master spark://nn1:7077 SocketListen.py Now I discovered that I have to change something in SocketListen.py. One way is to stop the old job and submit a new one. Is there a way to change the code on the worker machines so that there is no need to

Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Akhil Das
You can't really recover from checkpoint if you alter the code. A better approach would be to use some sort of external storage (like a db or zookeeper etc) to keep the state (the indexes etc) and then when you deploy new code they can be easily recovered. Thanks Best Regards On Wed, Sep 16,

Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
Will StreamingContext.getOrCreate do this work? What kind of code change will make it unable to load? Akhil Das wrote on Wednesday, 16 September 2015 at 20:20: > You can't really recover from checkpoint if you alter the code. A better > approach would be to use some sort of external storage (like
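
For reference, a minimal sketch of the StreamingContext.getOrCreate pattern Bin asks about (checkpoint path, source and names are placeholders); recovery only works while the compiled DStream graph still matches what was checkpointed:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointRecovery {
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder path

      def createContext(): StreamingContext = {
        val ssc = new StreamingContext(new SparkConf().setAppName("CheckpointRecovery"), Seconds(10))
        ssc.checkpoint(checkpointDir)
        // Stand-in DStream graph; the real sources, updateStateByKey and outputs go here.
        ssc.socketTextStream("localhost", 9999).print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Rebuilds the context (including DStream state) from the checkpoint if one exists,
        // otherwise calls createContext(). As Akhil notes, recovery can fail once the
        // deployed code no longer matches the checkpointed DStream graph.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }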

How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recover a DStream from a checkpoint. Because I store state in the DStream, I'd like the state to be recovered when I restart the application and deploy new code.

RE: application failed on large dataset

2015-09-16 Thread java8964
Can you try "nio" instead of "netty"? Set "spark.shuffle.blockTransferService" to "nio" and give it a try. Yong From: z.qian...@gmail.com Date: Wed, 16 Sep 2015 03:21:02 + Subject: Re: application failed on large dataset To: java8...@hotmail.com; user@spark.apache.org Hi, after
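
A sketch of where that setting goes (the app name is a placeholder); note this option belongs to the 1.x line discussed here and the nio implementation was dropped in later Spark releases:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("LargeDatasetJob") // placeholder
      .set("spark.shuffle.blockTransferService", "nio") // default is "netty"
    // Equivalent at submit time:
    //   spark-submit --conf spark.shuffle.blockTransferService=nio ...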

Spark Thrift Server JDBC Drivers

2015-09-16 Thread Daniel Haviv
Hi, are there any free JDBC drivers for the Thrift server? The only ones I could find are Simba's, which require a license. Thanks, Daniel

Re: application failed on large dataset

2015-09-16 Thread 周千昊
Hi, I have switched 'spark.shuffle.blockTransferService' to 'nio', but the problem still exists. However, the stack trace is a little bit different: PART one: 15/09/16 06:20:32 ERROR executor.Executor: Exception in task 1.2 in stage 15.0 (TID 5341) java.io.IOException: Failed without being ACK'd

Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-16 Thread Ali Hadian
Thanks for your response, Alexis. I have seen this page, but its suggested solutions do not work and the tmp space still grows linearly after unpersisting RDDs and calling System.gc() in each iteration. I think it might be due to one of the following reasons: 1. System.gc() does not directly

Spark on YARN / aws - executor lost on node restart

2015-09-16 Thread Adrian Tanase
Hi all, We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a stateful app that reads from kafka (with the new direct API) and we’re checkpointing to HDFS. During some resilience testing, we restarted one of the machines and brought it back online. During the offline

How to calculate average from multiple values

2015-09-16 Thread diplomatic Guru
I have a mapper that emits key/value pairs (composite keys and composite values separated by commas). e.g. *key:* a,b,c,d *Value:* 1,2,3,4,5 *key:* a1,b1,c1,d1 *Value:* 5,4,3,2,1 ... ... *key:* a,b,c,d *Value:* 5,4,3,2,1 I could easily SUM these values using reduceByKey. e.g. reduceByKey(new
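
One way to get averages while keeping the same reduceByKey shape, sketched here in Scala with made-up sample data: carry an element-wise sum plus a count, then divide at the end.

    import org.apache.spark.{SparkConf, SparkContext}

    object CompositeAverage {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CompositeAverage"))

        // Stand-in for the mapper output: (composite key, composite value)
        val pairs = sc.parallelize(Seq(
          ("a,b,c,d", "1,2,3,4,5"),
          ("a,b,c,d", "5,4,3,2,1")
        ))

        val averages = pairs
          .mapValues(v => (v.split(",").map(_.toDouble), 1L))          // (element-wise values, count)
          .reduceByKey { case ((sums1, n1), (sums2, n2)) =>
            (sums1.zip(sums2).map { case (x, y) => x + y }, n1 + n2)   // element-wise sum, total count
          }
          .mapValues { case (sums, n) => sums.map(_ / n) }             // divide to get the averages

        averages.collect().foreach { case (k, avg) => println(s"$k -> ${avg.mkString(",")}") }
        sc.stop()
      }
    }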

Re: Spark Streaming application code change and stateful transformations

2015-09-16 Thread Ofir Kerker
Thanks Cody! The 2nd solution is safer but seems wasteful :/ I'll try to optimize it by keeping, in addition to the 'last-complete-hour', the corresponding offsets that bound the incomplete data, to try and fast-forward only the last couple of hours in the worst case. On Mon, Sep 14, 2015 at 22:14

RE: application failed on large dataset

2015-09-16 Thread java8964
This sounds like a memory issue. Do you enable the GC output? When this is happening, are your executors doing full gc? How long is the full gc? Yong From: qhz...@apache.org Date: Wed, 16 Sep 2015 13:52:25 + Subject: Re: application failed on large dataset To: java8...@hotmail.com;
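
A sketch of one way to turn on that GC output on the executors (spark.executor.extraJavaOptions is a standard Spark setting; the exact HotSpot flag set is a suggestion), so full-GC activity shows up in the executor logs:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")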

Re: spark performance - executor computing time

2015-09-16 Thread Robin East
Is this repeatable? Do you always get one or two executors that are 6 times as slow? It could be that some of your tasks have more work to do (maybe you are filtering some records out?). If it's always one particular worker node, is there something about the machine configuration (e.g. CPU speed)

Spark Cassandra Filtering

2015-09-16 Thread Ashish Soni
Hi, how can I pass a dynamic value into the function below to filter on, instead of a hardcoded one? I have an existing RDD and I would like to use data from it for the filter, so instead of doing .where("name=?","Anna") I want to do .where("name=?",someobject.value). Please help. JavaRDD rdd3 =
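
A sketch of the dynamic variant using the DataStax spark-cassandra-connector's Scala API (keyspace, table and the name parameter are hypothetical); the second argument to where() can be any value available at runtime, not just a literal:

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext

    // `name` can come from any object at runtime, e.g. someobject.value
    def usersNamed(sc: SparkContext, name: String) =
      sc.cassandraTable("my_keyspace", "users")  // hypothetical keyspace/table
        .where("name = ?", name)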

Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-16 Thread Alexis Gillain
OK, I just realized you don't use the MLlib PageRank. You must use checkpointing, as pointed out in the Databricks URL. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala Because of the lineage, Spark doesn't erase the shuffle files. When you
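
A rough sketch of that checkpoint-every-few-iterations pattern for a hand-rolled iterative job (the update step, paths and interval are placeholders); truncating the lineage is what lets Spark clean up the old shuffle files:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("IterativeJob"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder

    var ranks = sc.parallelize(1L to 1000L).map(id => (id, 1.0)) // placeholder initial state
    for (i <- 1 to 30) {
      val next = ranks.mapValues(r => 0.85 * r + 0.15).cache()   // placeholder update step
      if (i % 5 == 0) next.checkpoint()  // every few iterations, cut the lineage
      next.count()                       // materialize before dropping the previous iteration
      ranks.unpersist(blocking = false)
      ranks = next
    }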

Re: mappartition's FlatMapFunction help

2015-09-16 Thread Ankur Srivastava
Good to know it worked for you. CC'ed the user group so that the thread reaches closure. Thanks Ankur On Wed, Sep 16, 2015 at 6:13 AM, Thiago Diniz wrote: > Nailed it. > > Thank you! > > 2015-09-15 14:39 GMT-03:00 Ankur Srivastava : > >> Hi,

Incorrect results with spark sql

2015-09-16 Thread gpatcham
Hi, I'm trying to query a Hive view using Spark and it is giving different row counts compared to Hive. Here is the view definition in Hive: create view test_hive_view as select col1 , col2 from tab1 left join tab2 on tab1.col1 = tab2.col1 left join tab3 on tab1.col1 = tab3.col1 where col1

Spark SQL 'create table' options

2015-09-16 Thread Dan LaBar
The SQL programming guide provides an example for creating a table using Spark SQL: CREATE TEMPORARY TABLE parquetTable USING org.apache.spark.sql.parquet OPTIONS ( path
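
The example being quoted, reconstructed from the Spark 1.5 SQL programming guide (the path is the guide's, not the poster's), registers a Parquet file through the data sources API; it can be run from the shell's sqlContext:

    sqlContext.sql("""
      CREATE TEMPORARY TABLE parquetTable
      USING org.apache.spark.sql.parquet
      OPTIONS (
        path "examples/src/main/resources/people.parquet"
      )
    """)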

problem with a very simple word count program

2015-09-16 Thread huajun
Hi. I have a problem with this very simple word count program. The program works fine for thousands of similar files in the dataset but is very slow for these first 28 or so. The files are about 50 to 100 MB each, and the program processes another 28 similar files in about 30 sec. These first 28 files,

Re: problem with a very simple word count program

2015-09-16 Thread Shawn Carroll
Your loop is deciding which files to process and then you are unioning the data on each iteration. If you change it to load all the files at the same time and let Spark sort it out, it should be much faster. Untested: val rdd = sc.textFile("medline15n00*.xml") val words = rdd.flatMap( x=>
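
A fuller version of that untested snippet (the tokenizer and output path are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    object GlobWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GlobWordCount"))
        val rdd = sc.textFile("medline15n00*.xml")   // the glob picks up all the files in one call
        val counts = rdd.flatMap(_.split("\\s+"))    // assumed whitespace tokenizer
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
        counts.saveAsTextFile("wordcounts")          // assumed output path
        sc.stop()
      }
    }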

SparkR - calling as.vector() with rdd dataframe causes error

2015-09-16 Thread ekraffmiller
Hi, I have a library of clustering algorithms that I'm trying to run in the SparkR interactive shell. (I am working on a proof of concept for a document classification tool.) Each algorithm takes a term document matrix in the form of a dataframe. When I pass the method a local dataframe, the

Re: why when I double the number of workers, ml LogisticRegression fitting time is not reduced in half?

2015-09-16 Thread Robineast
In principle yes, however it depends on whether your application is actually utilising the extra resources. Use the Task metrics available in the application UI (usually available from the driver machine on port 4040) to find out. -- Robin East Spark GraphX

Re: Spark streaming on spark-standalone/ yarn inside Spring XD

2015-09-16 Thread Tathagata Das
I would check the following. See if your setup (Spark master, etc.) is correct for running simple applications on YARN/standalone, like the SparkPi example. If that does not work, then the problem is elsewhere. If that works, then the problem could be in Spring XD. On Wed, Sep 16, 2015 at

Re: Spark streaming on spark-standalone/ yarn inside Spring XD

2015-09-16 Thread Vignesh Radhakrishnan
Yes it is, TD. I'm able to run word count etc. on Spark standalone/YARN when it's not integrated with Spring XD, but the same breaks when used as a processor on Spring XD. I was trying to get an opinion on whether it's doable or something that's not supported at the moment. On 16 Sep 2015 23:50,

unpersist RDD from another thread

2015-09-16 Thread Paul Weiss
Hi, What is the behavior when calling rdd.unpersist() from a different thread while another thread is using that rdd. Below is a simple case for this: 1) create rdd and load data 2) call rdd.cache() to bring data into memory 3) create another thread and pass rdd for a long computation 4) call

DataFrame repartition not repartitioning

2015-09-16 Thread Steve Annessa
Hello, I'm trying to load in an Avro file and write it out as Parquet. I would like to have enough partitions to properly parallelize on. When I do the simple load and save I get 1 partition out. I thought I would be able to use repartition like the following: val avroFile =

why when I double the number of workers, ml LogisticRegression fitting time is not reduced in half?

2015-09-16 Thread julia
Hi all, I run the following LogisticRegression code (ml classification class) with 14 and 28 workers respectively (2 cores/worker, 12G/worker), and the fitting times are almost the same: 11.25 vs 10.39 minutes for 14 & 28 workers. Shouldn't the time be cut in half? DataFrame 'train' has 3,654,390

Re: How to update python code in memory

2015-09-16 Thread Davies Liu
Short answer is No. On Wed, Sep 16, 2015 at 4:06 AM, Margus Roo wrote: > Hi > > In example I submited python code to cluster: > in/spark-submit --master spark://nn1:7077 SocketListen.py > Now I discovered that I have to change something in SocketListen.py. > One way is stop older

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-16 Thread ekraffmiller
Also, just for completeness, matrix.csv contains: 1,2,3 4,5,6 7,8,9

Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Bryan Jeffrey
Hello. I have a streaming job that is processing data. I process a stream of events, taking actions when I see anomalous events. I also keep a count of events observed, using updateStateByKey to maintain a map of type to count. I would like to periodically (every 5 minutes) write the results of my

Re: DataFrame repartition not repartitioning

2015-09-16 Thread Silvio Fiorito
You just need to assign it to a new variable: val avroFile = sqlContext.read.format("com.databricks.spark.avro").load(inFile) val repart = avroFile.repartition(10) repart.save(outFile, "parquet") From: Steve Annessa Date: Wednesday, September 16, 2015 at 2:08 PM To:
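
Put together, a sketch of the whole round trip (inFile/outFile are placeholders; the point is that repartition() returns a new DataFrame rather than changing avroFile in place):

    import org.apache.spark.sql.SQLContext

    def avroToParquet(sqlContext: SQLContext, inFile: String, outFile: String): Unit = {
      val avroFile = sqlContext.read.format("com.databricks.spark.avro").load(inFile)
      val repart = avroFile.repartition(10)
      repart.write.parquet(outFile) // roughly one Parquet part-file per partition
    }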

Re: unpersist RDD from another thread

2015-09-16 Thread Paul Weiss
So in order to not incur any performance issues I should really wait for all usage of the rdd to complete before calling unpersist, correct? On Wed, Sep 16, 2015 at 4:08 PM, Tathagata Das wrote: > unpredictable. I think it will be safe (as in nothing should fail),

Re: problem with a very simple word count program

2015-09-16 Thread Alexander Krasheninnikov
Collect all your RDDs from the single files into a List, then call context.union(context.emptyRdd(), YOUR_LIST). Otherwise, with a greater number of elements to union, you will get a stack overflow exception. On Wed, Sep 16, 2015 at 10:17 PM, Shawn Carroll wrote: > Your loop
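
A sketch of that single-union approach in Scala (file names are hypothetical); one sc.union over the whole list keeps the lineage shallow compared with chaining union() in a loop:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("UnionMany"))
    val files = (1 to 28).map(i => f"medline15n00$i%02d.xml") // hypothetical names
    val perFile: Seq[RDD[String]] = files.map(sc.textFile(_))
    val all: RDD[String] = sc.union(perFile) // one union over the whole list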

Re: Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Ted Yu
bq. and check if 5 minutes have passed What if the duration for the window is longer than 5 minutes ? Cheers On Wed, Sep 16, 2015 at 1:25 PM, Adrian Tanase wrote: > If you don't need the counts in betweem the DB writes, you could simply > use a 5 min window for the

Re: unpersist RDD from another thread

2015-09-16 Thread Tathagata Das
Yes. On Wed, Sep 16, 2015 at 1:12 PM, Paul Weiss wrote: > So in order to not incur any performance issues I should really wait for > all usage of the rdd to complete before calling unpersist, correct? > > On Wed, Sep 16, 2015 at 4:08 PM, Tathagata Das < >

Issue with writing Dataframe to Vertica through JDBC

2015-09-16 Thread Divya Ravichandran
> I get the following stack trace with this issue, anybody has any clue? I am running spark on yarn in cluster mode. > > > > > > 15/09/16 16:30:28 INFO spark.SparkContext: Starting job: jdbc at AssetMetadataToVertica.java:114 > > 15/09/16 16:30:28 INFO scheduler.DAGScheduler: Got job 0 (jdbc at

Re: Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Adrian Tanase
If you don't need the counts in between the DB writes, you could simply use a 5 min window for the updateStateByKey and use foreachRDD on the resulting DStream. Even simpler, you could use reduceByKeyAndWindow directly. Lastly, you could keep a variable on the driver and check if 5 minutes have
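
A sketch of the reduceByKeyAndWindow variant (source, sink and intervals are stand-ins): count per key over a 5-minute window that also slides every 5 minutes, then write each windowed batch out in foreachRDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("PeriodicCounts"), Seconds(10))
    val events = ssc.socketTextStream("localhost", 9999)   // stand-in for the real event source

    val counts = events.map(eventType => (eventType, 1L))
      .reduceByKeyAndWindow(_ + _, Minutes(5), Minutes(5)) // window = slide = 5 minutes

    counts.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        part.foreach { case (eventType, count) =>
          println(s"$eventType -> $count") // stand-in for the DB write
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()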

Re: Spark streaming on spark-standalone/ yarn inside Spring XD

2015-09-16 Thread Tathagata Das
I am not at all familiar with how Spring XD works, so it's hard to say. On Wed, Sep 16, 2015 at 12:12 PM, Vignesh Radhakrishnan < vignes...@altiux.com> wrote: > Yes, it is TD. I'm able to run word count etc on spark standalone/ yarn > when it's not integrated with spring xd. > But the same breaks when

Re: Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Adrian Tanase
The window can be larger; the batch/slide interval has to be smaller (assuming every 5-10 secs?). You have a separate parameter on most default functions and you can override it as long as it's a multiple of the streaming context batch interval. Sent from my iPhone On 16 Sep 2015, at 23:30, Ted Yu

parquet error

2015-09-16 Thread Chengi Liu
Hi, I have a spark cluster setup and I am trying to write the data to s3 but in parquet format. Here is what I am doing df = sqlContext.load('test', 'com.databricks.spark.avro') df.saveAsParquetFile("s3n://test") But I get some nasty error: Py4JJavaError: An error occurred while calling

Stopping criteria for gradient descent

2015-09-16 Thread Nishanth P S
Hi, I am running LogisticRegressionWithSGD in Spark 1.4.1 and it always takes 100 iterations to train (which is the default). It never meets the convergence criteria. Shouldn't the convergence criterion for SGD be based on the difference in log loss, or the difference in accuracy on a held-out test set

spark-submit chronos issue

2015-09-16 Thread Saurabh Malviya (samalviy)
Hi, I am facing a strange issue while using Chronos: the job is not able to find the main class when invoking spark-submit through Chronos. The issue I identified is a "colon" in the task name. Env: Chronos-scheduled job on Mesos

Re: RE: spark sql hook

2015-09-16 Thread r7raul1...@163.com
Example: select * from test.table changed to select * from production.table r7raul1...@163.com From: Cheng, Hao Date: 2015-09-17 11:05 To: r7raul1...@163.com; user Subject: RE: spark sql hook Catalyst TreeNode is very fundamental API, not sure what kind of hook you need. Any concrete

Table is modified by DataFrameWriter

2015-09-16 Thread guoqing0...@yahoo.com.hk
Hi all, I found that the table structure was modified when using DataFrameWriter.jdbc to save the content of a DataFrame: sqlContext.sql("select '2015-09-17',count(1) from test").write.jdbc(url,test,properties) Table structure before saving: app_key text t_amount bigint(20) After saving: _c0 text _c1

Lost tasks in Spark SQL join jobs

2015-09-16 Thread Gang Bai
Hi all, I'm joining two tables on a specific attribute. The job is like `sqlContext.sql("SELECT * FROM tableA LEFT JOIN tableB on tableA.uuid=tableB.uuid")`, where tableA and tableB are two temp tables, both of which are around 100 GB and are not skewed on 'uuid'. As I run the

Re: application failed on large dataset

2015-09-16 Thread 周千昊
Indeed, the operation in this stage is quite memory-consuming. We are trying to enable the PrintGCDetails option and see what is going on. java8964 wrote on Wednesday, 16 September 2015 at 11:47 PM: > This sounds like a memory issue. > > Do you enable the GC output? When this is happening, are your

spark sql hook

2015-09-16 Thread r7raul1...@163.com
I want to modify some SQL tree nodes before execution. I can do this with a Hive hook in Hive. Does Spark support such a hook? Any advice? r7raul1...@163.com

Re: Re: Table is modified by DataFrameWriter

2015-09-16 Thread Josh Rosen
What are your JDBC properties configured to? Do you have overwrite mode enabled? On Wed, Sep 16, 2015 at 7:39 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Spark-1.4.1 > > > *From:* Ted Yu > *Date:* 2015-09-17 10:29 > *To:* guoqing0...@yahoo.com.hk >

Re: Iterating over JavaRDD

2015-09-16 Thread Ted Yu
How about using this method: * Return a new RDD by applying a function to all elements of this RDD. */ def mapToDouble[R](f: DoubleFunction[T]): JavaDoubleRDD = { new JavaDoubleRDD(rdd.map(x => f.call(x).doubleValue())) On Wed, Sep 16, 2015 at 8:30 PM, Tapan Sharma

RE: RE: spark sql hook

2015-09-16 Thread Cheng, Hao
Probably a workable solution is to create your own SQLContext by extending the class HiveContext, override the `analyzer`, and add your own rule to do the hacking. From: r7raul1...@163.com [mailto:r7raul1...@163.com] Sent: Thursday, September 17, 2015 11:08 AM To: Cheng, Hao; user Subject:

Re: RE: spark sql hook

2015-09-16 Thread r7raul1...@163.com
Thank you r7raul1...@163.com From: Cheng, Hao Date: 2015-09-17 12:32 To: r7raul1...@163.com; user Subject: RE: RE: spark sql hook Probably a workable solution is, create your own SQLContext by extending the class HiveContext, and override the `analyzer`, and add your own rule to do the

Re: Table is modified by DataFrameWriter

2015-09-16 Thread Ted Yu
Can you tell us which release you were using ? Thanks > On Sep 16, 2015, at 7:11 PM, "guoqing0...@yahoo.com.hk" > wrote: > > Hi all, > I found the table structure was modified when use DataFrameWriter.jdbc to > save the content of DataFrame , > >

RE: spark sql hook

2015-09-16 Thread Cheng, Hao
Catalyst TreeNode is a very fundamental API; I'm not sure what kind of hook you need. A concrete example would be more helpful to understand your requirement. Hao From: r7raul1...@163.com [mailto:r7raul1...@163.com] Sent: Thursday, September 17, 2015 10:54 AM To: user Subject: spark sql hook I

Iterating over JavaRDD

2015-09-16 Thread Tapan Sharma
Hi All, I am a newbie. I want to achieve the following scenario: I would like to iterate over a JavaRDD of a complex type (a user-defined class, say X), calculate the sum of the integers stored in X, and return it. Which method of JavaRDD do I need to call? Thanks

Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question: if I load the DStream from the database every time I start the job, will the data be loaded when the job fails and auto-restarts? If so, both the checkpoint data and the database data are loaded; won't this be a problem? Bin Wang wrote on Wednesday, 16 September 2015 at 8:40 PM:

Re: Re: Table is modified by DataFrameWriter

2015-09-16 Thread guoqing0...@yahoo.com.hk
I tried SaveMode.Append and SaveMode.Overwrite; the output table was modified. Are the _c0 and _c1 automatically generated for the DataFrame schema? In my scenario, I hope it just flushes the data from the DataFrame to the RDBMS if both have the same structure, but I found the column
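
One possible workaround, sketched under the assumption that explicit aliases plus Append mode are what is wanted here (URL, table and column names are placeholders taken from the example above, not verified against the poster's setup): aliasing the expressions keeps the generated _c0/_c1 names out of the written schema.

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val url = "jdbc:mysql://host:3306/db"   // placeholder connection URL
    val props = new Properties()            // JDBC user/password etc. go here
    // Aliases give the DataFrame the intended column names.
    val df = sqlContext.sql("select '2015-09-17' as app_key, count(1) as t_amount from test")
    // Append inserts into the existing table instead of dropping and recreating it.
    df.write.mode(SaveMode.Append).jdbc(url, "test", props)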

Re: Re: Table is modified by DataFrameWriter

2015-09-16 Thread guoqing0...@yahoo.com.hk
Spark-1.4.1 From: Ted Yu Date: 2015-09-17 10:29 To: guoqing0...@yahoo.com.hk CC: user Subject: Re: Table is modified by DataFrameWriter Can you tell us which release you were using ? Thanks On Sep 16, 2015, at 7:11 PM, "guoqing0...@yahoo.com.hk" wrote: Hi all, I

Re: Spark Thrift Server JDBC Drivers

2015-09-16 Thread Dan LaBar
I'm running Spark in EMR, and using the JDBC driver provided by AWS. I don't know if it will work outside of EMR, but it's worth a try. I've also used the ODBC driver from Hortonworks