Re: Spark Streaming input data source list

2015-03-09 Thread Tathagata Das
Spark Streaming has StreamingContext.socketStream() http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String, int, scala.Function1, org.apache.spark.storage.StorageLevel, scala.reflect.ClassTag) TD On Mon, Mar 9, 2015 at 11:37 AM,

Re: Spark Streaming input data source list

2015-03-09 Thread Tathagata Das
data from RDBMs, for the details you can refer to the docs. Thanks Jerry *From:* Cui Lin [mailto:cui@hds.com] *Sent:* Tuesday, March 10, 2015 8:36 AM *To:* Tathagata Das *Cc:* user@spark.apache.org *Subject:* Re: Spark Streaming input data source list Tathagata, Thanks

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/ On Thu, Mar 12, 2015 at 5:09 PM, Tathagata Das t...@databricks.com wrote: Well the answers you got there are correct as well. Unfortunately I am not familiar with Neo4j enough to comment any more

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
I am not sure if you realized, but the code snippet is pretty mangled up in the email we received. It might be a good idea to put the code in pastebin or gist, much much easier for everyone to read. On Thu, Mar 12, 2015 at 12:48 AM, d34th4ck3r gautam1...@gmail.com wrote: I'm trying to use Neo4j

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
at 12:58 AM, Gautam Bajaj gautam1...@gmail.com wrote: Alright, I have also asked this question in StackOverflow: http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark The code there is pretty neat. On Thu, Mar 12, 2015 at 4:55 PM, Tathagata Das t...@databricks.com wrote

Re: Writing to a single file from multiple executors

2015-03-11 Thread Tathagata Das
Why do you have to write a single file? On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, I have a scenario wherein I want to write to an avro file from a streaming job that reads data from kafka. But the issue is, as there are multiple executors

Re: Spark Streaming - Duration 1s not matching reality

2015-03-05 Thread Tathagata Das
Hint: Print() just gives a sample of what is in the data, and does not enforce the processing on all the data (only the first partition of the rdd is computed to get 10 items). Count() actually processes all the data. This is all due to lazy eval, if you don't need to use all the data, don't

Re: Spark Streaming Switchover Time

2015-03-06 Thread Tathagata Das
://www.cisco.com/web/siteassets/legal/privacy.html* *From:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Tuesday, March 03, 2015 11:11 PM *To:* Nastooh Avessta (navesta) *Cc:* user@spark.apache.org *Subject:* Re: Spark Streaming Switchover Time I am confused. Are you killing the 1st worker

RE: Handling worker batch processing during driver shutdown

2015-03-13 Thread Tathagata Das
. I really appreciate your help, but it looks like I’m back to the drawing board on this one. *From:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Thursday, March 12, 2015 7:53 PM *To:* Jose Fernandez *Cc:* user@spark.apache.org *Subject:* Re: Handling worker batch processing during

Re: Unable to connect

2015-03-13 Thread Tathagata Das
Are you running the driver program (that is, your application process) on your desktop and trying to run it on the cluster in EC2? It could very well be a hostname mismatch in some way due to all the public hostname, private hostname, private ip, firewall craziness of ec2. You have to probably

Re: Using rdd methods with Dstream

2015-03-13 Thread Tathagata Das
Is the number of top K elements you want to keep small? That is, is K small? In which case, you can 1. either do it in the driver on the array DStream.foreachRDD ( rdd => { val topK = rdd.top(K) ; // use top K }) 2. Or, you can use the topK to create another RDD using sc.makeRDD
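A minimal Scala sketch of option 1 in this reply (computing the top K inside foreachRDD and handling only a small array on the driver); the socket source, the word-count pipeline and K = 10 are illustrative assumptions, not part of the thread.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TopKInDriver {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("TopKInDriver").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Hypothetical input: words arriving over a socket.
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

        val k = 10
        words.foreachRDD { rdd =>
          // Count per word on the executors, then bring back only the top K.
          val counts = rdd.map(word => (word, 1L)).reduceByKey(_ + _)
          val topK = counts.top(k)(Ordering.by(_._2))
          topK.foreach(println) // use the top K array here
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }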

Re: Partitioning

2015-03-13 Thread Tathagata Das
If you want to access the keys in an RDD that is partitioned by key, then you can use RDD.mapPartitions(), which gives you access to the whole partition as an iterator of (key, value) pairs. You have the option of maintaining the partitioning information or not by setting the preservesPartitioning flag in
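A short Scala sketch of the RDD.mapPartitions() usage described here, assuming an RDD already partitioned by key with a HashPartitioner; the data and partition count are made up.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object MapPartitionsByKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("MapPartitionsByKey").setMaster("local[2]"))

        // An RDD partitioned by key, as in the question.
        val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .partitionBy(new HashPartitioner(4))

        // mapPartitions hands over each whole partition as an Iterator[(key, value)].
        // preservesPartitioning = true keeps the existing partitioner, which is
        // safe here because the keys are left unchanged.
        val result = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 10) },
          preservesPartitioning = true)

        result.collect().foreach(println)
        sc.stop()
      }
    }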

Re: Partitioning

2015-03-13 Thread Tathagata Das
, Tathagata Das t...@databricks.com wrote: If you want to access the keys in an RDD that is partitioned by key, then you can use RDD.mapPartitions(), which gives you access to the whole partition as an iterator of (key, value) pairs. You have the option of maintaining the partitioning information or not by setting

Re: Compilation error

2015-03-10 Thread Tathagata Das
If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and include them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar

Re: Compilation error

2015-03-10 Thread Tathagata Das
You have to include Scala libraries in the Eclipse dependencies. TD On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation

Re: Why spark master consumes 100% CPU when we kill a spark streaming app?

2015-03-10 Thread Tathagata Das
Do you have event logging enabled? That could be the problem. The Master tries to aggressively recreate the web ui of the completed job with the event logs (when it is enabled) causing the Master to stall. I created a JIRA for this. https://issues.apache.org/jira/browse/SPARK-6270 On Tue, Mar 10,

Re: Compilation error

2015-03-10 Thread Tathagata Das
<artifactId>spark-streaming_2.10</artifactId> <version>1.2.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.2.1</version> </dependency> </dependencies> On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
. As for each partition, I'd need to restart the server, that was the basic reason I was creating graphDb object outside this loop. On Fri, Mar 13, 2015 at 5:34 AM, Tathagata Das t...@databricks.com wrote: (Putting user@spark back in the to list) In the gist, you are creating graphDB object way

Re: Handling worker batch processing during driver shutdown

2015-03-12 Thread Tathagata Das
using Spark 1.2 on CDH 5.3. I stop the application with yarn application -kill appID. *From:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Thursday, March 12, 2015 1:29 PM *To:* Jose Fernandez *Cc:* user@spark.apache.org *Subject:* Re: Handling worker batch processing during driver

Re: Writing to a single file from multiple executors

2015-03-12 Thread Tathagata Das
the AVRO data to multiple files. Thanks, Sam On Mar 12, 2015, at 4:09 AM, Tathagata Das t...@databricks.com wrote: Why do you have to write a single file? On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, I have a scenario, where in I want to write

Re: Handling worker batch processing during driver shutdown

2015-03-12 Thread Tathagata Das
Can you access the batcher directly? Like, is there a handle to get access to the batchers on the executors by running a task on that executor? If so, after the streamingContext has been stopped (not the SparkContext), then you can use `sc.makeRDD()` to run a dummy task like this.
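A rough Scala sketch of that dummy-task idea; Batcher is a stand-in for whatever executor-local object the application keeps (it is not a Spark API), and the executor count is an assumption.

    import org.apache.spark.{SparkConf, SparkContext}

    // Stand-in for the application's executor-local batching object.
    object Batcher {
      def flush(): Unit = println("flushing executor-local batcher")
    }

    object FlushOnShutdown {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("FlushOnShutdown").setMaster("local[2]"))

        // ... streamingContext.stop(stopSparkContext = false) would have been
        // called before this point, leaving the SparkContext usable ...

        // Run a dummy task per executor slot to reach executor-local state.
        // Note: Spark does not guarantee one task per executor, so in practice
        // more partitions than executors may be needed.
        val numExecutors = 2 // assumption
        sc.makeRDD(1 to numExecutors, numExecutors).foreachPartition { _ =>
          Batcher.flush()
        }

        sc.stop()
      }
    }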

Re: Efficient Top count in each window

2015-03-12 Thread Tathagata Das
Why are you repartitioning to 1 partition? That would obviously be slow; you are converting a distributed operation into a single node operation. Also consider using RDD.top(). If you define the ordering right (based on the count), then you will get the top K across them without doing a shuffle for sortByKey. Much

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Tathagata Das
Can you show us the code that you are using? This might help. This is the updated streaming programming guide for 1.3, soon to be up, this is a quick preview. http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations TD On Wed, Mar 11,

Re: Race Condition in Streaming Thread

2015-02-27 Thread Tathagata Das
Are you sure the multiple invocations are not from previous runs of the program? TD On Fri, Feb 27, 2015 at 12:16 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi Under Spark 1.0.0, standalone, client mode am trying to invoke a 3rd party udp traffic generator, from the streaming

Re: Race Condition in Streaming Thread

2015-02-27 Thread Tathagata Das
, ON, Canada, M5J 2T3. Phone: 416-306-7000; Fax: 416-306-7099. *Preferences http://www.cisco.com/offer/subscribe/?sid=000478326 - Unsubscribe http://www.cisco.com/offer/unsubscribe/?sid=000478327 – Privacy http://www.cisco.com/web/siteassets/legal/privacy.html* *From:* Tathagata Das [mailto:t

Re: Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Tathagata Das
You can use DStream.transform() to do any arbitrary RDD transformations on the RDDs generated by a DStream. val coalescedDStream = myDStream.transform { _.coalesce(...) } On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa saiph.ka...@gmail.com wrote: Sorry I made a mistake in my code. Please ignore

Re: Spark Streaming Switchover Time

2015-03-03 Thread Tathagata Das
Can you elaborate on what is this switchover time? TD On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi On a standalone, Spark 1.0.0, with 1 master and 2 workers, operating in client mode, running a udp streaming application, I am noting around 2

Re: Spark Streaming Switchover Time

2015-03-03 Thread Tathagata Das
- Unsubscribe http://www.cisco.com/offer/unsubscribe/?sid=000478327 – Privacy http://www.cisco.com/web/siteassets/legal/privacy.html* *From:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Tuesday, March 03, 2015 10:24 PM *To:* Nastooh Avessta (navesta) *Cc:* user@spark.apache.org *Subject:* Re

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-04 Thread Tathagata Das
That could be a corner case bug. How do you add the 3rd party library to the class path of the driver? Through spark-submit? Could you give the command you used? TD On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote: I've also tried the following: Configuration

Re: Is FileInputDStream returned by fileStream method a reliable receiver?

2015-03-04 Thread Tathagata Das
The file stream does not use a receiver. Maybe that was not clear in the programming guide. I am updating it for the 1.3 release right now, and I will make it more clear. And the file stream has full reliability. Read this in the programming guide.

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tathagata Das
There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing, which can be enabled only by calling sparkContext.setCheckpointDir. But that does not enable Spark Streaming driver checkpoints, which are necessary for recovering from driver failures. That is

Re: Query data in Spark RRD

2015-02-23 Thread Tathagata Das
, I'm thinking on that line. The problem is how to send the query to the backend? Bundle a http server into a spark streaming job, that will accept the parameters? -- Nikhil Bafna On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com wrote: You will have to build a split

Re: Any sample code for Kafka consumer

2015-02-22 Thread Tathagata Das
Spark Streaming already directly supports Kafka http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources Is there any reason why that is not sufficient? TD On Sun, Feb 22, 2015 at 5:18 PM, mykidong mykid...@gmail.com wrote: In java, you can see this example:

Re: Periodic Broadcast in Apache Spark Streaming

2015-02-23 Thread Tathagata Das
You could do something like this. def rddTransformationUsingBroadcast(rdd: RDD[...]): RDD[...] = { val broadcastToUse = getBroadcast() // get the reference to a broadcast variable, new or existing. rdd.map { .. } // use broadcast variable }
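The inline snippet above, expanded into a compilable Scala sketch of a periodic re-broadcast; getBroadcast(), loadReferenceData() and the refresh interval are all assumptions about how such a helper could look.

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    object PeriodicBroadcast {
      @volatile private var current: Broadcast[Map[String, String]] = _
      @volatile private var lastRefresh = 0L
      private val refreshIntervalMs = 10 * 60 * 1000L // assumption: refresh every 10 minutes

      // Placeholder for however the reference data is actually loaded.
      private def loadReferenceData(): Map[String, String] = Map("key" -> "value")

      // Returns the existing broadcast variable, or creates a new one if the
      // old one is stale, as the reply suggests.
      def getBroadcast(sc: SparkContext): Broadcast[Map[String, String]] = synchronized {
        val now = System.currentTimeMillis()
        if (current == null || now - lastRefresh > refreshIntervalMs) {
          if (current != null) current.unpersist()
          current = sc.broadcast(loadReferenceData())
          lastRefresh = now
        }
        current
      }

      // The transformation from the reply: grab the broadcast, then use it in rdd.map.
      def rddTransformationUsingBroadcast(rdd: RDD[String]): RDD[String] = {
        val broadcastToUse = getBroadcast(rdd.sparkContext)
        rdd.map(record => broadcastToUse.value.getOrElse(record, record))
      }
    }

Applied to a stream, this would be used as dstream.transform(PeriodicBroadcast.rddTransformationUsingBroadcast _), so the broadcast reference is re-evaluated on every batch.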

Re: How to diagnose could not compute split errors and failed jobs?

2015-02-23 Thread Tathagata Das
Could you find the executor logs on the executor where that task was scheduled? That may provide more information on what caused the error. Also take a look at where the block in question was stored, and where the task was scheduled. You will need to enable log4j INFO level logs for this

Re: Query data in Spark RRD

2015-02-23 Thread Tathagata Das
You will have to build a split infrastructure - a front end that takes the queries from the UI and sends them to the backend, and the backend (running the Spark Streaming app) will actually run the queries on tables created in the contexts. The RPCs necessary between the frontend and backend will

Re: spark streaming window operations on a large window size

2015-02-23 Thread Tathagata Das
The default persistence level is MEMORY_AND_DISK, so the LRU policy would discard the blocks to disk, so the streaming app will not fail. However, since things will get constantly read in and out of disk as windows are processed, the performance wont be great. So it is best to have sufficient

Re: Accumulator in SparkUI for streaming

2015-02-23 Thread Tathagata Das
Unless I am unaware of some recent changes, the SparkUI shows stages and jobs, not accumulator results. And the UI is not designed to be pluggable for showing user-defined stuff. TD On Fri, Feb 20, 2015 at 12:25 AM, Tim Smith secs...@gmail.com wrote: On Spark 1.2: I am trying to capture # records

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread Tathagata Das
-24 12:58 *To:* Tathagata Das t...@databricks.com *CC:* user user@spark.apache.org; bit1129 bit1...@163.com *Subject:* Re: About FlumeUtils.createStream I see, thanks for the clarification TD. On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote: Akhil, that is incorrect. Spark

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-23 Thread Tathagata Das
In case this mystery has not been solved, DStream.print() essentially does an RDD.take(10) on each RDD, which computes only a subset of the partitions in the RDD. But collect() forces the evaluation of all the RDDs. Since you are writing to json in the mapI() function, this could be the reason. TD

Re: Write ahead Logs and checkpoint

2015-02-23 Thread Tathagata Das
Exactly, that is the reason. To avoid that, in Spark 1.3 to-be-released, we have added a new Kafka API (called direct stream) which does not use Zookeeper at all to keep track of progress, and maintains offset within Spark Streaming. That can guarantee all records being received exactly-once. Its

Re: Write ahead Logs and checkpoint

2015-02-23 Thread Tathagata Das
felixcheun...@hotmail.com wrote: Kafka 0.8.2 has built-in offset management, how would that affect direct stream in spark? Please see KAFKA-1012 --- Original Message --- From: Tathagata Das t...@databricks.com Sent: February 23, 2015 9:53 PM To: V Dineshkumar developer.dines...@gmail.com Cc

Re: throughput in the web console?

2015-02-26 Thread Tathagata Das
26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com wrote: Yes. # tuples processed in a batch = sum of all the tuples received by all the receivers. In screen shot, there was a batch with 69.9K records, and there was a batch which took 1 s 473 ms. These two batches can be the same, can

Re: throughput in the web console?

2015-02-25 Thread Tathagata Das
Yes. # tuples processed in a batch = sum of all the tuples received by all the receivers. In screen shot, there was a batch with 69.9K records, and there was a batch which took 1 s 473 ms. These two batches can be the same, can be different batches. TD On Wed, Feb 25, 2015 at 10:11 AM, Josh J

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-25 Thread Tathagata Das
Spark Streaming has a new Kafka direct stream, to be release as experimental feature with 1.3. That uses a low level consumer. Not sure if it satisfies your purpose. If you want more control, its best to create your own Receiver with the low level Kafka API. TD On Tue, Feb 24, 2015 at 12:09 AM,

Re: Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Tathagata Das
Hey Mike, I quickly looked through the example and I found a major performance issue. You are collecting the RDDs to the driver and then sending them to Mongo in a foreach. Why not do a distributed push to Mongo? WHAT YOU HAVE val mongoConnection = ... WHAT YOU SHOULD DO rdd.foreachPartition
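A hedged Scala sketch of the distributed push being recommended; DummyStoreClient stands in for a real Mongo (or other) client and is not the Reactive Mongo API; the point is only where the connection gets created.

    import org.apache.spark.rdd.RDD

    // Placeholder client; a real driver would be used here.
    class DummyStoreClient {
      def insert(doc: String): Unit = println(s"insert: $doc")
      def close(): Unit = ()
    }

    object DistributedPush {
      def push(rdd: RDD[String]): Unit = {
        // One connection per partition, created on the executor that owns the
        // partition, instead of collecting everything back to the driver.
        rdd.foreachPartition { records =>
          val client = new DummyStoreClient()
          records.foreach(client.insert)
          client.close()
        }
      }
    }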

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Tathagata Das
If it is deterministically reproducible, could you generate full DEBUG level logs from the driver and the workers and give them to me? Basically I want to trace through what is happening to the block that is not being found. And can you tell what cluster manager you are using? Spark Standalone,

Re: spark streaming driver hang

2015-03-27 Thread Tathagata Das
Do you have the logs of the driver? Does that give any exceptions? TD On Fri, Mar 27, 2015 at 12:24 PM, Chen Song chen.song...@gmail.com wrote: I ran a spark streaming job. 100 executors 30G heap per executor 4 cores per executor The version I used is 1.3.0-cdh5.1.0. The job is reading

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Tathagata Das
-spark . From console I see that spark is trying to replicate to nodes - nodes show up in Mesos active tasks ... but they always fail with ClassNotFoundE. 2015-03-27 0:52 GMT+01:00 Tathagata Das t...@databricks.com: Could you try running a simpler spark streaming program with receiver (may

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Tathagata Das
(StorageLevel.MEMORY_ONLY_2 etc). 2015-03-27 19:06 GMT+01:00 Tathagata Das t...@databricks.com: Does it fail with just Spark jobs (using storage levels) on non-coarse mode? TD On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola ondrej.sm...@gmail.com wrote: More info when using *spark.mesos.coarse

Re: foreachRDD execution

2015-03-26 Thread Tathagata Das
Yes, that is the correct understanding. There are undocumented parameters that allow that, but I do not recommend using those :) TD On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: I have a simple and probably dumb question about foreachRDD. We are

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Tathagata Das
Are you saying that even with the spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote: Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see

Re: Spark Application Stages and DAG

2015-04-03 Thread Tathagata Das
What he meant is to look it up in the Spark UI, specifically in the Stages tab, to see what is taking so long. And yes, a code snippet helps us debug. TD On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to open the Stage's page which is taking time, and see how

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Very good question! This is because the current code is written such that the UI considers a batch as waiting only when it has actually started being processed. That is, batches waiting in the job queue are not considered in the calculation. It is arguable that it may be more intuitive to count that

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
, Tathagata Das t...@databricks.com wrote: Very good question! This is because the current code is written such that the UI considers a batch as waiting only when it has actually started being processed. That is, batches waiting in the job queue are not considered in the calculation. It is arguable

Re: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-20 Thread Tathagata Das
Responses inline. On Mon, Apr 20, 2015 at 3:27 PM, Ankit Patel patel7...@hotmail.com wrote: What you said is correct and I am expecting the printlns to be in my console or my SparkUI. I do not see it in either places. Can you actually login into the machine running the executor which runs the

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread Tathagata Das
What was the state of your streaming application? Was it falling behind with a large increasing scheduling delay? TD On Thu, Apr 23, 2015 at 11:31 AM, N B nb.nos...@gmail.com wrote: Thanks for the response, Conor. I tried with those settings and for a while it seemed like it was cleaning up

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Did you check out the latest streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations You also need to be aware that to convert JSON RDDs to a DataFrame, sqlContext has to make a pass over the data to learn the schema. This will
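A Scala sketch of the extra-pass point: supplying the schema up front avoids the inference pass over each batch. The socket source and the two-field schema are assumptions; sqlContext.jsonRDD is the Spark 1.3-era API (later versions read JSON through sqlContext.read).

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamToDataFrame {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DStreamToDataFrame").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        val sqlContext = new SQLContext(ssc.sparkContext)

        // Hypothetical source of JSON strings such as {"user": "...", "event": "..."}.
        val jsonLines = ssc.socketTextStream("localhost", 9999)

        // Known schema, so no pass over the data is needed to infer it.
        val schema = StructType(Seq(
          StructField("user", StringType),
          StructField("event", StringType)))

        jsonLines.foreachRDD { rdd =>
          val df = sqlContext.jsonRDD(rdd, schema)
          df.registerTempTable("events")
          sqlContext.sql("SELECT user, COUNT(*) FROM events GROUP BY user").show()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }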

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
is being serialized must mean something. Under which circumstances would such an exception trigger? On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com wrote: Yeah, I am not sure what is going on. The only way to figure it out is to take a look at the disassembled bytecode using javap

Re: Map Question

2015-04-22 Thread Tathagata Das
Is the mylist present on every executor? If not, then you have to pass it on. And broadcasts are the best way to pass them on. But note that once broadcasted, it will be immutable at the executors, and if you update the list at the driver, you will have to broadcast it again. TD On Wed, Apr 22, 2015
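A small Scala sketch of the broadcast approach described here; the list content and the map function are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastListExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("BroadcastListExample").setMaster("local[2]"))

        // The list lives on the driver; broadcasting ships one read-only copy
        // to each executor instead of one copy per task.
        val myList = Seq("a", "b", "c")
        val myListBc = sc.broadcast(myList)

        val data = sc.parallelize(1 to 5)
        val tagged = data.map(x => (x, myListBc.value)) // executors read the broadcast value
        tagged.collect().foreach(println)

        // To pick up a changed list, build and broadcast a new variable on the driver.
        sc.stop()
      }
    }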

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Tathagata Das
Furthermore, just to explain, doing arr.par.foreach does not help because it is not really running anything, it is only doing the setup of the computation. Doing the setup in parallel does not mean that the jobs will be done concurrently. Also, from your code it seems like your pairs of dstreams dont

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
that. The puzzling thing is why removing the context bounds solve the problem... What does this exception mean in general? On Mon, Apr 20, 2015 at 1:33 PM, Tathagata Das t...@databricks.com wrote: When are you getting this exception? After starting the context? TD On Mon, Apr 20, 2015 at 10:44

Re: Map Question

2015-04-22 Thread Tathagata Das
Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Can I use broadcast vars in local mode? ᐧ On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com wrote: Yep

Re: Map Question

2015-04-22 Thread Tathagata Das
it locally...and I plan to move it to production on EC2. The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc) but I don't think it's efficient? mylist is filled only once at the start and never changes. Vadim ᐧ On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das t

Re: Spark Streaming updatyeStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give more memory. And have you tried increasing the number of partitions (specify number of partitions in updateStateByKey) ? On Wed, Apr 22,

Re: Map Question

2015-04-22 Thread Tathagata Das
: Great. Will try to modify the code. Always room to optimize! ᐧ On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das t...@databricks.com wrote: Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy vadim.bichuts

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Tathagata Das
is unexpectedly null when DStream is being serialized must mean something. Under which circumstances would such an exception trigger? On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com wrote: Yeah, I am not sure what is going on. The only way to figure it out is to take a look

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Is the function ingestToMysql running on the driver or on the executors? Accordingly you can try debugging while running in a distributed manner, with and without calling the function. If you dont get too many open files without calling ingestToMysql(), the problem is likely to be in

Re: Join between Streaming data vs Historical Data in spark

2015-04-29 Thread Tathagata Das
Have you taken a look at the join section in the streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Let say I have transaction data and

Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
I believe that the implicit def is pulling in the enclosing class (in which the def is defined) in the closure which is not serializable. On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi guys, I`m puzzled why i cant use the implicit function in

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Also cc'ing Cody. @Cody maybe there is a reason for doing connection pooling even if there is no performance difference. TD On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote: Is the function ingestToMysql running on the driver or on the executors? Accordingly you can

Re: Driver memory leak?

2015-04-29 Thread Tathagata Das
It could be related to this. https://issues.apache.org/jira/browse/SPARK-6737 This was fixed in Spark 1.3.1. On Wed, Apr 29, 2015 at 8:38 AM, Sean Owen so...@cloudera.com wrote: Not sure what you mean. It's already in CDH since 5.4 = 1.3.0 (This isn't the place to ask about CDH) I also

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
in the enclosing class ? *From:* Tathagata Das t...@databricks.com *Date:* 2015-04-30 07:00 *To:* guoqing0...@yahoo.com.hk *CC:* user user@spark.apache.org *Subject:* Re: implicit function in SparkStreaming I believe that the implicit def is pulling in the enclosing class (in which the def

Re: Receiver Fault Tolerance

2015-05-06 Thread Tathagata Das
Incorrect. The receiver runs in an executor just like any other task. In cluster mode, the driver runs in a worker; however, it launches executors in OTHER workers in the cluster. It is those executors running in other workers that run tasks, and also the receivers. On Wed, May 6, 2015 at

Re: How update counter in cassandra

2015-05-06 Thread Tathagata Das
This may help. http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: I have a counter column family in Cassandra. I want to update these counters

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
@Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote: I can confirm it does work in Java *From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] *Sent:* Tuesday, May 12, 2015 5:53 PM

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
On Tue, May 12, 2015 at 12:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: @Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote: I can confirm it does work in Java *From:* Vadim

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
On Wednesday, May 13, 2015 12:02 PM, Tathagata Das t...@databricks.com wrote: That is a good question. I dont see a direct way to do that. You could do try the following val jobGroupId = group-id-based-on-current-time rdd.sparkContext.setJobGroup(jobGroupId) val approxCount

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
. Thanks, Du On Wednesday, May 13, 2015 1:12 PM, Tathagata Das t...@databricks.com wrote: That is not supposed to happen :/ That is probably a bug. If you have the log4j logs, would be good to file a JIRA. This may be worth debugging. On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess er...@simplerelevance.com wrote: Hi, Is it possible to setup streams from multiple Kinesis streams and process them in a single job? From what I have read, this should be possible, however, the Kinesis layer

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
? it should work the same way - including union() of streams from totally different source types (kafka, kinesis, flume). On Thu, May 14, 2015 at 2:07 PM, Tathagata Das t...@databricks.com wrote: What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess er

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-16 Thread Tathagata Das
Marscher [mailto:rmarsc...@localytics.com] *Sent:* Friday, May 15, 2015 7:20 PM *To:* Evo Eftimov *Cc:* Tathagata Das; user *Subject:* Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond The doc is a bit confusing IMO, but at least for my application I had to use a fair pool

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-14 Thread Tathagata Das
It would be good if you can tell what I should add to the documentation to make it easier to understand. I can update the docs for 1.4.0 release. On Tue, May 12, 2015 at 9:52 AM, Lee McFadden splee...@gmail.com wrote: Thanks for explaining Sean and Cody, this makes sense now. I'd like to help

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Tathagata Das
Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote: Hi, all, i got following error when i run unit test of spark by dev/run-tests on the latest branch-1.4 branch. the latest commit id: commit d518c0369fa412567855980c3f0f426cde5c190d

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Tathagata Das
If you wanted to stop it gracefully, then why are you not calling ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't matter whether the shutdown hook was called or not. TD On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi,
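A minimal Scala sketch of the explicit graceful stop being suggested; the socket source and the bounded run time are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GracefulStop {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("GracefulStop").setMaster("local[2]"), Seconds(10))

        val lines = ssc.socketTextStream("localhost", 9999) // placeholder input
        lines.count().print()

        ssc.start()
        // Run for a bounded time in this sketch, then stop gracefully: batches
        // already received are processed before the contexts shut down.
        ssc.awaitTerminationOrTimeout(60 * 1000)
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }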

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-19 Thread Tathagata Das
If you want the fileStream to start only after a certain event has happened, why not start the streamingContext after that event? TD On Sun, May 17, 2015 at 7:51 PM, Haopu Wang hw...@qilinsoft.com wrote: I want to use file stream as input. And I look at SparkStreaming document again, it's

Re: how to use rdd.countApprox

2015-05-12 Thread Tathagata Das
From the code it seems that as soon as the rdd.countApprox(5000) returns, you can call pResult.initialValue() to get the approximate count at that point of time (that is after timeout). Calling pResult.getFinalValue() will further block until the job is over, and give the final correct values
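A Scala sketch of the countApprox flow described in this reply; the data, timeout and confidence values are arbitrary.

    import org.apache.spark.{SparkConf, SparkContext}

    object CountApproxExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("CountApproxExample").setMaster("local[2]"))
        val rdd = sc.parallelize(1 to 1000000, 8)

        // Returns a PartialResult after at most 5 seconds instead of blocking
        // for the exact answer.
        val pResult = rdd.countApprox(timeout = 5000, confidence = 0.9)

        // Approximate answer available once countApprox returns (the job may
        // still be running in the background).
        val approx = pResult.initialValue
        println(s"approx count: ${approx.mean} (low=${approx.low}, high=${approx.high})")

        // Blocks until the job finishes and returns the exact result.
        val exact = pResult.getFinalValue()
        println(s"final count: ${exact.mean}")

        sc.stop()
      }
    }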

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
Significant optimizations can be made by doing the joining/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, then you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That would make sure
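A Scala sketch of the partitioning trick described here: the batch RDD is partitioned once and cached, and the streaming RDDs are partitioned with the same partitioner before the join. The socket input, the (key, value) parsing and the partition count are assumptions.

    import org.apache.spark.{HashPartitioner, SparkConf}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object PartitionedJoin {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PartitionedJoin").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        val sc = ssc.sparkContext

        val partitioner = new HashPartitioner(8)

        // Batch RDD: partitioned once with a known partitioner and cached,
        // so it is not reshuffled on every batch.
        val batchRdd = sc.parallelize(Seq(("user1", "US"), ("user2", "DE")))
          .partitionBy(partitioner)
          .cache()

        // Streaming side: hypothetical "userId,event" lines from a socket.
        val events = ssc.socketTextStream("localhost", 9999)
          .map { line => val parts = line.split(","); (parts(0), parts(1)) }

        // Same partitioner on the streaming RDDs, so the join itself needs no
        // further shuffle of the cached batch RDD.
        val joined = events.transform { rdd =>
          rdd.partitionBy(partitioner).join(batchRdd)
        }
        joined.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }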

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
and RAM allocated for the result RDD The optimizations mentioned still don’t change the total number of elements included in the result RDD and RAM allocated – right? *From:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Wednesday, April 15, 2015 9:25 PM *To:* Evo Eftimov *Cc:* user

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Wednesday, April 15, 2015 9:48 PM *To:* Evo Eftimov *Cc:* user *Subject:* Re: RAM management during cogroup and join Agreed. On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov evo.efti...@isecc.com wrote: That has been done Sir

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-15 Thread Tathagata Das
asynchronously--- window1 --- window2 While writing it, I start to believe they can, because windows are time-triggered, not triggered when previous window has finished... But it's better to ask:) 2015-04-15 2:08 GMT+02:00 Tathagata Das t

Re: How to do dispatching in Streaming?

2015-04-15 Thread Tathagata Das
It may be worthwhile to architect the computation in a different way. dstream.foreachRDD { rdd => rdd.foreach { record => // do different things for each record based on filters } } TD On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I have a

Re: not found: value SQLContextSingleton

2015-04-11 Thread Tathagata Das
Have you created a class called SQLContextSingleton ? If so, is it in the compile class path? On Fri, Apr 10, 2015 at 6:47 AM, Mukund Ranjan (muranjan) muran...@cisco.com wrote: Hi All, Any idea why I am getting this error? wordsTenSeconds.foreachRDD((rdd: RDD[String], time: Time)
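For reference, a lazily-initialized singleton along the lines of the pattern in the Spark Streaming programming guide; the class is user code (not a Spark API), which is why it must be on the compile classpath of the application.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object SQLContextSingleton {
      @transient private var instance: SQLContext = _

      // Creates the SQLContext on first use and reuses it afterwards.
      def getInstance(sparkContext: SparkContext): SQLContext = synchronized {
        if (instance == null) {
          instance = new SQLContext(sparkContext)
        }
        instance
      }
    }

Inside foreachRDD it would then be obtained as SQLContextSingleton.getInstance(rdd.sparkContext).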

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Tathagata Das
Fundamentally, stream processing systems are designed for processing streams of data, not for storing large volumes of data for a long period of time. So if you have to maintain that much state for months, then its best to use another system that is designed for long term storage (like Cassandra)

Re: Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-14 Thread Tathagata Das
What version of Spark are you using? There was a known bug which could be causing this. It got fixed in Spark 1.3 TD On Mon, Apr 13, 2015 at 11:44 PM, Akhil Das ak...@sigmoidanalytics.com wrote: When you say done fetching documents, does it mean that you are stopping the streamingContext?

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Tathagata Das
Have you tried marking only spark-streaming-kinesis-asl as not provided, and the rest as provided? Then you will not even need to add kinesis-asl.jar in the spark-submit. TD On Tue, Apr 14, 2015 at 2:27 PM, Mike Trienis mike.trie...@orcsol.com wrote: Richard, You response was very helpful

Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2) } /* Union all the streams */ val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray)) unionStreams.print() ssc.start() ssc.awaitTermination() } }* On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t

Re: WordCount example

2015-04-06 Thread Tathagata Das
, Apr 6, 2015 at 9:55 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Interesting, I see 0 cores in the UI? - *Cores:* 0 Total, 0 Used On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote: What does the Spark Standalone UI at port 8080 say about number of cores

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Tathagata Das
So you want to sort based on the total count of all the records received through the receiver? In that case, you have to combine all the counts using updateStateByKey (

Re: WordCount example

2015-04-03 Thread Tathagata Das
: 3 processor : 4 processor : 5 processor : 6 processor : 7 On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote: How many cores are present in the works allocated to the standalone cluster spark://ip-10-241-251-232:7077 ? On Fri, Apr 3, 2015
