I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib, and SQL without the operational overhead of maintaining 4
different systems. As a slight cost of achieving that unification, there may be some
be added in at some point.
On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com
wrote:
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert into the queue. Note
that this is for testing only as queueStream does
How many cores are present in the workers allocated to the standalone cluster
spark://ip-10-241-251-232:7077 ?
On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
If I use local[2] instead of the URL spark://ip-10-241-251-232:7077, this
seems to work. I don't understand
Just remove "provided" for spark-streaming-kinesis-asl:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Thanks. So how do I fix it?
On Fri, Apr 3, 2015 at 3:43 PM, Kelly,
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.
TD
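For concreteness, a minimal sketch of that workaround, assuming sc is an existing SparkContext and using textFile instead of hadoopFile for brevity (path and batch interval are made up):

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Queue of manually created RDDs; queueStream pulls one RDD per batch by default.
val rddQueue = new mutable.Queue[RDD[String]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.print()

// Create RDDs yourself and push them into the queue.
// Testing only: queueStream does not work with driver fault recovery.
rddQueue += sc.textFile("hdfs:///tmp/test-input")   // hypothetical path

ssc.start()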
On Fri, Apr 3, 2015 at 12:23 PM, adamgerst
Are you saying that even with the spark.cleaner.ttl set your files are not
getting cleaned up?
TD
On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and
the worker nodes eventually run out of inodes.
We see
We should be able to support that use case in the direct API. It may be as
simple as allowing the users to pass on a function that returns the set of
topic+partitions to read from.
That is, a function (Time) => Set[TopicAndPartition]. This gets called every
batch interval before the offsets are
It's not a built-in component of Spark. However, there is a spark-package for
an Apache Camel receiver which can integrate with JMS.
http://spark-packages.org/package/synsys/spark
I have not tried it but do check it out.
TD
On Wed, Apr 1, 2015 at 4:38 AM, danila danila.erma...@gmail.com wrote:
Hi
In the current state, yes, there will be performance issues. It can be done
much more efficiently and we are working on doing that.
TD
On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar vin...@uber.com wrote:
Hi all,
As I understand from docs and talks, the streaming state is in memory as
RDD
, kafka metadata, etc) => Option[KafkaRDD]
I think it's more straightforward to give access to that additional state
via subclassing than it is to add in more callbacks for every possible use
case.
On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das t...@databricks.com
wrote:
We should be able
If it is deterministically reproducible, could you generate full DEBUG
level logs, from the driver and the workers and give it to me? Basically I
want to trace through what is happening to the block that is not being
found.
And can you tell me what cluster manager you are using? Spark Standalone,
Do you have the logs of the driver? Does that give any exceptions?
TD
On Fri, Mar 27, 2015 at 12:24 PM, Chen Song chen.song...@gmail.com wrote:
I ran a spark streaming job.
100 executors
30G heap per executor
4 cores per executor
The version I used is 1.3.0-cdh5.1.0.
The job is reading
-spark. From the console I see that Spark is trying to
replicate to nodes - nodes show up in Mesos active tasks ... but they
always fail with ClassNotFoundException.
2015-03-27 0:52 GMT+01:00 Tathagata Das t...@databricks.com:
Could you try running a simpler spark streaming program with receiver
(may
(StorageLevel.MEMORY_ONLY_2
etc).
2015-03-27 19:06 GMT+01:00 Tathagata Das t...@databricks.com:
Does it fail with just Spark jobs (using storage levels) on non-coarse
mode?
TD
On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola ondrej.sm...@gmail.com
wrote:
More info
when using *spark.mesos.coarse
Yes, that is the correct understanding. There are undocumented parameters
that allow that, but I do not recommend using those :)
TD
On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
I have a simple and probably dumb question about foreachRDD.
We are
etc. so they are consistent with other APIs?
Regards,
Madhukara Phatak
http://datamantra.io/
On Mon, Mar 16, 2015 at 11:42 PM, Tathagata Das t...@databricks.com
wrote:
It's mostly for legacy reasons. First we had added all the MappedDStream,
etc. and then later we realized we need to expose
It was not really meant to be public and overridden. Because anything you
want to do to generate jobs from RDDs can be done using DStream.foreachRDD
On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I am trying to create a simple subclass of DStream. If I
It's mostly for legacy reasons. First we had added all the MappedDStream,
etc. and then later we realized we need to expose something that is more
generic for arbitrary RDD-RDD transformations. It can be easily replaced.
However, there is a slight value in having MappedDStream, for developers to
If you are creating an assembly, make sure spark-streaming is marked as
provided. spark-streaming is already part of the Spark installation so it will
be present at run time. That might solve some of these, maybe!?
TD
On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com
wrote:
-kinesis-asl.)
Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2
From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: problems with spark
.
I really appreciate your help, but it looks like I’m back to the drawing
board on this one.
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Thursday, March 12, 2015 7:53 PM
*To:* Jose Fernandez
*Cc:* user@spark.apache.org
*Subject:* Re: Handling worker batch processing during
Are you running the driver program (that is your application process) in
your desktop and trying to run it on the cluster in EC2?
It could very well be a hostname mismatch in some way due to all the
public hostname, private hostname, private IP, firewall craziness of EC2.
You have to probably
Is the number of top K elements you want to keep small? That is, is K
small? In which case, you can
1. either do it in the driver on the array
dstream.foreachRDD { rdd =>
  val topK = rdd.top(K)
  // use top K
}
2. Or, you can use the topK to create another RDD using sc.makeRDD
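A rough sketch of both options, assuming K and an ordering for the element type are defined elsewhere:

dstream.foreachRDD { rdd =>
  // Option 1: bring the top K elements to the driver as a small local array.
  val topK = rdd.top(K)   // requires an implicit Ordering for the element type
  // Option 2: turn that small array back into an RDD for further distributed work.
  val topKRDD = rdd.sparkContext.makeRDD(topK)
}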
If you want to access the keys in an RDD that is partitioned by key, then you
can use RDD.mapPartitions(), which gives you access to the whole partition
as an iterator of (key, value) pairs. You have the option of maintaining the
partitioning information or not by setting the preservePartitioning flag in
, Tathagata Das t...@databricks.com
wrote:
If you want to access the keys in an RDD that is partitioned by key, then
you can use RDD.mapPartitions(), which gives you access to the whole
partition as an iterator of (key, value) pairs. You have the option of
maintaining the partitioning information or not by setting
-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/
On Thu, Mar 12, 2015 at 5:09 PM, Tathagata Das t...@databricks.com
wrote:
Well the answers you got there are correct as well.
Unfortunately I am not familiar with Neo4j enough to comment any more
I am not sure if you realized, but the code snippet is pretty mangled up in
the email we received. It might be a good idea to put the code in pastebin
or gist; that is much, much easier for everyone to read.
On Thu, Mar 12, 2015 at 12:48 AM, d34th4ck3r gautam1...@gmail.com wrote:
I'm trying to use Neo4j
at 12:58 AM, Gautam Bajaj gautam1...@gmail.com wrote:
Alright, I have also asked this question in StackOverflow:
http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark
The code there is pretty neat.
On Thu, Mar 12, 2015 at 4:55 PM, Tathagata Das t...@databricks.com
wrote
. As for each partition, I'd need
to restart the server, that was the basic reason I was creating graphDb
object outside this loop.
On Fri, Mar 13, 2015 at 5:34 AM, Tathagata Das t...@databricks.com
wrote:
(Putting user@spark back in the to list)
In the gist, you are creating graphDB object way
using Spark
1.2 on CDH 5.3. I stop the application with yarn application -kill appID.
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Thursday, March 12, 2015 1:29 PM
*To:* Jose Fernandez
*Cc:* user@spark.apache.org
*Subject:* Re: Handling worker batch processing during driver
the AVRO data to multiple files.
Thanks,
Sam
On Mar 12, 2015, at 4:09 AM, Tathagata Das t...@databricks.com wrote:
Why do you have to write a single file?
On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi Experts,
I have a scenario, where in I want to write
Can you access the batcher directly? That is, is there a handle to get
access to the batchers on the executors by running a task on that executor?
If so, after the streamingContext has been stopped (not the SparkContext),
then you can use `sc.makeRDD()` to run a dummy task like this.
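A sketch of such a dummy task, assuming a hypothetical executor-local singleton exposes the batcher (host name and singleton are made up):

// Pin a one-element task to a specific executor host via a locality preference
// (a preference, not a guarantee), so the closure runs on that executor.
val targetHost = "worker-host-1"                      // hypothetical host
sc.makeRDD(Seq(1 -> Seq(targetHost))).foreach { _ =>
  BatcherRegistry.get().flushAll()                    // hypothetical handle to the local batcher
}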
Why are you repartitioning to 1 partition? That would obviously be slow; you are
converting a distributed operation into a single-node operation.
Also consider using RDD.top(). If you define the ordering right (based on
the count), then you will get the top K across them without doing a shuffle for
sortByKey. Much
Why do you have to write a single file?
On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi Experts,
I have a scenario wherein I want to write to an Avro file from a streaming
job that reads data from Kafka.
But the issue is, as there are multiple executors
Can you show us the code that you are using?
This might help. This is the updated streaming programming guide for 1.3,
soon to be up, this is a quick preview.
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
TD
On Wed, Mar 11,
If you are using tools like SBT/Maven/Gradle/etc, they figure out all the
recursive dependencies and includes them in the class path. I haven't
touched Eclipse in years so I am not sure off the top of my head what's
going on instead. Just in case you only downloaded the
spark-streaming_2.10.jar
You have to include Scala libraries in the Eclipse dependencies.
TD
On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying out streaming example as documented and I am using spark 1.2.1
streaming from maven for Java.
When I add this code I get compilation
Do you have event logging enabled?
That could be the problem. The Master tries to aggressively recreate the
web ui of the completed job with the event logs (when it is enabled)
causing the Master to stall.
I created a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-6270
On Tue, Mar 10,
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com
wrote:
If you
Spark Streaming has StreamingContext.socketStream()
http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String,
int, scala.Function1, org.apache.spark.storage.StorageLevel,
scala.reflect.ClassTag)
TD
On Mon, Mar 9, 2015 at 11:37 AM,
data from RDBMs, for the details you can
refer to the docs.
Thanks
Jerry
*From:* Cui Lin [mailto:cui@hds.com]
*Sent:* Tuesday, March 10, 2015 8:36 AM
*To:* Tathagata Das
*Cc:* user@spark.apache.org
*Subject:* Re: Spark Streaming input data source list
Tathagata,
Thanks
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Tuesday, March 03, 2015 11:11 PM
*To:* Nastooh Avessta (navesta)
*Cc:* user@spark.apache.org
*Subject:* Re: Spark Streaming Switchover Time
I am confused. Are you killing the 1st worker
Hint: print() just gives a sample of what is in the data, and does not
enforce the processing of all the data (only the first partition of the RDD
is computed to get 10 items). count() actually processes all the data. This
is all due to lazy evaluation; if you don't need to use all the data, don't
That could be a corner case bug. How do you add the 3rd party library to
the class path of the driver? Through spark-submit? Could you give the
command you used?
TD
On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote:
I've also tried the following:
Configuration
The file stream does not use a receiver. Maybe that was not clear in the
programming guide. I am updating it for the 1.3 release right now; I will make
it more clear.
And file stream has full reliability. Read this in the programming guide.
You can use DStream.transform() to do any arbitrary RDD transformations on
the RDDs generated by a DStream.
val coalescedDStream = myDStream.transform { _.coalesce(...) }
On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Sorry I made a mistake in my code. Please ignore
Can you elaborate on what is this switchover time?
TD
On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com
wrote:
Hi
On a standalone, Spark 1.0.0, with 1 master and 2 workers, operating in
client mode, running a udp streaming application, I am noting around 2
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Tuesday, March 03, 2015 10:24 PM
*To:* Nastooh Avessta (navesta)
*Cc:* user@spark.apache.org
*Subject:* Re
Are you sure the multiple invocations are not from previous runs of the
program?
TD
On Fri, Feb 27, 2015 at 12:16 PM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
Hi
Under Spark 1.0.0, standalone, client mode, I am trying to invoke a 3rd-party
UDP traffic generator from the streaming
*From:* Tathagata Das [mailto:t
26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
wrote:
Yes. # tuples processed in a batch = sum of all the tuples received by
all the receivers.
In screen shot, there was a batch with 69.9K records, and there was a
batch which took 1 s 473 ms. These two batches can be the same, can
Hey Mike,
I quickly looked through the example and I found a major performance issue.
You are collecting the RDDs to the driver and then sending them to Mongo in
a foreach. Why not do a distributed push to Mongo?
WHAT YOU HAVE:
val mongoConnection = ...
WHAT YOU SHOULD DO:
rdd.foreachPartition
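In full, the suggested shape would be roughly the following (the Mongo connection helpers are placeholders, not a specific driver API):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Create the connection on the executor, once per partition, not on the driver.
    val mongoConnection = createMongoConnection()                        // hypothetical helper
    records.foreach(record => insertIntoMongo(mongoConnection, record))  // hypothetical helper
    mongoConnection.close()
  }
}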
Yes. # tuples processed in a batch = sum of all the tuples received by all
the receivers.
In screen shot, there was a batch with 69.9K records, and there was a batch
which took 1 s 473 ms. These two batches can be the same, can be different
batches.
TD
On Wed, Feb 25, 2015 at 10:11 AM, Josh J
Spark Streaming has a new Kafka direct stream, to be released as an
experimental feature with 1.3. That uses a low-level consumer. Not sure if
it satisfies your purpose.
If you want more control, it's best to create your own Receiver with the
low-level Kafka API.
TD
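For reference, a sketch of how that direct stream is created with the 1.3 API, assuming an existing StreamingContext ssc (broker list and topic name are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
val topics = Set("events")                                        // hypothetical topic
val directStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)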
On Tue, Feb 24, 2015 at 12:09 AM,
There are different kinds of checkpointing going on. updateStateByKey
requires RDD checkpointing, which can be enabled only by calling
sparkContext.setCheckpointDir(). But that does not enable Spark
Streaming driver checkpoints, which are necessary for recovering from driver
failures. That is
, I'm thinking along that line.
The problem is how to send the query to the backend. Bundle an HTTP
server into a Spark Streaming job that will accept the parameters?
--
Nikhil Bafna
On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com
wrote:
You will have to build a split
You could do something like this.
def rddTransformationUsingBroadcast(rdd: RDD[...]): RDD[...] = {
  val broadcastToUse = getBroadcast()  // get the reference to a broadcast variable, new or existing
  rdd.map { ... }                      // use broadcast variable
}
Could you find the executor logs on the executor where that task was
scheduled? That may provide more information on what caused the error.
Also take a look at where the block in question was stored, and where the
task was scheduled.
You will need to enable log4j INFO level logs for this
You will have to build a split infrastructure - a front end that takes the
queries from the UI and sends them to the backend, and the backend (running
the Spark Streaming app) will actually run the queries on tables created in
the contexts. The RPCs necessary between the frontend and backend will
The default persistence level is MEMORY_AND_DISK, so the LRU policy would
spill the blocks to disk, so the streaming app will not fail. However,
since things will get constantly read in and out of disk as windows are
processed, the performance won't be great. So it is best to have sufficient
Unless I am unaware of some recent changes, the Spark UI shows stages and
jobs, not accumulator results. And the UI is not designed to be pluggable for
showing user-defined stuff.
TD
On Fri, Feb 20, 2015 at 12:25 AM, Tim Smith secs...@gmail.com wrote:
On Spark 1.2:
I am trying to capture # records
-24 12:58
*To:* Tathagata Das t...@databricks.com
*CC:* user user@spark.apache.org; bit1129 bit1...@163.com
*Subject:* Re: About FlumeUtils.createStream
I see, thanks for the clarification TD.
On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote:
Akhil, that is incorrect.
Spark
In case this mystery has not been solved, DStream.print() essentially does
an RDD.take(10) on each RDD, which computes only a subset of the partitions
in the RDD. But collect() forces the evaluation of all the RDDs. Since you
are writing to JSON in the map() function, this could be the reason.
TD
Exactly, that is the reason.
To avoid that, in Spark 1.3 to-be-released, we have added a new Kafka API
(called direct stream) which does not use Zookeeper at all to keep track of
progress, and maintains offset within Spark Streaming. That can guarantee
all records being received exactly-once. Its
felixcheun...@hotmail.com wrote:
Kafka 0.8.2 has built-in offset management, how would that affect direct
stream in spark?
Please see KAFKA-1012
--- Original Message ---
From: Tathagata Das t...@databricks.com
Sent: February 23, 2015 9:53 PM
To: V Dineshkumar developer.dines...@gmail.com
Cc
Spark Streaming already directly supports Kafka
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
Is there any reason why that is not sufficient?
TD
On Sun, Feb 22, 2015 at 5:18 PM, mykidong mykid...@gmail.com wrote:
In java, you can see this example:
What version of Spark are you using?
TD
On Thu, Feb 19, 2015 at 2:45 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
We have a Spark Streaming application that watches an input directory, and
as files are copied there the application reads them and sends the contents
to a RESTful web
Correct, brute force clean up is not useful. Since Spark 1.0, Spark can do
automatic cleanup of files based on which RDDs are used/garbage collected
by JVM. That would be the best way, but depends on the JVM GC
characteristics. If you force a GC periodically in the driver that might
help you get
You cannot have two SparkContexts active in the same JVM at the same time.
Just create one SparkContext and then use it for both purposes.
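A minimal sketch of sharing one context with the Spark 1.x APIs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("shared-context"))
// Reuse the single SparkContext for both streaming and SQL.
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)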
TD
On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN
johnfedrickena...@gmail.com wrote:
Can you try creating just a single spark context and then try
Here is an example of how you can do this. Let's say myDStream contains the
data that you may want to asynchronously query, say using Spark SQL.
val sqlContext = new SQLContext(streamingContext.sparkContext)
myDStream.foreachRDD { rdd => // rdd is an RDD of a case class
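The snippet is cut off above; a hedged completion of the idea, using the Spark 1.3 DataFrame API with a made-up case class and table name, could look like:

case class Record(word: String, count: Int)            // hypothetical case class
import sqlContext.implicits._

myDStream.foreachRDD { rdd =>
  // rdd assumed to be an RDD[Record]; register each batch as a temp table so
  // it can be queried asynchronously (e.g. from another thread) with Spark SQL.
  rdd.toDF().registerTempTable("records")
}

// Elsewhere, possibly from a different thread:
val top10 = sqlContext.sql("SELECT word, count FROM records ORDER BY count DESC LIMIT 10")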
So you have come across spark.streaming.concurrentJobs already :)
Yeah, that is an undocumented feature that does allow multiple output
operations to be submitted in parallel. However, this is not made public for
the exact reasons that you realized - the semantics in case of stateful
operations is
Can you give me the whole logs?
TD
On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg jonrgr...@gmail.com wrote:
OK that worked and getting close here ... the job ran successfully for a
bit and I got output for the first couple buckets before getting a
java.lang.Exception: Could not compute split,
Thanks for looking into it.
On Thu, Feb 12, 2015 at 8:10 PM, Tathagata Das t...@databricks.com
wrote:
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count
Could you come up with a minimal example through which I can reproduce the
problem?
On Tue, Feb 10, 2015 at 12:30 PM, conor fennell.co...@gmail.com wrote:
I am getting the following error when I kill the spark driver and restart
the job:
15/02/10 17:31:05 INFO CheckpointReader: Attempting to
Sorry for the late response. With the amount of data you are planning to join,
any system would take time. However, between Hive's MapReduce joins,
Spark's basic shuffle, and Spark SQL's join, the latter wins hands down.
Furthermore, with the APIs of Spark and Spark Streaming, you will have to
do
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count,
then the delay is stable. Right?
2. If so, then Kafka is the bottleneck. Is the number of partitions, that
you spoke of
I don't think your screenshots came through in the email. Nonetheless,
queueStream will not work with getOrCreate. It's mainly for testing (by
generating your own RDDs) and not really useful for production usage (where
you really need checkpoint-based recovery).
TD
On Thu, Feb 5, 2015 at 4:12
Hello Sachin,
While Akhil's solution is correct, this is not sufficient for your use case.
RDD.foreach (that Akhil is using) will run on the workers, but you are
creating the Producer object on the driver. This will not work; a producer
created on the driver cannot be used from the worker/executor.
This is an issue that is hard to resolve without rearchitecting the whole
Kafka Receiver. There are some workarounds worth looking into.
http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E
On Mon, Feb 2, 2015
That is a known issue uncovered last week. It fails on certain
environments, not on Jenkins which is our testing environment.
There is already a PR up to fix it. For now you can build using mvn
package -DskipTests
TD
On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman
andrew.mussel...@gmail.com
Ohhh nice! Would be great if you can share us some code soon. It is
indeed a very complicated problem and there is probably no single
solution that fits all usecases. So having one way of doing things
would be a great reference. Looking forward to that!
On Wed, Jan 28, 2015 at 4:52 PM, Tobias
You could use foreachRDD to do the operations and then inside the
foreachRDD create an accumulator to gather all the errors together
dstream.foreachRDD { rdd =>
  val accumulator = new Accumulator[...]
  rdd.map { ... }.count() // whatever operation that is error prone
  // gather all errors
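Filled out a bit, under the assumption that the per-record processing is a placeholder:

dstream.foreachRDD { rdd =>
  // Accumulator to count failures in this batch; its value is readable on the driver.
  val errorCount = rdd.sparkContext.accumulator(0L, "errors")
  rdd.foreach { record =>
    try {
      process(record)                       // hypothetical, error-prone operation
    } catch {
      case e: Exception => errorCount += 1L
    }
  }
  println(s"Errors in this batch: ${errorCount.value}")
}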
Hello mingyu,
That is a reasonable way of doing this. Spark Streaming natively does
not support sticky because Spark launches tasks based on data
locality. If there is no locality (example reduce tasks can run
anywhere), location is randomly assigned. So the cogroup or join
introduces a locality
This is not normal. It's a huge scheduling delay!! Can you tell me more
about the application?
- cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input
Whats your spark-submit commands in both cases? Is it Spark Standalone or
YARN (both support client and cluster)? Accordingly what is the number of
executors/cores requested?
TD
On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote:
Also the job was deployed from the master
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
Hi
Could Spark-SQL be used from within a custom actor
For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.
TD
On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can use reduceByKeyAndWindow for that. Here's a pretty clean example
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.
2. Yes, the receiver will be started on
That is kind of expected due to data locality. Though you should see
some tasks running on the executors as the data gets replicated to
other nodes, which can therefore run tasks based on locality. You have
two solutions:
1. kafkaStream.repartition() to explicitly repartition the received
data
Which version of Spark Streaming are you using?
When the batch processing time increases to 15-20 seconds, could you
compare the task times to the task times when the application
is just launched? Basically, is the increase from 6 seconds to 15-20
seconds caused by an increase in
Another good starting point for playing with updateStateByKey is the example
StatefulNetworkWordCount. See the streaming examples directory in the
Spark repository.
TD
On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb
richard.pierce.l...@gmail.com wrote:
I am trying to run stateful Spark Streaming
A more updated version of the streaming programming guide is here
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
Please refer to this until we make the official release of Spark 1.2
TD
On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com
Yes, socketTextStream starts a TCP client that tries to connect to a
TCP server (localhost: in your case). If there is a server running
on that port that can send data to connected TCP connections, then you
will receive data in the stream.
Did you check out the quick example in the streaming
Following Gerard's thoughts, here are possible things that could be happening.
1. Is there another process in the background that is deleting files
in the directory where you are trying to write? Seems like the
temporary file generated by one of the tasks is getting deleted before
it is renamed to
What does process do? Maybe when this process function is being run in
the Spark executor, it is causing some static initialization,
which fails, causing this exception. Per the Oracle documentation,
an ExceptionInInitializerError is thrown to indicate that an exception
occurred during evaluation
You could create a lazily initialized singleton factory and connection
pool. Whenever an executor starts running the first task that needs to
push out data, it will create the connection pool as a singleton. And
subsequent tasks running on the executor are going to use the
connection pool. You will
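A minimal sketch of that pattern, using a single lazily created JDBC connection as a stand-in for a real pool (URL and SQL are made up):

import java.sql.{Connection, DriverManager}

// One lazily initialized connection (or pool) per executor JVM: the object is
// instantiated on first use by a task, and reused by subsequent tasks.
object ConnectionFactory {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host/mydb")   // hypothetical URL
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = ConnectionFactory.connection        // created once per executor, then reused
    records.foreach { record =>
      val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")  // hypothetical table
      stmt.setString(1, record.toString)
      stmt.executeUpdate()
      stmt.close()
    }
  }
}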
I am updating the docs right now. Here is a staged copy that you can
have sneak peek of. This will be part of the Spark 1.2.
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
The updated fault-tolerance section tries to simplify the explanation
of when and what data
Also, this is covered in the streaming programming guide in bits and pieces.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote:
That makes sense. I'll try that.
Thanks :)
Aditya, I think you have the mental model of spark streaming a little
off the mark. Unlike traditional streaming systems, where any kind of
state is mutable, Spark Streaming is designed on Spark's immutable RDDs.
Streaming data is received and divided into immutable blocks, which then
form immutable RDDs,
First of all, how long do you want to keep doing this? The data is
going to increase infinitely and without any bounds; it's going to get
too big for any cluster to handle. If all that is within bounds, then
try the following.
- Maintain a global variable having the current RDD storing all the
log