Save org.apache.spark.mllib.linalg.Matrix to a file

2015-04-15 Thread Spico Florin
Hello! The result of correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix. (see http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations) val data: RDD[Vector] = ... val correlMatrix: Matrix = Statistics.corr(data, "pearson") I would like to save the
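A minimal sketch of one way to do this, assuming the `sc` and `data: RDD[Vector]` from the snippet above; the output path is illustrative. A MLlib Matrix is a local driver-side object, so each row can be rendered as a CSV line and the lines saved as text:

    import org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.stat.Statistics

    val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    // turn each row of the local matrix into one comma-separated line
    val csvLines = (0 until correlMatrix.numRows).map { i =>
      (0 until correlMatrix.numCols).map(j => correlMatrix(i, j)).mkString(",")
    }
    // write the lines out through a single-partition RDD
    sc.parallelize(csvLines, 1).saveAsTextFile("hdfs:///tmp/correlation-matrix")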

Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
Hi All, I am getting the below exception while using Kryo serialization with a broadcast variable. I am broadcasting a hashmap with the below lines: Map<Long, MatcherReleventData> matchData = RddForMarch.collectAsMap(); final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal = jsc.broadcast(matchData);

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
Yes, without Kryo it did work out. When I removed the Kryo registration it worked out. On 15 April 2015 at 19:24, Jeetendra Gangele gangele...@gmail.com wrote: it's not working in combination with Broadcast. Without Kryo it's also not working. On 15 April 2015 at 19:20, Akhil Das

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Akhil Das
Is it working without Kryo? Thanks Best Regards On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, I am getting the below exception while using Kryo serialization with a broadcast variable. I am broadcasting a hashmap with the below lines: Map<Long, MatcherReleventData>

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
This looks like a known issue? Check this out: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-api-java-JavaUtils-SerializableMapWrapper-no-valid-cor-td20034.html Can you please suggest any workaround? I am broadcasting a HashMap returned from

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Imran Rashid
This is a really strange exception ... I'm especially surprised that it doesn't work w/ java serialization. Do you think you could try to boil it down to a minimal example? On Wed, Apr 15, 2015 at 8:58 AM, Jeetendra Gangele gangele...@gmail.com wrote: Yes, without Kryo it did work out. When I

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Imran Rashid
Oh interesting. The suggested workaround is to wrap the result from collectAsMap into another hashmap; you should try that: Map<Long, MatcherReleventData> matchData = RddForMarch.collectAsMap(); Map<Long, MatcherReleventData> tmp = new HashMap<Long, MatcherReleventData>(matchData); final Broadcast<Map<Long,
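The thread's code is Java; the following is a hedged Scala analogue of the same wrapping idea (RddForMarch, MatcherReleventData, and `sc` are names taken from or assumed by the thread). The point is to copy the map returned by collectAsMap into a plain HashMap so the broadcast does not carry the collectAsMap wrapper type:

    import scala.collection.JavaConverters._

    // copy the collected map into a plain java.util.HashMap before broadcasting
    val matchData = RddForMarch.collectAsMap()   // scala.collection.Map[Long, MatcherReleventData]
    val tmp = new java.util.HashMap[Long, MatcherReleventData](matchData.asJava)
    val dataMatchGlobal = sc.broadcast(tmp)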

Re: Microsoft SQL jdbc support from spark sql

2015-04-15 Thread ARose
I have found that it works if you place the sqljdbc41.jar directly in the following folder: YOUR_SPARK_HOME/core/target/jars/ So Spark will have the SQL Server jdbc driver when it computes its classpath. -- View this message in context:

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
It's not working in combination with Broadcast. Without Kryo it's also not working. On 15 April 2015 at 19:20, Akhil Das ak...@sigmoidanalytics.com wrote: Is it working without Kryo? Thanks Best Regards On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All

How to get a clean DataFrame schema merge

2015-04-15 Thread Jaonary Rabarisoa
Hi all, If you follow the example of schema merging in the Spark documentation http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging you obtain the following result when you load the resulting data: single triple double 1 3 null 2 6 null 4

RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
Tried with the 1.3.0 release (built myself) and the most recent 1.3.1 snapshot off the 1.3 branch. Haven't tried with 1.4/master. From: Wang, Daoyuan [daoyuan.w...@intel.com] Sent: Wednesday, April 15, 2015 5:22 PM To: Nathan McCarthy; user@spark.apache.org Subject:

Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
This worked with Java serialization. I am using 1.2.0; you are right, if I use 1.2.1 or 1.3.0 this issue will not occur. I will test this and let you know. On 15 April 2015 at 19:48, Imran Rashid iras...@cloudera.com wrote: oh interesting. The suggested workaround is to wrap the result from

Re: How to get a clean DataFrame schema merge

2015-04-15 Thread Michael Armbrust
Schema merging is not the feature you are looking for. It is designed for when you are adding new records (that are not associated with old records), which may or may not have new or missing columns. In your case it looks like you have two datasets that you want to load separately and join on a key.
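A minimal sketch of the alternative Michael suggests, assuming a Spark 1.3-era SQLContext named `sqlContext`; the paths and the "key" column are illustrative, not from the thread:

    // load the two datasets separately instead of relying on schema merging
    val singles = sqlContext.parquetFile("hdfs:///data/singles")
    val triples = sqlContext.parquetFile("hdfs:///data/triples")
    // join them explicitly on the shared key column
    val joined  = singles.join(triples, singles("key") === triples("key"))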

Re: Running beyond physical memory limits

2015-04-15 Thread Sandy Ryza
The setting to increase is spark.yarn.executor.memoryOverhead. On Wed, Apr 15, 2015 at 6:35 AM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hello Sean Owen, Thanks for your reply.. I'll increase the overhead memory and check it.. By the way, any difference between 1.1 and 1.2 makes,
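A minimal sketch of raising the overhead Sandy mentions, set in code; the 1024 MB value is illustrative and should be tuned to the job (it can equally be passed via --conf on spark-submit):

    // reserve extra off-heap headroom beyond the JVM heap in each YARN container
    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")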

Re: multinomial and Bernoulli model in NaiveBayes

2015-04-15 Thread Xiangrui Meng
CC Leah, who added the Bernoulli option to MLlib's NaiveBayes. -Xiangrui On Wed, Apr 15, 2015 at 4:49 AM, 姜林和 linhe_ji...@163.com wrote: Dear Meng: Thanks for the great work on Spark machine learning, and I saw the changes to the NaiveBayes algorithm, separating the algorithm into: multinomial

Spark 1.3 saveAsTextFile with codec gives error - works with Spark 1.2

2015-04-15 Thread Manoj Samel
Env - Spark 1.3, Hadoop 2.3, Kerberos. xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2 in the same environment. val codec = classOf[some codec class] val a = sc.textFile("/some_hdfs_file") a.saveAsTextFile("/some_other_hdfs_file", codec) fails with the following trace in Spark
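A reconstructed sketch of the calls described above, assuming `sc` and with a concrete codec class filled in purely for illustration (the thread does not say which codec was used):

    import org.apache.hadoop.io.compress.GzipCodec

    val codec = classOf[GzipCodec]
    val a = sc.textFile("/some_hdfs_file")
    // write the output compressed with the chosen Hadoop codec
    a.saveAsTextFile("/some_other_hdfs_file", codec)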

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
What do you mean by batch RDD? They're just RDDs, though they store their data in different ways and come from different sources. You can union an RDD from an HDFS file with one from a DStream. It sounds like you want streaming data to live longer than its batch interval, but that's not something you

[SparkSQL; Thriftserver] Help tracking missing 5 minutes

2015-04-15 Thread Yana Kadiyska
Hi Spark users, Trying to upgrade to Spark 1.2 and running into the following: seeing some very slow queries and wondering if someone can point me in the right direction for debugging. My Spark UI shows a job with duration 15s (see attached screenshot), which would be great, but client side

adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
The only way to join / union / cogroup a DStream RDD with a batch RDD is via the transform method, which returns another DStream RDD and hence it gets discarded at the end of the micro-batch. Is there any way to e.g. union a DStream RDD with a batch RDD which produces a new batch RDD containing the

exception during foreach run

2015-04-15 Thread Jeetendra Gangele
Hi All, I am getting the below exception while running foreach after zipWithIndex, flatMapValues, flatMapValues. Inside the foreach I'm doing a lookup in a broadcast variable. java.util.concurrent.RejectedExecutionException: Worker has already been shutdown at

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
Significant optimizations can be made by doing the joining/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, then you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That would make sure
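A sketch of the optimization TD describes; the partition count and the RDD/DStream names (`batchPairRdd`, `pairDStream`) are illustrative. Pre-partition and cache the batch side once, then partition each micro-batch the same way so the join avoids reshuffling the batch RDD every interval:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2)

    val partitioner = new HashPartitioner(32)
    // partition and cache the batch side once, up front
    val batchPartitioned = batchPairRdd.partitionBy(partitioner).cache()

    val joinedStream = pairDStream.transform { rdd =>
      // co-partition each micro-batch with the cached batch RDD before joining
      rdd.partitionBy(partitioner).join(batchPartitioned)
    }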

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
That has been done Sir and represents further optimizations – the objective here was to confirm whether cogroup always results in the previously described “greedy” explosion of the number of elements included and RAM allocated for the result RDD The optimizations mentioned still don’t

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
Agreed. On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov evo.efti...@isecc.com wrote: That has been done Sir and represents further optimizations – the objective here was to confirm whether cogroup always results in the previously described “greedy” explosion of the number of elements included

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
Thank you Sir, and one final confirmation/clarification - are all forms of joins in the Spark API for DStream RDDs based on cogroup in terms of their internal implementation? From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 9:48 PM To: Evo Eftimov Cc: user

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
I keep seeing only common statements Re DStream RDDs and Batch RDDs - There is certainly something to keep me from using them together and it is the OO API differences I have described previously, several times ... Re the batch RDD reloading from file and that there is no need for threads -

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
Yep, you are looking at operations on DStream, which is not what I'm talking about. You should look at DStream.foreachRDD (or Java equivalent), which hands you an RDD. Makes more sense? The rest may make more sense when you try it. There is actually a lot less complexity than you think. On Wed,
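A sketch of the pattern Sean describes, assuming a `sc`, a `dstream` of strings, and an HDFS path that are all illustrative: foreachRDD hands you a plain RDD, which can then be unioned with a batch RDD loaded from persistent storage.

    // batch RDD loaded once from persistent storage and cached
    val batchRdd = sc.textFile("hdfs:///data/historical").cache()

    dstream.foreachRDD { rdd =>
      // inside foreachRDD this is an ordinary RDD, so union works directly
      val combined = batchRdd.union(rdd)
      println(s"combined count: ${combined.count()}")
    }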

RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
There are indications that joins in Spark are implemented with / based on the cogroup function/primitive/transform. So let me focus first on cogroup - it returns a result which is an RDD consisting of essentially ALL elements of the cogrouped RDDs. Said in another way - for every key in each of the

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
Hi Sean, well there is certainly a difference between a batch RDD and a streaming RDD, and in the previous reply you have already outlined some. Other differences are in the Object Oriented Model / API of Spark, which also matters besides the RDD / Spark Cluster Platform architecture. Secondly, in

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
Yes, I mean there's nothing to keep you from using them together other than their very different lifetime. That's probably the key here: if you need the streaming data to live a long time it has to live in persistent storage first. I do exactly this and what you describe for the same purpose. I

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
What API differences are you talking about? A DStream gives a sequence of RDDs. I'm not referring to DStream or its API. Spark in general can execute many pipelines at once, ones that even refer to the same RDD. What I mean is, you seem to be looking for a way to change one shared RDD, but in fact,

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
The OO API in question was mentioned several times - the transform method of a DStream RDD, which is the ONLY way to join/cogroup/union a DStream RDD with a batch RDD aka JavaRDD. Here is a paste from the Spark javadoc: <K2,V2> JavaPairDStream<K2,V2> transformToPair(Function<R,JavaPairRDD<K2,V2>>
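A Scala sketch of the transform-based join the thread keeps referring to (the `pairDStream` and `batchPairRdd` names are illustrative): transform exposes each micro-batch RDD so it can be joined with a batch pair RDD, but the result is again a DStream, which is Evo's point.

    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2)

    // the join runs per micro-batch; the output is another DStream, not a batch RDD
    val joinedStream = pairDStream.transform { rdd =>
      rdd.join(batchPairRdd)
    }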

aliasing aggregate columns?

2015-04-15 Thread elliott cordo
Hi Guys - Having trouble figuring out the semantics for using the alias function on the final sum and count aggregations? cool_summary = reviews.select(reviews.user_id, cool_cnt(votes.cool).alias("cool_cnt")).groupBy("user_id").agg({"cool_cnt":"sum","*":"count"}) cool_summary DataFrame[user_id: string,
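The question uses the Python API; below is a hedged Scala sketch of the same idea, assuming a `reviews` DataFrame that already has `user_id` and `cool_cnt` columns. The column-based form of agg lets each aggregate carry an explicit alias (the alias names here are illustrative), unlike the map-based agg which generates names such as SUM(cool_cnt):

    import org.apache.spark.sql.functions.{sum, count}

    val coolSummary = reviews
      .groupBy("user_id")
      .agg(sum("cool_cnt").as("total_cool"), count("user_id").as("review_count"))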

Re: spark job progress-style report on console ?

2015-04-15 Thread syepes
Just add the following line spark.ui.showConsoleProgress true to your conf/spark-defaults.conf file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-job-progress-style-report-on-console-tp22440p22506.html Sent from the Apache Spark User List mailing

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
Well, DStream joins are nothing but RDD joins at their core. However, there are more optimizations available if you use DataFrames and Spark SQL joins. With the schema, there is a greater scope for optimizing the joins. So converting the RDDs from streaming and the batch RDDs to data frames, and then applying
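A hedged sketch of TD's DataFrame suggestion (requires Spark 1.3+; the `Record` case class, the `sqlContext`, and the RDD/DStream names are illustrative). Both the batch RDD and each streaming micro-batch are converted to DataFrames and joined through Spark SQL:

    case class Record(key: Long, value: String)

    // batch side converted to a DataFrame once
    val batchDF = sqlContext.createDataFrame(batchRdd.map { case (k, v) => Record(k, v) })

    dstream.foreachRDD { rdd =>
      // each micro-batch converted to a DataFrame and joined with the batch side
      val streamDF = sqlContext.createDataFrame(rdd.map { case (k, v) => Record(k, v) })
      val joined = streamDF.join(batchDF, streamDF("key") === batchDF("key"))
      // continue with DataFrame / Spark SQL operations on `joined`
    }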

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
Data Frames are available from the latest 1.3 release I believe – in 1.2 (our case at the moment) I guess the options are more limited. PS: agree that DStreams are just an abstraction for a sequence / stream of (ordinary) RDDs – when I use “DStreams” I mean the DStream OO API in Spark, not

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-15 Thread Tathagata Das
Can you clarify more on what you want to do after querying? Is the batch not completed until the querying and subsequent processing has completed? On Tue, Apr 14, 2015 at 10:36 PM, Krzysztof Zarzycki k.zarzy...@gmail.com wrote: Thank you Tathagata, very helpful answer. Though, I would like

Passing Elastic Search Mappings in Spark Conf

2015-04-15 Thread Deepak Subhramanian
Hi, Is there a way to pass the mapping to define a field as not analyzed with es-spark settings? I am just wondering if I can set the mapping type for a field as not analyzed using the set function in SparkConf, similar to the other ES settings. val sconf = new SparkConf()

Re: How to do dispatching in Streaming?

2015-04-15 Thread Tathagata Das
It may be worthwhile to architect the computation in a different way. dstream.foreachRDD { rdd => rdd.foreach { record => // do different things for each record based on filters } } TD On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I have a
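A slightly expanded sketch of the pattern TD shows above; the record type (strings) and the filter predicates are illustrative:

    dstream.foreachRDD { rdd =>
      rdd.foreach { record =>
        // dispatch per record instead of creating one filtered DStream per category
        if (record.contains("typeA")) {
          // handle category A
        } else if (record.contains("typeB")) {
          // handle category B
        }
      }
    }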

Dataset announcement

2015-04-15 Thread Olivier Chapelle
Dear Spark users, I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released; see the following blog announcements: - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/ -

Re: Passing Elastic Search Mappings in Spark Conf

2015-04-15 Thread Nick Pentreath
If you want to specify mapping you must first create the mappings for your index types before indexing. As far as I know there is no way to specify this via ES-hadoop. But it's best practice to explicitly create mappings prior to indexing, or to use index templates when dynamically creating

Re: Dataset announcement

2015-04-15 Thread Simon Edelhaus
Greetings! How about medical data sets, and specifically longitudinal vital signs. Can people send good pointers? Thanks in advance, -- ttfn Simon Edelhaus California 2015 On Wed, Apr 15, 2015 at 6:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Very neat, Olivier; thanks for sharing

Re: Actor not found

2015-04-15 Thread Canoe
13119 Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkdri...@dmslave13.et2.tbsite.net:5908/), Path(/user/OutputCommitCoordinator)] 13120 at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)

Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this. Matei On Apr 15, 2015, at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark users, I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released;

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
The problem lies with getting the driver classes into the primordial class loader when running on YARN. Basically I need to somehow set the SPARK_CLASSPATH or compute_classpath.sh when running on YARN. I’m not sure how to do this when YARN is handling all the file copy. From: Nathan

Re: Can Spark 1.0.2 run on CDH-4.3.0 with yarn? And Will Spark 1.2.0 support CDH5.1.2 with yarn?

2015-04-15 Thread Canoe
Now we have Spark 1.3.0 on CDH 5.1.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-Spark-1-0-2-run-on-CDH-4-3-0-with-yarn-And-Will-Spark-1-2-0-support-CDH5-1-2-with-yarn-tp20760p22509.html Sent from the Apache Spark User List mailing list archive at

Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work. Interesting in more than one way - including training, benchmarking, testing new releases et al. One quick question - do you plan to make it available as an S3 bucket? Cheers k/ On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread ๏̯͡๏
Can you provide the JDBC connector jar version? Possibly the full JAR name and the full command you ran Spark with? On Wed, Apr 15, 2015 at 11:27 AM, Nathan McCarthy nathan.mccar...@quantium.com.au wrote: Just an update, tried with the old JdbcRDD and that worked fine. From: Nathan

Parquet Partition Size are different when using Dataframe's save append funciton

2015-04-15 Thread 顾亮亮
Hi, When I use DataFrame's save append function, I find that the parquet partition sizes are very different. Part-r-00001 to 00021 are generated the first time the save append function is called. Part-r-00022 to 00042 are generated the second time the save append function is called. As you can
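A sketch of the append-mode save being discussed, using the Spark 1.3-era API and an illustrative path; each append call writes its own set of part files, which is why the two calls produce separately numbered part files:

    import org.apache.spark.sql.SaveMode

    // second and later calls with Append add new part files next to the existing ones
    df.save("hdfs:///warehouse/my_table", "parquet", SaveMode.Append)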

Re: Saving RDDs as custom output format

2015-04-15 Thread Akhil Das
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile Thanks Best Regards On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
Yes, only "Time: 142905487 ms" strings get printed on the console. No output is getting printed. And the time interval between two strings of the form (time: ms) is much less than the streaming duration set in the program. On Wed, Apr 15, 2015 at 5:11 AM, Shixiong Zhu zsxw...@gmail.com wrote: Could you see

Re: spark streaming printing no output

2015-04-15 Thread Akhil Das
Just make sure you have at least 2 cores available for processing. You can try launching it in local[2] and make sure it's working fine. Thanks Best Regards On Tue, Apr 14, 2015 at 11:41 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi I am running a spark streaming application but on
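A minimal sketch of what Akhil suggests (the app name and batch interval are illustrative): give the local master at least two cores so the receiver and the processing do not starve each other.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2] = one core for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingCheck")
    val ssc = new StreamingContext(conf, Seconds(10))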

RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Wang, Daoyuan
Can you provide your spark version? Thanks, Daoyuan From: Nathan McCarthy [mailto:nathan.mccar...@quantium.com.au] Sent: Wednesday, April 15, 2015 1:57 PM To: Nathan McCarthy; user@spark.apache.org Subject: Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Just an update,

Re: OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Xianjin YE
What are your JVM heap size settings? The OOM in SizeEstimator is caused by a lot of entries in an IdentityHashMap. A quick guess is that the objects in your dataset are of a custom class and you didn't implement the hashCode and equals methods correctly. On Wednesday, April 15, 2015 at 3:10 PM,

Re: spark streaming printing no output

2015-04-15 Thread Shixiong Zhu
So the time interval is much less than the 1000 ms you set in the code? That's weird. Could you check the whole output to confirm that the content isn't being flushed away by logs? Best Regards, Shixiong(Ryan) Zhu 2015-04-15 15:04 GMT+08:00 Shushant Arora shushantaror...@gmail.com: Yes only Time:

Re: Running Spark on Gateway - Connecting to Resource Manager Retries

2015-04-15 Thread Akhil Das
Make sure your YARN service is running on 8032. Thanks Best Regards On Tue, Apr 14, 2015 at 12:35 PM, Vineet Mishra clearmido...@gmail.com wrote: Hi Team, I am running the Spark Word Count example ( https://github.com/sryza/simplesparkapp ); if I go with master as local it works fine. But when

OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
I am aggregating a dataset using the combineByKey method and for a certain input size, the job fails with the following error. I have enabled heap dumps to better analyze the issue and will report back if I have any findings. Meanwhile, if you guys have any idea of what could possibly result in this

Do multiple ipython notebooks work on yarn in one cluster?

2015-04-15 Thread aihe
My colleagues and I have been working on Spark recently. We just set up a new cluster on YARN over which we can run Spark. We basically use IPython and write programs in notebooks served on a specific port (like ) via HTTP. We have our own notebooks, and the odd thing is that if I run my notebook first, my

Re: Running Spark on Gateway - Connecting to Resource Manager Retries

2015-04-15 Thread Vineet Mishra
Hi Akhil, It's running fine when run through the Namenode (RM) but fails while running through the Gateway; if I add the hadoop-core jars to the hadoop directory (/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/) it works fine. It's really strange, as I am running the job through Spark-Submit

Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
When I launched spark-shell using spark-shell --master local[2], same behaviour: no output on the console but only timestamps. When I did lines.saveAsTextFiles(hdfslocation, suffix); I get empty files of 0 bytes on HDFS. On Wed, Apr 15, 2015 at 12:46 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
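A sketch of the setup being debugged, assuming a StreamingContext `ssc`; the host, port, and output prefix are illustrative. The batches are both printed on the driver and written to HDFS:

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()                                    // prints the first records of each batch on the driver
    lines.saveAsTextFiles("hdfs:///tmp/stream-out")  // one output directory per batch interval
    ssc.start()
    ssc.awaitTermination()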

Re: Re: spark streaming printing no output

2015-04-15 Thread bit1...@163.com
Looks like the message is consumed by the other console? (You can see messages typed on this port from another console.) bit1...@163.com From: Shushant Arora Date: 2015-04-15 17:11 To: Akhil Das CC: user@spark.apache.org Subject: Re: spark streaming printing no output When I launched spark-shell

Re: Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
It's printing on the console but on HDFS all folders are still empty. On Wed, Apr 15, 2015 at 2:29 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks !! Yes, messages typed on this console are seen on another console. When I closed the other console, the spark streaming job is printing messages on

RE: Running beyond physical memory limits

2015-04-15 Thread Brahma Reddy Battula
Thanks a lot for your reply.. There is no issue with Spark 1.1. The following issue came when I upgraded to Spark 1.2, hence I did not decrease spark.executor.memory... I mean to say, I used the same config for Spark 1.1 and Spark 1.2.. Is there any issue with Spark 1.2..? Or does YARN lead to this..? And why

Re: Running beyond physical memory limits

2015-04-15 Thread Sean Owen
This is not related to executor memory, but the extra overhead subtracted from the executor's size in order to avoid using more than the physical memory that YARN allows. That is, if you declare a 32G executor YARN lets you use 32G physical memory but your JVM heap must be significantly less than

Running beyond physical memory limits

2015-04-15 Thread Brahma Reddy Battula
Hello Sparkers, I am a newbie to Spark and need help.. We are using Spark 1.2, we are getting the following error and the executor is getting killed.. I have seen SPARK-1930 and it should be in 1.2.. Any pointer to the following error, like what might lead to this error.. 2015-04-15 11:55:39,697 | WARN |

Re: Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
Thanks !! Yes, messages typed on this console are seen on another console. When I closed the other console, the spark streaming job is printing messages on the console. Isn't a message written to a port using netcat available to multiple consumers? On Wed, Apr 15, 2015 at 2:22 PM, bit1...@163.com

Re: Running beyond physical memory limits

2015-04-15 Thread Akhil Das
Did you try reducing your spark.executor.memory? Thanks Best Regards On Wed, Apr 15, 2015 at 2:29 PM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hello Sparkers I am newbie to spark and need help.. We are using spark 1.2, we are getting the following error and executor is

Re: Running beyond physical memory limits

2015-04-15 Thread Sean Owen
All this means is that your JVM is using more memory than it requested from YARN. You need to increase the YARN memory overhead setting, perhaps. On Wed, Apr 15, 2015 at 9:59 AM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hello Sparkers I am newbie to spark and need help.. We

Re: spark streaming with kafka

2015-04-15 Thread Akhil Das
Once you start your streaming application to read from Kafka, it will launch receivers on the executor nodes. And you can see them on the streaming tab of your driver UI (runs on 4040). These receivers will be fixed till the end of your pipeline (unless it crashes etc.)
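A sketch of the receiver-based Kafka input Akhil is describing, assuming a StreamingContext `ssc`; the ZooKeeper quorum, consumer group, and topic map are illustrative. Each entry in the topic map asks for that many consumer threads inside the single receiver this call creates:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStream = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))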

Re: Opening many Parquet files = slow

2015-04-15 Thread Masf
Hi guys, Regarding Parquet files: I have Spark 1.2.0 and reading 27 parquet files (250MB/file) takes 4 minutes. I have a cluster with 4 nodes and it seems too slow to me. The load function is not available in Spark 1.2, so I can't test it. Regards. Miguel. On Mon, Apr 13, 2015 at 8:12 PM,

Re: spark streaming with kafka

2015-04-15 Thread Shushant Arora
So receivers will be fixed for every run of the streaming interval job? Say I have set the stream duration to be 10 minutes; then after each 10 minutes a job will be created, and the same executor nodes, say in your case (spark-akhil-slave2.c.neat-axis-616.internal and spark-akhil-slave1.c.neat-axis-616.internal)

Re: OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
I am setting spark.executor.memory to 1024m on a 3-node cluster with each node having 4 cores and 7 GB RAM. The combiner functions take scala case classes as input and generate a mutable.ListBuffer of scala case classes. Therefore, I am guessing hashCode and equals should be taken care
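A sketch of a combineByKey aggregation of the shape Aniket describes, assuming an illustrative `Event` case class and a `pairs: RDD[(Long, Event)]` (with the pair-RDD implicits in scope): the combiners build a mutable ListBuffer of case-class values per key.

    import scala.collection.mutable.ListBuffer

    case class Event(id: Long, payload: String)

    val grouped = pairs.combineByKey[ListBuffer[Event]](
      (e: Event) => ListBuffer(e),                                   // createCombiner
      (buf: ListBuffer[Event], e: Event) => buf += e,                // mergeValue
      (b1: ListBuffer[Event], b2: ListBuffer[Event]) => b1 ++= b2    // mergeCombiners
    )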

spark streaming with kafka

2015-04-15 Thread Shushant Arora
Hi, I want to understand the flow of Spark Streaming with Kafka. In Spark Streaming, are the executor nodes the same at each run of the streaming interval, or does the cluster manager assign new executor nodes for processing each batch input? If yes, then at each batch interval new executors

Re: Re: spark streaming with kafka

2015-04-15 Thread Akhil Das
@Shushant: In my case, the receivers will be fixed till the end of the application. This one's for the Kafka case only; if you have a fileStream application, you will not have any receivers. Also, for Kafka, the next time you run the application it's not fixed that the receivers will get launched on the