RDD.cacheDataSet() not working intermittently

2017-05-09 Thread jasbir.sing
Hi, I have a scenario in which I am caching my RDDs for future use. But I observed that when I use my RDD, the complete DAG is re-executed and the RDD gets created again. How can I avoid this scenario and make sure that RDD.cacheDataSet() caches the RDD every time? Regards, Jasbir Singh
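A minimal sketch of the usual caching pattern, assuming the intent behind cacheDataSet() matches the standard RDD persist()/cache() API (the input path and parsing below are hypothetical): persisting is lazy, so the RDD is only materialized, and later reused, after an action has run at least once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input; persist() only marks the RDD for caching, it does not compute it.
    val rdd = sc.textFile("hdfs:///data/events")
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Caching is lazy: run one action so the partitions are actually materialized.
    rdd.count()

    // Later actions reuse the cached partitions instead of re-running the whole DAG
    // (unless executors were lost or partitions were evicted under memory pressure).
    println(rdd.filter(_.length > 3).count())

    spark.stop()
  }
}
```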

Partitioning strategy

2017-04-02 Thread jasbir.sing
Hi, I have an RDD with 4 years' worth of data in, say, 20 partitions. At runtime, the user can select a few months or years of the RDD. That means the RDD is filtered based on the user's time selection, and further transformations and actions are performed on the filtered RDD. And, as Spark says, the child RDD
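One possible approach, sketched below under an assumed column layout and hypothetical paths: key the records by month and partition the RDD explicitly, so a user's time selection becomes a cheap filter over an already-partitioned (and optionally persisted) parent.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object TimeFilterSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("time-filter").getOrCreate().sparkContext

    // Hypothetical records: first CSV column is a date, keyed here by "yyyy-MM".
    val events = sc.textFile("hdfs:///data/events")
      .map { line =>
        val cols = line.split(",")
        (cols(0).take(7), line)
      }
      .partitionBy(new HashPartitioner(48))   // e.g. 4 years * 12 months
      .persist()

    // A selection of a few months becomes a narrow filter over the keyed RDD;
    // further transformations and actions then run on the filtered child RDD.
    val selected = Set("2016-06", "2016-07")
    val filtered = events.filter { case (yearMonth, _) => selected.contains(yearMonth) }
    println(filtered.count())
  }
}
```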

RE: Fast write datastore...

2017-03-15 Thread jasbir.sing
Hi, Will MongoDB not fit this solution?

RE: Check if dataframe is empty

2017-03-06 Thread jasbir.sing
Dataframe.take(1) is faster than counting the whole DataFrame.
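A minimal sketch of the comparison being made, with toy data: take(1) can stop as soon as a single row is found, whereas count() has to scan every partition.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyCheckSketch {
  // take(1) returns at most one row and short-circuits; count() scans everything.
  def isEmpty(df: DataFrame): Boolean = df.take(1).isEmpty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-check").getOrCreate()
    import spark.implicits._

    val df = Seq.empty[(Int, String)].toDF("id", "name")
    println(isEmpty(df))          // true
    println(df.count() == 0)      // also true, but touches every partition
    spark.stop()
  }
}
```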

RE: Having multiple spark context

2017-01-30 Thread jasbir.sing
Is there any way in which my application can connect to multiple Spark clusters? Or is communication between Spark clusters possible? Regards, Jasbir

Having multiple spark context

2017-01-29 Thread jasbir.sing
Hi, I have a requirement in which my application creates one Spark context in distributed mode and another Spark context in local mode. When I do this, my complete application works on only one SparkContext (the one created in distributed mode); the second Spark context is not getting
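A hedged illustration of why the second context does not take effect: a JVM normally hosts a single active SparkContext, so asking for another one via getOrCreate simply hands back the existing instance (the master URL below is hypothetical).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoContextsSketch {
  def main(args: Array[String]): Unit = {
    // The first context wins: it becomes the active SparkContext for this JVM.
    val clusterCtx = new SparkContext(
      new SparkConf().setAppName("distributed").setMaster("spark://master:7077"))

    // getOrCreate does not build a second, local context; it returns the existing one,
    // which is why the application appears to run on only one SparkContext.
    val localCtx = SparkContext.getOrCreate(
      new SparkConf().setAppName("local").setMaster("local[*]"))

    println(clusterCtx eq localCtx)   // true: same instance
    clusterCtx.stop()
  }
}
```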

RE: Spark #cores

2017-01-18 Thread jasbir.sing
Are you talking about Spark SQL here? If yes, spark.sql.shuffle.partitions needs to be changed.
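A minimal sketch of where that setting applies, with toy data: spark.sql.shuffle.partitions controls how many partitions Spark SQL produces after a shuffle (groupBy, join, and so on); the value 64 below is an arbitrary example.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions")
      .config("spark.sql.shuffle.partitions", "64")   // default is 200
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // The groupBy introduces a shuffle, so the result has 64 partitions.
    val grouped = df.groupBy("key").count()
    println(grouped.rdd.getNumPartitions)
    spark.stop()
  }
}
```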

RE: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread jasbir.sing
Hi, coalesce is used to decrease the number of partitions. If you pass a numPartitions value greater than the current number of partitions, I don't think the RDD's partition count will be increased. Thanks, Jasbir
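A small sketch of the behaviour described above: without a shuffle, coalesce can only merge partitions, so growing the partition count requires repartition (or coalesce with shuffle = true).

```scala
import org.apache.spark.sql.SparkSession

object SplitPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("split-partitions").getOrCreate().sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)

    // coalesce without a shuffle can only merge partitions; asking for more is a no-op.
    println(rdd.coalesce(8).getNumPartitions)                  // still 4

    // To increase the partition count, the data has to be shuffled:
    println(rdd.repartition(8).getNumPartitions)               // 8
    println(rdd.coalesce(8, shuffle = true).getNumPartitions)  // 8
  }
}
```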