Re: Map-Side Join in Spark

2015-04-21 Thread ayan guha
If you are using a pair RDD, then you can use the partitionBy method to provide your partitioner. On 21 Apr 2015 15:04, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: What is re-partition? On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote: In my understanding you need to create a
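
A minimal Scala sketch of the partitionBy suggestion (toy data; the partition count is illustrative, and sc is the SparkContext as predefined in spark-shell):

  import org.apache.spark.HashPartitioner

  // Two pair RDDs keyed by itemId, mirroring the thread's (itemId, item) and (itemId, listing).
  val items    = sc.parallelize(Seq((1, "item-1"), (2, "item-2")))
  val listings = sc.parallelize(Seq((1, "listing-a"), (2, "listing-b")))

  // Hash-partition both RDDs with the same partitioner so matching keys are co-located;
  // a join between two RDDs that share a partitioner then needs no extra shuffle.
  val p = new HashPartitioner(4)
  val joined = items.partitionBy(p).join(listings.partitionBy(p))
  joined.collect().foreach(println)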

Re: Spark and accumulo

2015-04-21 Thread Akhil Das
You can simply use a custom InputFormat (AccumuloInputFormat) with the Hadoop RDDs (sc.newAPIHadoopFile etc.) for that; all you need to do is pass the jobConfs. Here's a pretty clean discussion:
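
A hedged sketch of the newAPIHadoopRDD route (assumes an existing SparkContext sc, and that configureAccumulo is your own helper that fills the Job configuration with instance, credentials and table settings via Accumulo's input-format methods — that part is not shown here):

  import org.apache.hadoop.mapreduce.Job
  import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
  import org.apache.accumulo.core.data.{Key, Value}

  // Build a Hadoop Job only to hold the jobConf that Spark needs.
  val job = Job.getInstance()
  configureAccumulo(job)   // hypothetical helper: connector info, table name, zookeepers

  // Hand the configured jobConf to Spark; each Accumulo split becomes an RDD partition.
  val accumuloRDD = sc.newAPIHadoopRDD(
    job.getConfiguration,
    classOf[AccumuloInputFormat],
    classOf[Key],
    classOf[Value])
  println(accumuloRDD.count())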

WebUI shows poor locality when scheduling tasks

2015-04-21 Thread eric wong
Hi, when running an experimental KMeans job, the cached RDD is the original points data. I saw poor locality in the task details on the WebUI: almost half of the task input is Network instead of Memory, and a task with network input consumes almost the same time compared with the task

Re: how to make a spark cluster ?

2015-04-21 Thread Reynold Xin
Actually if you only have one machine, just use the Spark local mode. Just download the Spark tarball, untar it, set master to local[N], where N = number of cores. You are good to go. There is no setup of job tracker or Hadoop. On Mon, Apr 20, 2015 at 3:21 PM, haihar nahak harihar1...@gmail.com
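
For reference, the programmatic equivalent of that suggestion, as a minimal sketch (app name and core count are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // local[4] runs driver and executors inside one JVM with 4 worker threads;
  // no standalone master, YARN or Hadoop setup is required.
  val conf = new SparkConf().setAppName("local-test").setMaster("local[4]")
  val sc = new SparkContext(conf)
  println(sc.parallelize(1 to 1000).sum())
  sc.stop()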

Re: Understanding the build params for spark with sbt.

2015-04-21 Thread Akhil Das
With Maven you could do something like: mvn -Dhadoop.version=2.3.0 -DskipTests clean package -pl core Thanks Best Regards On Mon, Apr 20, 2015 at 8:10 PM, Shiyao Ma i...@introo.me wrote: Hi. My usage only involves Spark core and HDFS, so no Spark SQL, MLlib or other components are involved. I saw

Re: Map-Side Join in Spark

2015-04-21 Thread ๏̯͡๏
These are pair RDDs: (itemId, item) and (itemId, listing). What do you mean by re-partitioning these RDDs? And what do you mean by 'your partitioner'? Can you elaborate? On Tue, Apr 21, 2015 at 11:18 AM, ayan guha guha.a...@gmail.com wrote: If you are using a pair RDD, then you can use the partitionBy

Re: meet weird exception when studying rdd caching

2015-04-21 Thread Akhil Das
It could be a similar issue as https://issues.apache.org/jira/browse/SPARK-4300 Thanks Best Regards On Tue, Apr 21, 2015 at 8:09 AM, donhoff_h 165612...@qq.com wrote: Hi, I am studying the RDD Caching function and write a small program to verify it. I run the program in a Spark1.3.0

Re: Custom partitioning of DStream

2015-04-21 Thread Akhil Das
I think DStream.transform is the one that you are looking for. Thanks Best Regards On Mon, Apr 20, 2015 at 9:42 PM, Evo Eftimov evo.efti...@isecc.com wrote: Is the only way to implement a custom partitioning of DStream via the foreach approach so to gain access to the actual RDDs comprising
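
A minimal sketch of the transform approach (assumes an existing DStream of key-value pairs called pairs; the partition count is illustrative):

  import org.apache.spark.HashPartitioner

  // transform exposes each batch's underlying RDD, so any RDD-level
  // partitioner can be applied per batch.
  val repartitioned = pairs.transform(rdd => rdd.partitionBy(new HashPartitioner(8)))
  repartitioned.foreachRDD(rdd => println(s"partitions: ${rdd.partitions.length}"))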

Re: Configuring logging properties for executor

2015-04-21 Thread Michael Ryabtsev
Hi, I would like to report on trying the first option proposed by Lan - putting the log4j.properties file under the root of my application jar. It doesn't look like it is working in my case: submitting the application to Spark from the application code (not with spark-submit). It seems that in

Re: Spark Scala Version?

2015-04-21 Thread Dean Wampler
Without the rest of your code it's hard to make sense of errors. Why do you need to use reflection? Make sure you use the same Scala version throughout; 2.10.4 is recommended. That's still the official version for Spark, even though provisional support for 2.11 exists. Dean Wampler, Ph.D.

Re: Error while running SparkPi in Hadoop HA

2015-04-21 Thread Fernando O.
Solved. Looks like it's some incompatibility in the build when using -Phadoop-2.4; making the distribution with -Phadoop-provided fixed the issue. On Tue, Apr 21, 2015 at 2:03 PM, Fernando O. fot...@gmail.com wrote: Hi all, I'm wondering if SparkPi works with Hadoop HA (I guess

Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
Here is an example using rows directly: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema Avro or parquet input would likely give you the best performance. On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com

Spark Scala Version?

2015-04-21 Thread ๏̯͡๏
While running my Spark application on 1.3.0 with Scala 2.10.0 I encountered: 15/04/21 09:13:21 ERROR executor.Executor: Exception in task 7.0 in stage 2.0 (TID 28) java.lang.UnsupportedOperationException: tail of empty list at scala.collection.immutable.Nil$.tail(List.scala:339) at

Re: Updating a Column in a DataFrame

2015-04-21 Thread Reynold Xin
You can use df.withColumn("a", df.b) to make column a have the same value as column b. On Mon, Apr 20, 2015 at 3:38 PM, ARose ashley.r...@telarix.com wrote: In my Java application, I want to update the values of a Column in a given DataFrame. However, I realize DataFrames are immutable, and
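
A small Scala sketch of the same withColumn pattern (column names and data are illustrative; sc and sqlContext as predefined in spark-shell):

  import sqlContext.implicits._

  val df = sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("b", "c")

  // DataFrames are immutable: withColumn returns a new DataFrame
  // with an extra column "a" derived from column "b".
  val updated = df.withColumn("a", df("b") * 10)
  updated.show()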

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Xiangrui Meng
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in 1.3. We should allow DataFrames in ALS.train. I will submit a patch. You can use `ALS.train(training.rdd, ...)` for now as a workaround. -Xiangrui On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley jos...@databricks.com wrote:

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Michael Armbrust
This is https://issues.apache.org/jira/browse/SPARK-6231 Unfortunately this is pretty hard to fix, as it's hard for us to differentiate these without aliases. However you can add an alias as follows: from pyspark.sql.functions import * df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1")) On

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread Reynold Xin
You can use the more verbose syntax: d.groupBy("_1").agg(d("_1"), sum("_1").as("sum_1"), sum("_2").as("sum_2")) On Tue, Apr 21, 2015 at 1:06 AM, Justin Yip yipjus...@prediction.io wrote: Hello, I would like to rename a column after aggregation. In the following code, the column name is SUM(_1#179), is

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Joseph Bradley
Hi Ayan, If you want to use DataFrame, then you should use the Pipelines API (org.apache.spark.ml.*) which will take DataFrames: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS In the examples/ directory for ml/, you can find a MovieLensALS
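
A rough Scala sketch of the spark.ml ALS pipeline referred to above (column names follow the MovieLens example; ratings is assumed to be a DataFrame with userId, movieId and rating columns, and the parameter values are placeholders):

  import org.apache.spark.ml.recommendation.ALS

  val als = new ALS()
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
    .setRank(10)
    .setMaxIter(10)

  val model = als.fit(ratings)                // trains directly on the DataFrame
  val predictions = model.transform(ratings)  // adds a "prediction" column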

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Steve Loughran
On 21 Apr 2015, at 17:34, Richard Marscher rmarsc...@localytics.com wrote: - There are System.exit calls built into Spark as of now that could kill your running JVM. We have shadowed some of the most offensive bits within our own application to work around this.

Re: Spark Unit Testing

2015-04-21 Thread James King
Hi Emre, thanks for the help will have a look. Cheers! On Tue, Apr 21, 2015 at 1:46 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello James, Did you check the following resources: - https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming -

Column renaming after DataFrame.groupBy

2015-04-21 Thread Justin Yip
Hello, I would like to rename a column after aggregation. In the following code, the column name is SUM(_1#179); is there a way to rename it to a more friendly name? scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10))) scala> d.groupBy("_1").sum().printSchema root |-- _1: integer

Re: Compression and Hive with Spark 1.3

2015-04-21 Thread Ophir Cohen
Some more info: I'm putting the compression values in hive-site.xml and running a Spark job. hc.sql(set ) returns the expected (compression) configuration, but looking at the logs, it creates the tables without compression: 15/04/21 13:14:19 INFO metastore.HiveMetaStore: 0: create_table:

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-21 Thread N B
We already do have a cron job in place to clean just the shuffle files. However, what I would really like to know is whether there is a proper way of telling Spark to clean up these files once it's done with them? Thanks NB On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele gangele...@gmail.com

Re: Running spark over HDFS

2015-04-21 Thread madhvi
On Tuesday 21 April 2015 12:12 PM, Akhil Das wrote: Your spark master should be spark://swetha:7077 :) Thanks Best Regards On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote: PFA screenshot of my cluster UI Thanks On Monday 20

Re: Parquet Hive table become very slow on 1.3?

2015-04-21 Thread Rex Xiong
We have a similar issue with massive Parquet files. Cheng Lian, could you have a look? 2015-04-08 15:47 GMT+08:00 Zheng, Xudong dong...@gmail.com: Hi Cheng, I tried both these patches, and they still do not seem to resolve my issue. And I found most of the time is spent on this line in

Features scaling

2015-04-21 Thread Denys Kozyr
Hi! I want to normalize features before training logistic regression. I set up the scaler: scaler2 = StandardScaler(withMean=True, withStd=True).fit(features) and apply it to a dataset: scaledData = dataset.map(lambda x: LabeledPoint(x.label, scaler2.transform(Vectors.dense(x.features.toArray()

Re: Can't get SparkListener to work

2015-04-21 Thread Shixiong Zhu
You need to call sc.stop() to wait for the notifications to be processed. Best Regards, Shixiong(Ryan) Zhu 2015-04-21 4:18 GMT+08:00 Praveen Balaji secondorderpolynom...@gmail.com: Thanks Shixiong. I tried it out and it works. If you're looking at this post, here are a few points you may be
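
A small sketch of registering a listener and then stopping the context so queued events get flushed (sc as predefined in spark-shell; the listener body is illustrative):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Log every finished task.
  sc.addSparkListener(new SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      println(s"task ${taskEnd.taskInfo.taskId} finished in stage ${taskEnd.stageId}")
  })

  sc.parallelize(1 to 100, 4).count()
  // Listener events are delivered asynchronously; sc.stop() drains the event bus.
  sc.stop()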

Re: Spark and accumulo

2015-04-21 Thread andy petrella
Hello Madvi, Some work has been done by @pomadchin using the spark notebook, maybe you should come on https://gitter.im/andypetrella/spark-notebook and poke him? There are some discoveries he made that might be helpful to know. Also you can poke @lossyrob from Azavea, he did that for geotrellis

Different behaviour of HiveContext vs. Hive?

2015-04-21 Thread Ophir Cohen
Lately we upgraded our Spark to 1.3. Not surprisingly, along the way I found a few incompatibilities between the versions, which was quite expected. I found one change whose origin I'm interested in understanding. Env: Amazon EMR, Spark 1.3, Hive 0.13, Hadoop 2.4. In Spark 1.2.1 I ran from the code a query such as:

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
Your spark master should be spark://swetha:7077 :) Thanks Best Regards On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote: PFA screenshot of my cluster UI Thanks On Monday 20 April 2015 02:27 PM, Akhil Das wrote: Are you seeing your task being submitted to the UI?

Re: Different behaviour of HiveContext vs. Hive?

2015-04-21 Thread Ophir Cohen
BTW this: hc.sql("show tables").collect works great! On Tue, Apr 21, 2015 at 10:49 AM, Ophir Cohen oph...@gmail.com wrote: Lately we upgraded our Spark to 1.3. Not surprisingly, along the way I found a few incompatibilities between the versions, which was quite expected. I found one change whose origin I'm

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Sean Owen
I think maybe you need more partitions in your input, which might make for smaller tasks? On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone christian.per...@gmail.com wrote: I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large
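
A hedged sketch of that idea (the partition count and ALS parameters are placeholders; raw is assumed to be an existing RDD[Rating]):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // More partitions means each task handles a smaller slice of the input --
  // the idea behind the suggestion above.
  val ratings = raw.repartition(200).cache()

  val model = ALS.trainImplicit(ratings, 20, 10)   // rank = 20, iterations = 10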

Cassandra Connection Issue with Spark-jobserver

2015-04-21 Thread Anand
*I am new to Spark world and Job Server My Code :* package spark.jobserver import java.nio.ByteBuffer import scala.collection.JavaConversions._ import scala.collection.mutable.ListBuffer import scala.collection.immutable.Map import org.apache.cassandra.hadoop.ConfigHelper import

Re: Custom Partitioning Spark

2015-04-21 Thread MUHAMMAD AAMIR
Hi Archit, Thanks a lot for your reply. I am using rdd.partitions.length to check the number of partitions. rdd.partitions returns the array of partitions. I would like to add one more question here: do you have any idea how to get the objects in each partition? Further, is there any way to figure

Re: implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Ted Yu
Have you tried the following ? import sqlContext._ import sqlContext.implicits._ Cheers On Tue, Apr 21, 2015 at 7:54 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I tried to convert an RDD to a data frame using the example codes on spark website case class
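
For reference, a minimal sketch of the pattern (as in spark-shell; note that SQLContext.implicits only exists from Spark 1.3 on, and the sqlContext reference must be a stable val for the import to compile):

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)   // must be a val: the import below needs a stable identifier
  import sqlContext.implicits._

  val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
  people.printSchema()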

implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Wang, Ningjun (LNG-NPV)
I tried to convert an RDD to a data frame using the example code on the Spark website: case class Person(name: String, age: Int) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val people =

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
Sorry, my code actually was df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3') But in Spark 1.4.0 this does not seem to make any difference anyway and the problem is the same with both versions. On 2015-04-21 17:04, ayan guha wrote: your code should be df_one =

Re: Clustering algorithms in Spark

2015-04-21 Thread Jeetendra Gangele
The problem with k-means is that we have to define the number of clusters, which I don't want in this case. So I am thinking of something like hierarchical clustering. Any ideas and suggestions? On 21 April 2015 at 20:51, Jeetendra Gangele gangele...@gmail.com wrote: I have a requirement in which I want to

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
your code should be df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3') Your current code is generating a tuple, and of course df_1 and df_2 are different, so the join is yielding a cartesian product. Best Ayan On Wed, Apr 22, 2015 at 12:42 AM, Karlson ksonsp...@siberie.de wrote:

Re: Custom Partitioning Spark

2015-04-21 Thread ayan guha
Are you looking for? *mapPartitions*(*func*): Similar to map, but runs separately on each partition (block) of the RDD, so *func* must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. *mapPartitionsWithIndex*(*func*): Similar to mapPartitions, but also provides *func* with an
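
A small Scala sketch of both operators, which also shows a way to inspect which objects ended up in which partition (toy data; sc as in spark-shell):

  import org.apache.spark.HashPartitioner

  val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)))
              .partitionBy(new HashPartitioner(3))

  // mapPartitions: func runs once per partition over an Iterator of its elements.
  val perPartitionSums = rdd.mapPartitions(iter => Iterator(iter.map(_._2).sum))
  perPartitionSums.collect().foreach(println)

  // mapPartitionsWithIndex additionally passes in the partition index,
  // handy for seeing which records landed where.
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    iter.map(kv => s"partition $idx -> $kv")
  }.collect().foreach(println)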

Re: Meet Exception when learning Broadcast Variables

2015-04-21 Thread Ted Yu
Does line 27 correspond to brdcst.value ? Cheers On Apr 21, 2015, at 3:19 AM, donhoff_h 165612...@qq.com wrote: Hi, experts. I wrote a very little program to learn how to use Broadcast Variables, but met an exception. The program and the exception are listed as following. Could

Problem with using Spark ML

2015-04-21 Thread Staffan
Hi, I've written an application that performs some machine learning on some data. I've validated that the data _should_ give a good output with a decent RMSE by using Lib-SVM: Mean squared error = 0.00922063 (regression) Squared correlation coefficient = 0.9987 (regression) When I try to use

Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Sourav Chandra
Hi, We are building a Spark Streaming application which reads from Kafka, does updateStateByKey based on the received message type and finally stores into Redis. After running for a few seconds the executor process gets killed with an OutOfMemory error. The code snippet is below:

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread ayan guha
Hi, There are two ways of doing it. 1. Using SQL - this method directly creates another DataFrame object. 2. Using methods of the DF object, but in that case you have to provide the schema through a Row object. In this case you need to explicitly call createDataFrame again, which will infer the

Re: Custom Partitioning Spark

2015-04-21 Thread Archit Thakur
Hi, This should work. How are you checking the no. of partitions? Thanks and Regards, Archit Thakur. On Mon, Apr 20, 2015 at 7:26 PM, mas mas.ha...@gmail.com wrote: Hi, I aim to do custom partitioning on a text file. I first convert it into a pair RDD and then try to use my custom

Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
Hi, I am getting an error. Also, I am getting an error in the mllib ALS.train function when passing a DataFrame (do I need to convert the DF to an RDD?). Code: training = ssc.sql("select userId,movieId,rating from ratings where partitionKey 6").cache() print type(training) model =

RE:RE:maven compile error

2015-04-21 Thread Shuai Zheng
I have a similar issue (I failed on the spark core project but with the same exception as you). Then I followed the steps below (I am working on Windows): delete the Maven repository and re-download all dependencies. The issue sounds like a corrupted jar that can’t be opened by Maven. Other than this,

Re: Understanding the build params for spark with sbt.

2015-04-21 Thread Sree V
Hi Shiyao, From the same page you referred to: Maven is the official recommendation for packaging Spark, and is the “build of reference”. But SBT is supported for day-to-day development since it can provide much faster iterative compilation. More advanced developers may wish to use SBT. For maven,

Re: HiveContext setConf seems not stable

2015-04-21 Thread Michael Armbrust
As a workaround, can you call getConf first before any setConf? On Tue, Apr 21, 2015 at 1:58 AM, Ophir Cohen oph...@gmail.com wrote: I think I encountered the same problem; I'm trying to turn on Hive compression. I have the following lines: def initHiveContext(sc: SparkContext):
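
A sketch of that workaround as I read it (hc is an existing HiveContext; the property names are standard Hive/Hadoop compression settings, and whether the extra getConf is needed depends on the bug being discussed):

  // Touch the configuration once before setting values on the HiveContext.
  hc.getConf("hive.exec.compress.output", "false")
  hc.setConf("hive.exec.compress.output", "true")
  hc.setConf("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")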

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Jean-Pascal Billaud
At this point I am assuming that nobody has an idea... I am still going to give it a last shot just in case it was missed by some people :) Thanks, On Mon, Apr 20, 2015 at 2:20 PM, Jean-Pascal Billaud j...@tellapart.com wrote: Hey, so I start the context at the very end when all the piping is

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks Michael! I have tried applying my schema programmatically but I didn't get any improvement in performance :( Could you point me to some code examples using Avro please? Many thanks again! Renato M. 2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com: Here is an example

Re: Features scaling

2015-04-21 Thread DB Tsai
Hi Denys, I don't see any issue in your Python code, so maybe there is a bug in the Python wrapper. If it's in Scala, I think it should work. BTW, LogisticRegressionWithLBFGS does the standardization internally, so you don't need to do it yourself. It's worth giving it a try! Sincerely, DB Tsai

Re: Spark Performance on Yarn

2015-04-21 Thread hnahak
Try --executor-memory 5g, because you have 8 GB RAM in each machine. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22603.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Olivier Girardot
Hi Sourav, Can you post your updateFunc as well please? Regards, Olivier. On Tue, 21 Apr 2015 at 12:48, Sourav Chandra sourav.chan...@livestream.com wrote: Hi, We are building a Spark Streaming application which reads from Kafka, does updateStateByKey based on the received message type

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
Thank you all. On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote: SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in 1.3. We should allow DataFrames in ALS.train. I will submit a patch. You can use `ALS.train(training.rdd, ...)` for now as a workaround. -Xiangrui

Re: how to make a spark cluster ?

2015-04-21 Thread haihar nahak
I did some performance checks on soc-LiveJournal PageRank between my local machine (8 cores, 16 GB) in local mode and my small cluster (4 nodes, 12 cores, 40 GB), and I found cluster mode is way faster than local mode. So I'm confused. No. of iterations --- local mode (in mins) --- cluster mode (in mins): 1

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Jean-Pascal Billaud
Sure. But in general, I am assuming this "Graph is unexpectedly null when DStream is being serialized" must mean something. Under which circumstances would such an exception trigger? On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com wrote: Yeah, I am not sure what is going on.

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
It is kind of unexpected; I can't imagine a real scenario under which it should trigger. But obviously I am missing something :) TD On Tue, Apr 21, 2015 at 4:23 PM, Jean-Pascal Billaud j...@tellapart.com wrote: Sure. But in general, I am assuming this "Graph is unexpectedly null when DStream is

Re: sparksql - HiveConf not found during task deserialization

2015-04-21 Thread Manku Timma
Akhil, Thanks for the suggestions. I tried out sc.addJar, --jars, --conf spark.executor.extraClassPath and none of them helped. I added stuff into compute-classpath.sh. That did not change anything. I checked the classpath of the running executor and made sure that the hive jars are in that dir.

problem writing to s3

2015-04-21 Thread Daniel Mahler
I am having a strange problem writing to S3 that I have distilled to this minimal example: def jsonRaw = s"${outprefix}-json-raw" def jsonClean = s"${outprefix}-json-clean" val txt = sc.textFile(inpath)//.coalesce(shards, false) txt.count val res = txt.saveAsTextFile(jsonRaw) val txt2 =

Efficient saveAsTextFile by key, directory for each key?

2015-04-21 Thread Arun Luthra
Is there an efficient way to save an RDD with saveAsTextFile in such a way that the data gets shuffled into separate directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table.) Suppose you have: case class MyData(val0: Int, val1: String, directory_name:
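
One commonly used approach for this (not from this thread) is Hadoop's MultipleTextOutputFormat, sketched below; the directory_name field comes from the question, everything else is illustrative:

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  // Routes each record to <output>/<key>/part-NNNNN, i.e. one directory per key.
  class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.toString + "/" + name
    override def generateActualKey(key: Any, value: Any): Any =
      NullWritable.get()   // keep the key out of the file contents
  }

  case class MyData(val0: Int, val1: String, directory_name: String)

  val data = sc.parallelize(Seq(MyData(1, "a", "dirA"), MyData(2, "b", "dirB")))
  data.map(d => (d.directory_name, d.toString))
      .saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[String], classOf[KeyBasedOutput])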

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
Yeah, I am not sure what is going on. The only way to figure it out is to take a look at the disassembled bytecode using javap. TD On Tue, Apr 21, 2015 at 1:53 PM, Jean-Pascal Billaud j...@tellapart.com wrote: At this point I am assuming that nobody has an idea... I am still going to give it a last

Re: Number of input partitions in SparkContext.sequenceFile

2015-04-21 Thread Archit Thakur
Hi, It should generate the same no. of partitions as the no. of splits. How did you check the no. of partitions? Also, please paste your file size and your hdfs-site.xml and mapred-site.xml here. Thanks and Regards, Archit Thakur. On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wenlei@gmail.com wrote: Hi,
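
For reference, a minimal sketch of reading a SequenceFile with an explicit minimum partition count and checking the result (path, key/value classes and the count are placeholders that depend on how the file was written):

  import org.apache.hadoop.io.{LongWritable, Text}

  // minPartitions is only a lower bound; the actual number also depends on the
  // file's block/split layout, which is why it can differ from expectations.
  val seq = sc.sequenceFile("/data/input.seq", classOf[LongWritable], classOf[Text], 16)
  println(seq.partitions.length)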

Meet Exception when learning Broadcast Variables

2015-04-21 Thread donhoff_h
Hi, experts. I wrote a very small program to learn how to use Broadcast Variables, but met an exception. The program and the exception are listed as follows. Could anyone help me to solve this problem? Thanks! **My Program is as follows** object TestBroadcast02 {
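
For reference, a minimal broadcast-variable sketch (this is not the poster's program, which is not shown in full here; the lookup map is toy data, and sc is as in spark-shell):

  // Ship a small lookup table to every executor once, instead of with every task.
  val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three")
  val brdcst = sc.broadcast(lookup)

  // .value must be read inside the closure, i.e. on the executors.
  val translated = sc.parallelize(Seq(1, 2, 3, 2))
                     .map(n => brdcst.value.getOrElse(n, "unknown"))
  translated.collect().foreach(println)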

Re: Spark Unit Testing

2015-04-21 Thread Emre Sevinc
Hello James, Did you check the following resources: - https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming - http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs -- Emre Sevinç

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks for the hints guys! Much appreciated! Even if I just do something like: Select * from tableX where attribute1 5 I see similar behaviour. @Michael Could you point me to any sample code that uses Spark's Rows? We are at a phase where we can actually change our JavaBeans for something

Spark Unit Testing

2015-04-21 Thread James King
I'm trying to write some unit tests for my Spark code. I need to pass a JavaPairDStream<String, String> to my Spark class. Is there a way to create a JavaPairDStream using the Java API? Also, is there a good resource that covers an approach (or approaches) for unit testing using Java? Regards jk

Error while running SparkPi in Hadoop HA

2015-04-21 Thread Fernando O.
Hi all, I'm wondering if SparkPi works with Hadoop HA (I guess it should). Hadoop's pi example works great on my cluster, so after getting that done I installed Spark, and in the worker log I'm seeing two problems that might be related. Versions: Hadoop 2.6.0, Spark 1.3.1. I'm

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
You are correct. Just found the same thing. You are better off with SQL, then. userSchemaDF = ssc.createDataFrame(userRDD) userSchemaDF.registerTempTable("users") #print userSchemaDF.take(10) #SQL API works as expected sortedDF = ssc.sql("SELECT userId,age,gender,work from users order

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Richard Marscher
Could you possibly describe what you are trying to learn how to do in more detail? Some basics of submitting programmatically: - Create a SparkContext instance and use that to build your RDDs - You can only have 1 SparkContext per JVM you are running, so if you need to satisfy concurrent job
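
A minimal sketch of that programmatic setup (master URL, app name and jar path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // One SparkContext per JVM: build it once and share it across concurrent jobs.
  val conf = new SparkConf()
    .setAppName("embedded-driver")
    .setMaster("spark://master-host:7077")
    .setJars(Seq("/path/to/application.jar"))   // ship the application classes to the executors

  val sc = new SparkContext(conf)
  val result = sc.parallelize(1 to 100).map(_ * 2).sum()
  println(result)
  sc.stop()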