Sorting the dataframe

2016-03-04 Thread Angel Angel
Hello sir, I want to sort the following table as per the *count*: value count 52639 22 75243 4 13 55 56 5 185463 45 324364 32. So first I convert my dataframe to an RDD to sort the table: val k = table.rdd. Then I convert the RDD array into key-value pairs: val s = k.take(6) val rdd = s.map(x=> x(

Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread ashokkumar rajendran
Thanks for the clarification Xinh. On Fri, Mar 4, 2016 at 12:30 PM, Xinh Huynh wrote: > Hi Ashok, > > On the Spark SQL side, when you create a dataframe, it will have a schema > (each column has a type such as Int or String). Then when you save that > dataframe as parquet format, Spark transla

Re: Sorting the dataframe

2016-03-04 Thread Mich Talebzadeh
Try this example, similar to yours. A DF should be sufficient. val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16)) // Sort option 1 using tempTable val b = a.toDF("Name","score").registerTempTable("tmp") sql("select Name,score from tmp order by score desc").show // Sort option 2 wi
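
For reference, a minimal runnable sketch of both options Mich describes, assuming a SQLContext named sqlContext is in scope (the column names follow his example):

import org.apache.spark.sql.functions.desc
import sqlContext.implicits._

val df = Seq(("Mich", 20), ("Christian", 18), ("James", 13), ("Richard", 16)).toDF("Name", "score")

// Sort option 1: register a temporary table and order it in SQL
df.registerTempTable("tmp")
sqlContext.sql("select Name, score from tmp order by score desc").show()

// Sort option 2: sort the DataFrame directly, no temp table needed
df.orderBy(desc("score")).show()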

Facing issue with floor function in spark SQL query

2016-03-04 Thread ashokkumar rajendran
Hi, I load a json file that has a timestamp (as a long in milliseconds) and several other attributes. I would like to group them by 5 minutes and store them as separate files. I am facing a couple of problems here: 1. Using the Floor function in the select clause (to bucket by 5 mins) gives me an error saying "java.

Re: Sorting the dataframe

2016-03-04 Thread Mohannad Ali
You can call orderBy on the dataframe directly: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame On Mar 4, 2016 09:18, "Angel Angel" wrote: > hello sir, > > i want to sort the following table as per the *count* > > value count > 52639 22 > 75243 4 > 13 55 >

Re: Sorting the dataframe

2016-03-04 Thread Mohammad Tariq
You could try DataFrame.sort() to sort your data based on a column. Tariq, Mohammad about.me/mti On Fri, Mar 4, 2016 at 1:48 PM, Angel Angel wrote: > hello sir, > > i want to sort the following table as per the *count* > > value count

[Kinesis] multiple KinesisRecordProcessor threads.

2016-03-04 Thread Li Ming Tsai
Hi, @chris @tdas Referring to the latest integration documentation, it states the following: "A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads." But looking at the API and the example, each time we call Ki

Number Of Jobs In Spark Streaming

2016-03-04 Thread Sandip Mehta
Hi All, Is it fair to say that the number of jobs in a given Spark Streaming application is equal to the number of actions in the application? Regards Sandeep - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional co
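
Roughly, yes: each action submits at least one job (a few actions, such as those that first sample partitions, can submit more than one). A small illustration, assuming an existing SparkContext sc; the output path is made up:

val rdd = sc.parallelize(1 to 1000).map(_ * 2) // transformations only: no job yet
val n = rdd.count()                            // action: job 1
rdd.saveAsTextFile("/tmp/doubled")             // action: job 2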

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread ayan guha
Most likely you are missing an import of org.apache.spark.sql.functions. In any case, you can write your own function for floor and use it as a UDF. On Fri, Mar 4, 2016 at 7:34 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi, > > I load json file that has timestamp (as long in
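
A hedged sketch of both suggestions: the built-in floor from org.apache.spark.sql.functions on the DataFrame side, or a registered UDF so the 5-minute bucketing can be used from a plain SQL query. The table name events and column ts are assumptions, not from the original post:

import org.apache.spark.sql.functions.floor

// DataFrame API: bucket a millisecond timestamp into 5-minute (300000 ms) windows
val bucketed = df.withColumn("bucket", floor(df("ts") / 300000) * 300000)

// SQL route: register a UDF and call it inside the query
sqlContext.udf.register("toBucket", (ts: Long) => (ts / 300000L) * 300000L)
df.registerTempTable("events")
sqlContext.sql("select toBucket(ts) as bucket, count(*) as cnt from events group by toBucket(ts)")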

Re: How to display the web ui when running Spark on YARN?

2016-03-04 Thread Steve Loughran
On 3 Mar 2016, at 09:17, Shady Xu <shad...@gmail.com> wrote: Hi all, I am running Spark in yarn-client mode, but every time I access the web UI, the browser redirects me to one of the worker nodes and shows nothing. The url looks like http://hadoop-node31.company.com:8088/proxy/applica

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-04 Thread ayan guha
Hi I doubt if that is a correct use of HBase. In case you have an analytics use case, you would probably be better off using Hive. On Fri, Mar 4, 2016 at 3:09 AM, Nirav Patel wrote: > It's write once table. Mainly used for read/query intensive application. > We in fact generate comma separate

Re: Sorting the dataframe

2016-03-04 Thread Gourav Sengupta
Hi, I completely agree with the use of dataframes for most operations in SPARK, unless you have a custom algorithm or algorithms that need the use of RDDs. Databricks have taken a cue from Apache Flink (I think) and rewritten Tungsten as the base engine that drives dataframes, so there is performance

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread ashokkumar rajendran
Hi Ayan, Thanks for the response. I am using SQL query (not Dataframe). Could you please explain how I should import this sql function to it? Simply importing this class to my driver code does not help here. Many functions that I need are already there in the sql.functions so I do not want to rew

Spark 1.5.2 -Better way to create custom schema

2016-03-04 Thread Divya Gehlot
Hi, I have a data set in HDFS. Is there any better way to define a custom schema for a data set having 100+ fields of different data types? Thanks, Divya
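
One common approach is to build the StructType programmatically from a list of (name, type) pairs rather than writing 100+ StructFields by hand. A hedged sketch; the field list, file format and path below are made up for illustration (the spark-csv package is assumed for a delimited file):

import org.apache.spark.sql.types._

// In practice this list could be generated from a metadata/DDL file
val fields = Seq(("id", LongType), ("name", StringType), ("price", DoubleType)) // ... 100+ more

val schema = StructType(fields.map { case (n, t) => StructField(n, t, nullable = true) })

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("hdfs:///path/to/dataset")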

DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Hi, I've come across some strange behaviour with Spark 1.6.0. In the code below, the filtering by "eventName" only seems to work if I call .cache on the resulting DataFrame. If I don't do this, the code crashes inside the UDF because it processes an event that the filter should get rid of. A

Re: Mapper side join with DataFrames API

2016-03-04 Thread Deepak Gopalakrishnan
Have added this to SO, can you guys share any thoughts ? http://stackoverflow.com/questions/35795518/spark-1-6-spills-to-disk-even-when-there-is-enough-memory

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread Ted Yu
If you can reproduce the following with a unit test, I suggest you open a JIRA. Thanks > On Mar 4, 2016, at 4:01 AM, James Hammerton wrote: > > Hi, > > I've come across some strange behaviour with Spark 1.6.0. > > In the code below, the filtering by "eventName" only seems to work if I > ca

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread Ajay Chander
Hi Ashok, Try using HiveContext instead of SQLContext. I suspect SQLContext does not have that functionality. Let me know if it works. Thanks, Ajay On Friday, March 4, 2016, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi Ayan, > > Thanks for the response. I am using SQL query

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
Nice article about Parquet *with* Avro : - https://dzone.com/articles/understanding-how-parquet - http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ Nice video from the good folks of Cloudera for the *differences* between "Avrow" and Parquet - https://www.youtube.com/watch?v=AY1

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread Mich Talebzadeh
Spark SQL has both FLOOR and CEILING functions: spark-sql> select FLOOR(11.95),CEILING(11.95); 11.0 12.0 Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Sure thing, I'll see if I can isolate this. Regards. James On 4 March 2016 at 12:24, Ted Yu wrote: > If you can reproduce the following with a unit test, I suggest you open a > JIRA. > > Thanks > > On Mar 4, 2016, at 4:01 AM, James Hammerton wrote: > > Hi, > > I've come across some strange be

Re: Spark 1.5 on Mesos

2016-03-04 Thread Ashish Soni
It did not help, same error. Is this the issue I am running into: https://issues.apache.org/jira/browse/SPARK-11638 *Warning: Local jar /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar does not exist, skipping.* java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi On Thu,

Spark SQL - udf with entire row as parameter

2016-03-04 Thread Nisrina Luthfiyati
Hi all, I'm using spark sql in python and want to write a udf that takes an entire Row as the argument. I tried something like: def functionName(row): ... return a_string udfFunctionName=udf(functionName, StringType()) df.withColumn('columnName', udfFunctionName('*')) but this gives an e

Issue with sbt failing on amplab succinct

2016-03-04 Thread Mich Talebzadeh
Hi, I have some simple Scala code that I want to use in an sbt project. It is pretty simple but imports the following: // Import SuccinctRDD import edu.berkeley.cs.succinct._ name := "Simple Project" version := "1.0" scalaVersion := "2.10.5" libraryDependencies += "org.apache.spark" %% "spark-

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Ted Yu
Can you show the complete stack trace? It was not clear which class's definition was not found. On Fri, Mar 4, 2016 at 6:46 AM, Mich Talebzadeh wrote: > Hi, > > I have a simple Scala code that I want to use it in an sbt project. > > It is pretty simple but imports the following: > > // Import

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Mich Talebzadeh
here you are sbt package [info] Set current project to Simple Project (in build file:/home/hduser/dba/bin/scala/) [success] Total time: 1 s, completed Mar 4, 2016 2:50:16 PM hduser@rhes564::/home/hduser/dba/bin/scala> spark-submit --class "SimpleApp" --master local target/scala-2.10/simple-project

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Ted Yu
Maybe leave a comment on http://spark-packages.org/package/amplab/succinct ? On Fri, Mar 4, 2016 at 7:22 AM, Mich Talebzadeh wrote: > here you are > > sbt package > [info] Set current project to Simple Project (in build > file:/home/hduser/dba/bin/scala/) > [success] Total time: 1 s, completed M

Re: Spark Streaming

2016-03-04 Thread anbucheeralan
Hi, Were you able to solve this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-tp24058p26396.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Luciano Resende
Have you tried adding it as --packages at the beginning of your spark-submit? --packages amplab:succinct:0.1.5 Also, I would usually have the Spark dependencies as "provided" in the build.sbt On Fri, Mar 4, 2016 at 6:46 AM, Mich Talebzadeh wrote: > Hi, > > I have a simple Scala code that I wa
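
A sketch of what that combination might look like; the Spark version below is an assumption, and the spark-submit line follows the one Mich posts later in the thread:

// build.sbt -- Spark artifacts marked "provided" since spark-submit supplies them at runtime
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0" % "provided"

Then build with sbt package and supply the succinct dependency at submit time rather than in build.sbt: spark-submit --packages amplab:succinct:0.1.5 --class "SimpleApp" --master local target/scala-2.10/simple-project_2.10-1.0.jar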

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Mich Talebzadeh
Thanks Luciano. This seems to work now although some spurious errors! *spark-submit --packages amplab:succinct:0.1.5 --class "SimpleApp" --master local target/scala-2.10/simple-project_2.10-1.0.jar* Ivy Default Cache set to: /home/hduser/.ivy2/cache The jars for the packages stored in: /home/hdu

1.6.0 spark.sql datetime conversion problem

2016-03-04 Thread Michal Vince
Hi guys, I'm using Spark 1.6.0 and I'm not sure if I found a bug or if I'm doing something wrong. I'm playing with dataframes and I'm converting ISO 8601 timestamps with millis to my timezone - which is Europe/Bratislava - with the from_utc_timestamp function from spark.sql.functions. The problem is that Europe/
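
For context, the usage in question looks roughly like this (df and the column name ts are assumptions):

import org.apache.spark.sql.functions.from_utc_timestamp

// Convert a UTC timestamp column into Europe/Bratislava local time
val withLocal = df.withColumn("ts_local", from_utc_timestamp(df("ts"), "Europe/Bratislava"))
withLocal.select("ts", "ts_local").show()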

Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread Ryan Blue
Hi Ashok, The schema for your data comes from the data frame you're using in Spark and resolved with a Hive table schema if you are writing to one. For encodings, you don't need to configure them because they are selected for your data automatically. For example, Parquet will try dictionary-encodi
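
In other words, no separate schema file needs to be supplied; a minimal sketch (paths are illustrative):

// Writing: the Parquet schema is derived from the DataFrame's own schema
df.write.parquet("hdfs:///tmp/events_parquet")

// Reading back: the schema is picked up from the Parquet file footers
val back = sqlContext.read.parquet("hdfs:///tmp/events_parquet")
back.printSchema()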

Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Yong Zhang
When I tried to compile Spark 1.5.2 with -Phive-0.12.0, maven gave me back an error that the profile doesn't exist any more. But when I read the Spark SQL programming guide here: http://spark.apache.org/docs/1.5.2/sql-programming-guide.html it keeps mentioning that Spark 1.5.2 still can work with Hive

Installing Spark on Mac

2016-03-04 Thread Aida
Hi all, I am a complete novice and was wondering whether anyone would be willing to provide me with a step by step guide on how to install Spark on a Mac; on standalone mode btw. I downloaded a prebuilt version, the second version from the top. However, I have not installed Hadoop and am not plan

Re: Installing Spark on Mac

2016-03-04 Thread Simon Hafner
I'd try `brew install spark` or `apache-spark` and see where that gets you. https://github.com/Homebrew/homebrew 2016-03-04 21:18 GMT+01:00 Aida : > Hi all, > > I am a complete novice and was wondering whether anyone would be willing to > provide me with a step by step guide on how to install Spar

Re: Installing Spark on Mac

2016-03-04 Thread Eduardo Costa Alfaia
Hi Aida Run only "build/mvn -DskipTests clean package" BR Eduardo Costa Alfaia Ph.D. Student in Telecommunications Engineering Università degli Studi di Brescia Tel: +39 3209333018 On 3/4/16, 16:18, "Aida" wrote: >Hi all, > >I am a complete novice and was wondering whether anyone would

Re: Installing Spark on Mac

2016-03-04 Thread Vishnu Viswanath
Installing Spark on a Mac is similar to how you install it on Linux. I use a Mac and have written a blog post on how to install Spark; here is the link: http://vishnuviswanath.com/spark_start.html Hope this helps. On Fri, Mar 4, 2016 at 2:29 PM, Simon Hafner wrote: > I'd try `brew install spark` or `ap

Re: Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Michael Armbrust
Read the docs at the link that you pasted: http://spark.apache.org/docs/latest/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore Spark will always compile against the same version of Hive (1.2.1), but it can dynamically load jars to speak to other versions. On Fri,
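
Per that documentation, the relevant settings are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars; a hedged example for spark-defaults.conf (the jar paths are placeholders):

spark.sql.hive.metastore.version   0.12.0
spark.sql.hive.metastore.jars      /path/to/hive-0.12-jars/*:/path/to/hadoop-client-jars/*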

S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-04 Thread Jelez Raditchkov
Working on a streaming job with DirectParquetOutputCommitter to S3. I need to use PartitionBy and hence SaveMode.Append. Apparently when using SaveMode.Append Spark automatically defaults to the default parquet output committer and ignores DirectParquetOutputCommitter. My problems are: 1. the copying

How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

2016-03-04 Thread Jelez Raditchkov
What is the best approach to use getOrCreate for a streaming job with HiveContext? It seems for SQLContext the recommended approach is to use getOrCreate: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations val sqlContext = SQLContext.getOrCreate(rdd.s

Re: Spark SQL - udf with entire row as parameter

2016-03-04 Thread Michael Armbrust
You have to use SQL to call it (but you will be able to do it with dataframes in Spark 2.0 due to a better parser). You need to construct a struct(*) and then pass that to your function since a function must have a fixed number of arguments. Here is an example
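
A hedged sketch of that pattern in Scala; the UDF name rowToString and the temp table name t are made up, and the struct column arrives inside the UDF as a Row:

import org.apache.spark.sql.Row

// Register a UDF that receives the whole row packed into a struct
sqlContext.udf.register("rowToString", (r: Row) => r.mkString("|"))

df.registerTempTable("t")
sqlContext.sql("select *, rowToString(struct(*)) as columnName from t")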

Re: Spark 1.5.2 : change datatype in programaticallly generated schema

2016-03-04 Thread Michael Armbrust
Change the type of a subset of the columns using withColumn, after you have loaded the DataFrame. Here is an example. On Thu, Mar 3,
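
A short sketch of that approach; the column names and target types here are hypothetical:

import org.apache.spark.sql.types.{IntegerType, DoubleType}

// Cast only the columns whose generated/inferred type needs changing
val fixed = df
  .withColumn("age",   df("age").cast(IntegerType))
  .withColumn("price", df("price").cast(DoubleType))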

Best way to merge files from streaming jobs

2016-03-04 Thread Jelez Raditchkov
My streaming job is creating files on S3. The problem is that those files end up very small if I just write them to S3 directly. This is why I use coalesce() to reduce the number of files and make them larger. However, coalesce shuffles data and my job processing time ends up higher than sparkBatc
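
For reference, the pattern being described is roughly the following (the partition count and S3 path are illustrative):

// Fewer, larger output files per batch: reduce the partition count before writing
df.coalesce(4)
  .write
  .mode("append")
  .parquet("s3n://my-bucket/stream-output/")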

Spark reduce serialization question

2016-03-04 Thread James Jia
I'm running a distributed KMeans algorithm with 4 executors. I have a RDD[Data]. I use mapPartition to run a learner on each data partition, and then call reduce with my custom model reduce function to reduce the result of the model to start a new iteration. The model size is around ~330 MB. I wo

Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi, I have a simple Scala program as below import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.sql.SQLContext object Sequence { def main(args: Array[String]) { val conf = new SparkConf().setAppName("Sequence")

SSL support for Spark Thrift Server

2016-03-04 Thread Sourav Mazumder
Hi All, While starting the Spark Thrift Server I don't see any option to start it with SSL support. Is that support currently there ? Regards, Sourav

How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread jelez
What is the best approach to use getOrCreate for streaming job with HiveContext. It seems for SQLContext the recommended approach is to use getOrCreate: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations val sqlContext = SQLContext.getOrCreate(rdd

Best way to merge files from streaming jobs‏ on S3

2016-03-04 Thread jelez
My streaming job is creating files on S3. The problem is that those files end up very small if I just write them to S3 directly. This is why I use coalesce() to reduce the number of files and make them larger. However, coalesce shuffles data and my job processing time ends up higher than sparkBatc

spark driver in docker

2016-03-04 Thread yanlin wang
We would like to run multiple Spark drivers in docker containers. Any suggestions for the port exposure and network settings for docker so the driver is reachable by the worker nodes? --net="host" is the last thing we want to do. Thx Yanlin

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Can you add the following into your code ? import sqlContext.implicits._ On Fri, Mar 4, 2016 at 1:14 PM, Mich Talebzadeh wrote: > Hi, > > I have a simple Scala program as below > > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import org.apache.spark.SparkConf

Use cases for kafka direct stream messageHandler

2016-03-04 Thread Cody Koeninger
Wanted to survey what people are using the direct stream messageHandler for, besides just extracting key / value / offset. Would your use case still work if that argument was removed, and the stream just contained ConsumerRecord objects (http://kafka.apache.org/090/javadoc/org/apache/kafka/clients

Re: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Ted Yu
bq. However the method does not seem inherited to HiveContext. Can you clarify the above observation ? HiveContext extends SQLContext . On Fri, Mar 4, 2016 at 1:23 PM, jelez wrote: > What is the best approach to use getOrCreate for streaming job with > HiveContext. > It seems for SQLContext the

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi Ted, I am getting the following error after adding that import [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:5: not found: object sqlContext [error] import sqlContext.implicits._ [error]^ [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.s

FW: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Jelez Raditchkov
From: je...@hotmail.com To: yuzhih...@gmail.com Subject: RE: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏ Date: Fri, 4 Mar 2016 14:09:20 -0800 The code below is from the sources, is this what you asked? class HiveContext

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Can you show your code snippet ? Here is an example: val sqlContext = new SQLContext(sc) import sqlContext.implicits._ On Fri, Mar 4, 2016 at 1:55 PM, Mich Talebzadeh wrote: > Hi Ted, > > I am getting the following error after adding that import > > [error] > /home/hduser/dba/bin/s

RE: Error building a self contained Spark app

2016-03-04 Thread Jelez Raditchkov
Ok this is what I have: object SQLHiveContextSingleton { @transient private var instance: HiveContext = _ def getInstance(sparkContext: SparkContext): HiveContext = { synchronized { if (instance == null || sparkContext.isStopped) { instance = new HiveCo
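
The complete shape of that singleton (following the SQLContext getOrCreate example in the streaming guide, and keeping the isStopped check from the snippet above) would be roughly:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object SQLHiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sparkContext: SparkContext): HiveContext = synchronized {
    if (instance == null || sparkContext.isStopped) {
      instance = new HiveContext(sparkContext)
    }
    instance
  }
}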

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi Ted, This is my code import org.apache.spark.SparkConf import org.apache.spark.sql.Row import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.types._ import org.apache.spark.sql.SQLContext // object Sequence { def main(args: Array[String]) { val conf = new SparkConf().set

Re: Using netlib-java in Spark 1.6 on linux

2016-03-04 Thread Chris Fregly
I have all of this pre-wired up and Docker-ized for your instant enjoyment here: https://github.com/fluxcapacitor/pipeline/wiki You can review the Dockerfile for the details (Ubuntu 14.04-based). This is easy BREEZEy. Also, here

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-04 Thread Chris Fregly
Hot off the presses... Here's the closest we have to Python GraphX (and Cypher) support: https://databricks.com/blog/2016/03/03/introducing-graphframes.html This was demo'd at Spark Summit NYC 2016. I'm migrating all of my GraphX code to this now. Reminder that GraphX is a batch graph analytics

RethinkDB as a Datasource

2016-03-04 Thread pnakibar
Hi, I see that there is no way to use RethinkDB as a datasource in Spark. I really like this database and use it every day; is there any way for me to write a plugin so I can use it in Apache Spark? I'm really interested in writing this plugin and contributing to the Spark community in general.

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
After: val sqlContext = new org.apache.spark.sql.SQLContext(sc) Please add: import sqlContext.implicits._ On Fri, Mar 4, 2016 at 3:03 PM, Mich Talebzadeh wrote: > Hi Ted, > > This is my code > > import org.apache.spark.SparkConf > import org.apache.spark.sql.Row > import org.apache.s

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
This is what you need: val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ > On Mar 4, 2016, at 11:03 PM, Mich Talebzadeh > wrote: > > Hi Ted, > > This is my code > > import org.apache.spark.SparkConf > impor

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Thanks. It is like a war of attrition. I always thought that you add imports before the class itself, not within the class? What is the reason for it please? This is my code import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.Row import org.apache.spar

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Please import: import org.apache.spark.sql.functions._ On Fri, Mar 4, 2016 at 3:35 PM, Mich Talebzadeh wrote: > thanks. It is like war of attrition. I always thought that you add import > before the class itself not within the class? w3hat is the reason for it > please? > > this is my code > >

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
That is because an instance of org.apache.spark.sql.SQLContext doesn’t exist in the current context and is required before you can use any of its implicit methods. As Ted mentioned importing org.apache.spark.sql.functions._ will take care of the below error. > On Mar 4, 2016, at 11:35 PM, Mich
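
Putting the pieces of this thread together, a self-contained skeleton might look like this (the sample data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object Sequence {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Sequence")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._            // must come after the sqlContext val exists
    import org.apache.spark.sql.functions._  // for desc, floor, etc.

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value") // toDF comes from the implicits
    df.orderBy(desc("value")).show()

    sc.stop()
  }
}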

Re: RethinkDB as a Datasource

2016-03-04 Thread Burak Yavuz
Hi, You can always write it as a data source and share it on Spark Packages. There are many data source connectors available already: http://spark-packages.org/?q=tags%3A%22Data%20Sources%22 Best, Burak On Fri, Mar 4, 2016 at 3:19 PM, pnakibar wrote: > Hi, > I see that there is no way to use R
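
A bare-bones skeleton of what such a data source looks like against the Spark 1.x sources API; all RethinkDB specifics (connections, query push-down) are omitted and the class names are placeholders:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Spark looks for a class named DefaultSource in the data source package
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
    new RethinkRelation(parameters)(sqlContext)
}

class RethinkRelation(parameters: Map[String, String])(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // A real implementation would derive this from the RethinkDB table
  override def schema: StructType = StructType(Seq(StructField("doc", StringType)))

  // A real implementation would return an RDD that reads from RethinkDB
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("placeholder")))
}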

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Thanks, now all working. Also, selects from temp tables are part of sqlContext, not HiveContext. This is the final code that works (in blue). A couple of questions if I may: 1. This works pretty effortlessly in spark-shell. Is this because $CLASSPATH already includes all the needed jars? 2. The im

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Answers to first two questions are 'yes' Not clear on what the 3rd question is asking. On Fri, Mar 4, 2016 at 4:28 PM, Mich Talebzadeh wrote: > Thanks now all working. Also select from tmp tables are part > of sqlContext not HiveContext > > This is the final code that works in blue > > > Coupl

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi Ted, I meant: as we have spark-shell and spark-sql, what is the advantage of building self-contained applications? We still need to submit them via spark-submit. Basically, what is the use case for self-contained programs? That is, we build the code, create the class and run it independently of spark-s

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
#3 If your code is dependent on other projects you will need to package everything together in order to distribute over a Spark cluster. In your example below I don’t see much of an advantage by building a package. > On Mar 5, 2016, at 12:32 AM, Ted Yu wrote: > > Answers to first two questions

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Great, thanks. So roughly this is in line with the usual building of a Java package Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzad

OOM When Running with Mesos Fine-grained Mode

2016-03-04 Thread SLiZn Liu
Hi Spark Mailing List, I’m running terabytes of text files with Spark on Mesos, the job runs fine until we decided to switch to Mesos fine-grained mode. At first glance, we spotted massive number of task lost errors in logs: 16/03/05 04:01:20 ERROR TaskSchedulerImpl: Ignoring update with state L

Re: Best way to merge files from streaming jobs‏ on S3

2016-03-04 Thread Chris Miller
Why does the order matter? Coalesce runs in parallel and if it's just writing to the file, then I imagine it would do it in whatever order it happens to be executed in each thread. If you want to sort the resulting data, I imagine you'd need to save it to some sort of data structure instead of writ