Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Denis Bolshakov
Hello Zhu, Thank you very much for such a detailed explanation and for providing a workaround; it works fine. But since the problem is related to a Scala issue, can we expect a fix in Spark 2.0? Or is it not a good idea to update such an important dependency as Scala in a minor maintenance release? Kind

Re: Pregel Question

2016-11-22 Thread Saliya Ekanayake
Just realized the attached file has the text formatting wrong. The github link to the file is https://github.com/esaliya/graphxprimer/blob/master/src/main/scala-2.10/org/saliya/graphxprimer/PregelExample2.scala On Tue, Nov 22, 2016 at 3:08 PM, Saliya Ekanayake wrote: > Hi, > >

[Spark MLlib]: Does Spark MLlib support nonlinear optimization with nonlinear constraints?

2016-11-22 Thread himanshu.gpt
Hi, Component: Spark MLlib Level: Beginner Scenario: Does Spark MLlib support nonlinear optimization with nonlinear constraints? Our business application supports two types of functions, convex and S-shaped curves, and linear & non-linear constraints. These constraints can be combined with any

Re: Best practice for preprocessing feature with DataFrame

2016-11-22 Thread Yan Facai
Thanks, White. On Thu, Nov 17, 2016 at 11:15 PM, Stuart White wrote: > Sorry. Small typo. That last part should be: > > val modifiedRows = rows > .select( > substring('age, 0, 2) as "age", > when('gender === 1, "male").otherwise(when('gender === 2, >
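For reference, a runnable sketch of the snippet quoted above; the quote is cut off after 'gender === 2, so the "female" and "unknown" fallbacks are assumptions:

    import org.apache.spark.sql.functions._
    import spark.implicits._  // spark is the active SparkSession

    // Derive cleaned feature columns from the raw rows.
    val modifiedRows = rows
      .select(
        substring('age, 0, 2) as "age",
        when('gender === 1, "male")
          .otherwise(
            when('gender === 2, "female").otherwise("unknown")) as "gender")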

Re: Re: Re: Multiple streaming aggregations in structured streaming

2016-11-22 Thread Reynold Xin
It's just the "approx_count_distinct" aggregate function. On Tue, Nov 22, 2016 at 6:51 PM, Xinyu Zhang wrote: > Could you please tell me how to use the approximate count distinct? Is > there any docs? > > Thanks > > > At 2016-11-21 15:56:21, "Reynold Xin"
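A minimal sketch of calling it on a DataFrame (the column names are illustrative; in the 2.0.x Scala DSL the same aggregate is spelled approxCountDistinct, so the SQL name is used via expr here):

    import org.apache.spark.sql.functions.{col, expr}

    // HyperLogLog-based approximate distinct count per group.
    val uniques = events
      .groupBy(col("date"))
      .agg(expr("approx_count_distinct(userId)") as "distinct_users")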

Re: Re: Re: Multiple streaming aggregations in structured streaming

2016-11-22 Thread Xinyu Zhang
Could you please tell me how to use the approximate count distinct? Is there any docs? Thanks At 2016-11-21 15:56:21, "Reynold Xin" wrote: Can you use the approximate count distinct? On Sun, Nov 20, 2016 at 11:51 PM, Xinyu Zhang wrote:

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread yeshwanth kumar
Hi Ayan, thanks for the explanation. I am aware of compression codecs. How is the locality level set? Is it done by Spark or YARN? Please let me know. Thanks, Yesh On Nov 22, 2016 5:13 PM, "ayan guha" wrote: Hi RACK_LOCAL = Task running on the same rack but not on

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread shyla deshpande
Has anyone written a custom sink to persist data to Cassandra from structured streaming? Please provide any link or reference. Thanks On Tue, Nov 22, 2016 at 2:40 PM, Michael Armbrust wrote: > Forgot the link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach

Re: getting error on spark streaming : java.lang.OutOfMemoryError: unable to create new native thread

2016-11-22 Thread Shixiong(Ryan) Zhu
Possibly https://issues.apache.org/jira/browse/SPARK-17396 On Tue, Nov 22, 2016 at 1:42 PM, Mohit Durgapal wrote: > Hi Everyone, > > > I am getting the following error while running a spark streaming example > on my local machine; the data being ingested is only 506kb. > > >

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
Hi, RACK_LOCAL = task running on the same rack but not on the same node where the data is; NODE_LOCAL = task and data are co-located. Probably you were looking for this one? GZIP - the read goes through the GZIP codec, but because it is non-splittable you can have at most one task reading a gzip file. Now, the

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread yeshwanth kumar
Hi Ayan, we have default rack topology. -Yeshwanth Can you Imagine what I would do if I could do all I can - Art of War On Tue, Nov 22, 2016 at 6:37 AM, ayan guha wrote: > Because snappy is not splittable, so single task makes sense. > > Are sure about rack topology?

Re: Any equivalent method for lateral view and explode

2016-11-22 Thread Michael Armbrust
Both collect_list and explode are available in the function library. The following is an example of using them: df.select($"*", explode($"myArray") as 'arrayItem) On Tue, Nov 22, 2016 at 2:42 PM, Mahender Sarangam

Any equivalent method for lateral view and explode

2016-11-22 Thread Mahender Sarangam
Hi, We are converting our Hive logic, which uses the lateral view and explode functions. Is there any builtin function in Scala for performing lateral view explode? Below is our query in Hive. temparray is a temp table with c0 and c1 columns. SELECT id, CONCAT_WS(',', collect_list(LineID)) as
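A hedged sketch of the DataFrame equivalent: LATERAL VIEW explode(...) becomes explode() in a select, and the CONCAT_WS/collect_list aggregate carries over directly. Table and column names (temparray, c0, c1, LineID) come from the post; since the Hive query is truncated, the grouping key is an assumption:

    import org.apache.spark.sql.functions._

    // c1 is assumed to be an array column; each element becomes a row.
    val exploded = temparray
      .select(col("c0") as "id", explode(col("c1")) as "lineId")

    // Re-aggregate the exploded rows back into a comma-joined string.
    val result = exploded
      .groupBy(col("id"))
      .agg(concat_ws(",", collect_list(col("lineId"))) as "lineIds")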

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-22 Thread Michael Armbrust
The first release candidate should be coming out this week. You can subscribe to the dev list if you want to follow the release schedule. On Mon, Nov 21, 2016 at 9:34 PM, kant kodali wrote: > Hi Michael, > > I only see spark 2.0.2 which is what I am using currently. Any idea

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread Michael Armbrust
We are looking to add a native JDBC sink in Spark 2.2. Until then you can write your own connector using df.writeStream.foreach. On Tue, Nov 22, 2016 at 12:55 PM, shyla deshpande wrote: > Hi, > > Structured streaming works great with Kafka source but I need to persist
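A minimal df.writeStream.foreach sketch (the JDBC URL and INSERT statement are placeholders; a Cassandra writer would open its session in open() instead):

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      var conn: Connection = _

      // Called once per partition per trigger; return false to skip.
      def open(partitionId: Long, version: Long): Boolean = {
        conn = DriverManager.getConnection("jdbc:postgresql://host/db")
        true
      }

      // Called for every row in the partition.
      def process(row: Row): Unit = {
        val stmt = conn.prepareStatement("INSERT INTO events VALUES (?)")
        stmt.setString(1, row.getString(0))
        stmt.executeUpdate()
        stmt.close()
      }

      def close(errorOrNull: Throwable): Unit = conn.close()
    }

    df.writeStream.foreach(writer).start()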

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread Michael Armbrust
Forgot the link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach On Tue, Nov 22, 2016 at 2:40 PM, Michael Armbrust wrote: > We are looking to add a native JDBC sink in Spark 2.2. Until then you can > write your own

getting error on spark streaming : java.lang.OutOfMemoryError: unable to create new native thread

2016-11-22 Thread Mohit Durgapal
Hi Everyone, I am getting the following error while running a spark streaming example on my local machine; the data being ingested is only 506kb. 16/11/23 03:05:54 INFO MappedDStream: Slicing from 1479850537180 ms to 1479850537235 ms (aligned to 1479850537180 ms and 1479850537235 ms) Exception

How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread shyla deshpande
Hi, Structured streaming works great with a Kafka source, but I need to persist the data after processing in some database like Cassandra, or at least Postgres. Any suggestions or help please. Thanks

Re: How to write a custom file system?

2016-11-22 Thread Steve Loughran
On 21 Nov 2016, at 17:26, Samy Dindane wrote: Hi, I'd like to extend the file:// file system and add some custom logic to the API that lists files. I think I need to extend FileSystem or LocalFileSystem from org.apache.hadoop.fs, but I am not sure
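A skeletal sketch of that subclassing approach (the .tmp filter is just an example of "custom logic"; which method to override depends on what the listing logic needs):

    import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

    // Extends the local file system and intercepts directory listings,
    // e.g. to hide files that are still being written.
    class FilteredFileSystem extends LocalFileSystem {
      override def listStatus(f: Path): Array[FileStatus] =
        super.listStatus(f).filterNot(_.getPath.getName.endsWith(".tmp"))
    }

The subclass would then be bound to a scheme through the Hadoop fs.<scheme>.impl configuration key.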

Fault-tolerant Accumulators in stateful operators.

2016-11-22 Thread Amit Sela
Hi all, To recover (functionally) Accumulators from Driver failure in a streaming application, we wrap them in a "getOrCreate" Singleton as shown here.
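A sketch of that "getOrCreate" singleton pattern (adapted from the recoverable-accumulator example in the Spark streaming programming guide):

    import org.apache.spark.SparkContext
    import org.apache.spark.util.LongAccumulator

    // Lazily instantiated singleton; re-created on the new driver
    // after a restart from checkpoint instead of being deserialized.
    object DroppedRecordsCounter {
      @volatile private var instance: LongAccumulator = null

      def getInstance(sc: SparkContext): LongAccumulator = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.longAccumulator("DroppedRecordsCounter")
            }
          }
        }
        instance
      }
    }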

Pregel Question

2016-11-22 Thread Saliya Ekanayake
Hi, I've created a graph with vertex data of the form (Int, Array[Int]). I am using the pregel operator to update values of the array for each vertex. So my vprog has the following signature. Note the message is a map of arrays from neighbors def vprog(vertexId: VertexId, value: (Int,
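A hedged sketch of how such a vprog might look; only the (Int, Array[Int]) vertex type is given in the post, so the message type and the element-wise update rule are assumptions:

    import org.apache.spark.graphx.VertexId

    // Message: a map from neighbor id to that neighbor's array.
    def vprog(
        vertexId: VertexId,
        value: (Int, Array[Int]),
        message: Map[VertexId, Array[Int]]): (Int, Array[Int]) = {
      val (tag, arr) = value
      // Illustrative rule: element-wise max over all neighbor arrays.
      val updated = message.values.foldLeft(arr) { (acc, m) =>
        acc.zip(m).map { case (a, b) => math.max(a, b) }
      }
      (tag, updated)
    }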

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Shixiong(Ryan) Zhu
The workaround is defining the imports and class together using ":paste". On Tue, Nov 22, 2016 at 11:12 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote: > This relates to a known issue: https://issues.apache.org/jira/browse/SPARK-14146 and https://issues.scala-lang.org/browse/SI-9799

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Shixiong(Ryan) Zhu
This relates to a known issue: https://issues.apache.org/jira/browse/SPARK-14146 and https://issues.scala-lang.org/browse/SI-9799 On Tue, Nov 22, 2016 at 6:37 AM, dbolshak wrote: > Hello, > > We have the same issue, > > We use latest release 2.0.2. > > Setup with

parallelizing model training ..

2016-11-22 Thread debasishg
Hello - I have a question on parallelization of model training in Spark .. Suppose I have this code fragment for training a model with KMeans .. labeledData.foreachRDD { rdd => val normalizedData: RDD[Vector] = normalize(rdd) val trainedModel: KMeansModel = trainModel(normalizedData,
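A hedged completion of that fragment (normalize is the poster's helper; k and the iteration count are made up). Note that each KMeans.train call is itself distributed across the executors, even though the per-batch calls run one after another:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // One model per micro-batch; the fit runs in parallel on the cluster.
    labeledData.foreachRDD { rdd =>
      val normalizedData: RDD[Vector] = normalize(rdd)
      val trainedModel: KMeansModel = KMeans.train(normalizedData, 10, 20)
      // ... persist or broadcast trainedModel here ...
    }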

Re: two spark-shells spark on mesos not working

2016-11-22 Thread Michael Gummelt
What are the full driver logs? If you enable DEBUG logging, it should give you more information about the rejected offers. This can also happen if offers are being accepted, but tasks immediately die for some reason. You should check the Mesos UI for failed tasks. If they exist, please include

Re: Cluster deploy mode driver location

2016-11-22 Thread Masood Krohy
You may also try distributing your JARs along with your Spark app; see the options below. You put whatever is necessary on the client node and submit it all in each run. There is also a --files option, which you can remove below, but it may be helpful for some configs. You do not need to
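A hedged sketch of the kind of submit command being described (all paths and the class name are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --jars /client/path/dep1.jar,/client/path/dep2.jar \
      --files /client/path/app.conf \
      --class com.example.Main \
      /client/path/app.jar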

how does create dataframe from scala collection handle executor failure?

2016-11-22 Thread Mendelson, Assaf
Let's say I have a loop that reads some data from somewhere, stores it in a collection, and creates a dataframe from it. Then an executor containing part of the dataframe dies. How does spark handle it? For example: val dfSeq = for { i <- 0 to 1000
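A hedged reconstruction of that loop (fetchChunk is a hypothetical driver-side read returning a local Seq):

    import spark.implicits._  // spark is the active SparkSession

    // Each chunk is read on the driver, then turned into a DataFrame.
    val dfSeq = for {
      i <- 0 to 1000
      data = fetchChunk(i)  // hypothetical: Seq[(Int, String)]
    } yield data.toDF("id", "value")

    val combined = dfSeq.reduce(_ union _)

Since the rows originate in driver memory, a partition lost with an executor should be rebuilt by re-shipping that slice from the driver rather than by re-running the read.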

Re: find outliers within data

2016-11-22 Thread Yong Zhang
Spark Dataframe window functions? https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html To use window
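For instance, a hedged outlier-flagging sketch using those window functions ("group" and "value" are assumed column names):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Flag values more than three standard deviations from their
    // group's mean.
    val w = Window.partitionBy(col("group"))

    val flagged = df
      .withColumn("mean", avg(col("value")).over(w))
      .withColumn("std", stddev(col("value")).over(w))
      .withColumn("outlier", abs(col("value") - col("mean")) > col("std") * 3)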

Re: Is there a processing speed difference between DataFrames and Datasets?

2016-11-22 Thread Sean Owen
DataFrames are a narrower, more specific type of abstraction, for tabular data. Where your data is tabular, it makes more sense to use them, especially because this knowledge means a lot more can be optimized under the hood for you, whereas the framework can do nothing with an RDD of arbitrary objects.

Is there a processing speed difference between DataFrames and Datasets?

2016-11-22 Thread jggg777
I've seen a number of visuals showing the processing-time benefits of using Datasets+DataFrames over RDDs, but I'd assume that there are performance benefits to using a defined case class instead of a generic Dataset[Row]. The tale of three Spark APIs post mentions "If you want higher degree of
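As a hedged illustration of the trade-off in question (Person is an invented schema):

    import spark.implicits._  // spark is the active SparkSession

    // A typed view over the same rows: field names and types are
    // checked at compile time.
    case class Person(name: String, age: Int)

    val people = df.as[Person]                // Dataset[Person]
    val adults = people.filter(_.age >= 18)   // typed lambda

One caveat: typed lambdas like the filter above are opaque to Catalyst, so Column-based expressions on a Dataset[Row] can still optimize better than the typed API.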

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread dbolshak
Hello, We have the same issue. We use the latest release, 2.0.2; the setup with 1.6.1 works fine. Could somebody provide a workaround for how to fix this? Kind regards, Denis

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Denis Bolshakov
Hello, We have the same issue. We use the latest release, 2.0.2; the setup with 1.6.1 works fine. Could somebody provide a workaround for how to fix this? Kind regards, Denis On 21 November 2016 at 20:23, jggg777 wrote: > I'm simply pasting in the UDAF example from this page and

[Spark Streaming] map and window operation on DStream only process one batch

2016-11-22 Thread Hao Ren
Spark Streaming v1.6.2, Kafka v0.10.1. I am reading msgs from Kafka. What surprised me is that the following DStream only processes the first batch. KafkaUtils.createDirectStream[ String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, Set(topic)) .map(_._2)
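A hedged completion of that pipeline (the window lengths are illustrative); one thing worth checking in this situation is that both the window and slide durations are multiples of the batch interval:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kafka.KafkaUtils

    val lines = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
      streamingContext, kafkaParams, Set(topic))

    // Both durations must be multiples of the batch interval.
    val windowed = lines
      .map(_._2)
      .window(Seconds(60), Seconds(10))

    windowed.foreachRDD(rdd => println(rdd.count()))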

Re: Cluster deploy mode driver location

2016-11-22 Thread Silvio Fiorito
Hi Saif! Unfortunately, I don't think this is possible in YARN cluster deploy mode. Regarding the JARs you're referring to, can you place them on HDFS so you have them in a central location and can refer to them that way for dependencies?

two spark-shells spark on mesos not working

2016-11-22 Thread John Yost
Hi Everyone, There is probably an obvious answer to this, but not sure what it is. :) I am attempting to launch 2..n spark shells using Mesos as the master (this is to support 1..n researchers running pyspark stuff on our data). I can launch two or more spark shells without any problem. But,

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
Because snappy is not splittable, a single task makes sense. Are you sure about the rack topology? I.e., is 225 in a different rack than 227 or 228? What does your topology file say? On 22 Nov 2016 10:14, "yeshwanth kumar" wrote: > Thanks for your reply, > > i can definitely

Re: Kafka direct approach,App UI shows wrong input rate

2016-11-22 Thread Julian Keppel
Oh, sorry, I made a mistake... It's Spark version 2.0.1, not 2.0.2. When I wrote the initial message I built my app with 2.0.2 and deployed it on a cluster with 2.0.1, so I thought this could be the problem. But now I've changed it and built my app with 2.0.1, and the problem still remains.

Re: newbie question about RDD

2016-11-22 Thread Mohit Durgapal
Hi Raghav, Please refer to the following code: SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("PersonApp"); //creating java spark context JavaSparkContext sc = new JavaSparkContext(sparkConf); //reading file from hdfs into spark rdd, the name node is localhost JavaRDD