Re: mapWithState Hangs with Error cleaning broadcast

2016-11-02 Thread manasdebashiskar
Yes, in my case my StateSpec had a small partition size. I increased numPartitions and the problem went away. (Details of why the problem was happening in the first place are elided.) TL;DR: StateSpec takes a "numPartitions" which can be set to a high enough number.
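For reference, a minimal sketch of where numPartitions is set; the key/value/state types and the DStream name are assumptions, not taken from the original thread:

    import org.apache.spark.streaming.{State, StateSpec}

    // hypothetical mapping function over (key, new value, running state)
    def mappingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    // numPartitions controls how many partitions back the state RDD
    val spec = StateSpec.function(mappingFunc _).numPartitions(200)
    val stateStream = keyedStream.mapWithState(spec)   // keyedStream: DStream[(String, Int)]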

RuntimeException: Null value appeared in non-nullable field when holding Optional Case Class

2016-11-02 Thread Aniket Bhatnagar
Hi all, I am running into a runtime exception when a Dataset holds an empty (None) instance of an Option type whose underlying case class has a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, Dataset[Option[DataRow]] can only hold
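A short sketch of what "non-nullable field" means here, plus a common workaround; the second case class is an illustration, not from the original mail:

    // id is a primitive Int, so Spark treats it as non-nullable in the schema
    case class DataRow(id: Int, value: String)

    // workaround sketch: wrap the primitive in Option so a missing value
    // can be represented without a null Int
    case class NullableDataRow(id: Option[Int], value: String)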

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann > wrote: Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Daniel Siegmann
Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a record separator, what would you use in the case of JSON? If you do some Googling I suspect you'll find some possible solutions.
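A minimal sketch of the expected input shape, assuming a SparkSession named spark and a file where each line is one complete JSON object:

    // newline-delimited JSON: one record per line, no pretty-printing
    val df = spark.read.json("/path/to/records.json")
    df.printSchema()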

BiMap BroadCast Variable - Kryo Serialization Issue

2016-11-02 Thread Kalpana Jalawadi
Hi, I am getting a NullPointerException due to a Kryo serialization issue while trying to read a BiMap broadcast variable. Attached are the code snippets. Pointers shared here didn't help - link1 , link2
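Since the snippets are attached rather than inlined, here is only a hedged sketch of one commonly suggested direction, registering the concrete BiMap class with Kryo; the class and setup here are assumptions about the code:

    import org.apache.spark.SparkConf
    import com.google.common.collect.HashBiMap

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[HashBiMap[_, _]]))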

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
ARGH!! Looks like a formatting issue. Spark doesn’t like ‘pretty’ output. So then the entire record which defines the schema has to be on a single line? Really? On Nov 2, 2016, at 1:50 PM, Michael Segel > wrote: This may be a silly

Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
This may be a silly mistake on my part… Doing an example using Chicago’s Crime data.. (There’s a lot of it going around. ;-) The goal is to read a file containing a JSON record that describes the crime data.csv for ingestion into a data frame, then I want to output to a Parquet file. (Pretty
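A rough sketch of the intended flow, assuming a SparkSession named spark and newline-delimited JSON; paths are placeholders:

    val crimes = spark.read.json("/path/to/chicago_crimes.json")
    crimes.write.parquet("/path/to/chicago_crimes.parquet")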

unsubscribe

2016-11-02 Thread Venkatesh Seshan
unsubscribe

Re: Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data

2016-11-02 Thread Sean Owen
I would also only fit these on training data. There are probably some corner cases where letting these ancillary transforms see test data results in a target leak. Though I can't really think of a good example. More to the point, you're probably fitting these as part of a pipeline and that
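A sketch of the pipeline approach being described, with IDF as the ancillary estimator; column names and the training/test DataFrames are assumptions:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    // fit on training data only; the fitted model then transforms both sets
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
    val model = pipeline.fit(trainingDF)
    val testFeatures = model.transform(testDF)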

Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data

2016-11-02 Thread Nirav Patel
It is very clear that for ML algorithms (classification, regression) the Estimator only fits on training data, but it's not as clear for other estimators like IDF, for example. IDF is a feature transformation model, but having an IDF estimator and transformer makes it a little confusing as to what

Re: How to return a case class in map function?

2016-11-02 Thread Michael Armbrust
That's a bug. Which version of Spark are you running? Have you tried 2.0.2? On Wed, Nov 2, 2016 at 12:01 AM, 颜发才(Yan Facai) wrote: > Hi, all. > When I use a case class as return value in map function, spark always > raise a ClassCastException. > > I write an demo, like: > >

Re: error: Unable to find encoder for type stored in a Dataset. when trying to map through a DataFrame

2016-11-02 Thread Michael Armbrust
Spark doesn't know how to turn a Seq[Any] back into a row. You would need to create a case class or something where we can figure out the schema. What are you trying to do? If you don't care about specific fields and you just want to serialize the type you can use kryo: implicit val anyEncoder
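The message is cut off, but a plausible completion of the kryo fallback looks roughly like this; the encoder name is an assumption:

    import org.apache.spark.sql.{Encoder, Encoders}

    // serialize Seq[Any] as an opaque kryo blob instead of deriving a schema
    implicit val anyEncoder: Encoder[Seq[Any]] = Encoders.kryo[Seq[Any]]
    val mapped = df.map(r => r.toSeq)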

Re: Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-02 Thread Nirav Patel
Thanks! On Tue, Nov 1, 2016 at 6:30 AM, Sean Owen wrote: > CrossValidator splits the data into k sets, and then trains k times, > holding out one subset for cross-validation each time. You are correct that > you should actually withhold an additional test set, before you use

Re: Custom receiver for WebSocket in Spark not working

2016-11-02 Thread kant kodali
I don't see a store() call in your receive(). Search for store() in here http://spark.apache.org/docs/latest/streaming-custom-receivers.html On Wed, Nov 2, 2016 at 10:23 AM, Cassa L wrote: > Hi, > I am using spark 1.6. I wrote a custom receiver to read from WebSocket. > But
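For illustration, a bare-bones receiver skeleton showing where store() belongs; the WebSocket client code itself is omitted and the class name is hypothetical:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class WebSocketReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {
        // connect to the WebSocket here and, for every message received:
        // store(message)   // without this call the DStream stays empty
      }
      def onStop(): Unit = {
        // close the WebSocket connection here
      }
    }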

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-02 Thread Michael David Pedersen
Awesome, thank you Michael for the detailed example! I'll look into whether I can use this approach for my use case. If so, I could avoid the overhead of repeatedly registering a temp table for one-off queries, instead registering the table once and relying on the injected strategy. Don't know

Re: Load whole ALS MatrixFactorizationModel into memory

2016-11-02 Thread Sean Owen
You can cause the underlying RDDs in the model to be cached in memory. That would be necessary but not sufficient to make it go fast; it should at least get rid of a lot of I/O. I think making recommendations one at a time is never going to scale to moderate load this way; one request means one
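A sketch of caching and materializing the model's factor RDDs after loading; the path is a placeholder:

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    val model = MatrixFactorizationModel.load(sc, "/path/to/model")
    model.userFeatures.cache()
    model.productFeatures.cache()
    model.userFeatures.count()      // force materialization
    model.productFeatures.count()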

Custom receiver for WebSocket in Spark not working

2016-11-02 Thread Cassa L
Hi, I am using spark 1.6. I wrote a custom receiver to read from WebSocket. But when I start my spark job, it connects to the WebSocket but doesn't get any message. Same code, if I write as separate scala class, it works and prints messages from WebSocket. Is anything missing in my Spark Code?

error: Unable to find encoder for type stored in a Dataset. when trying to map through a DataFrame

2016-11-02 Thread Daniel Haviv
Hi, I have the following scenario: scala> val df = spark.sql("select * from danieltest3") df: org.apache.spark.sql.DataFrame = [iid: string, activity: string ... 34 more fields] Now I'm trying to map through the rows I'm getting: scala> df.map(r=>r.toSeq) :32: error: Unable to find encoder for

Load whole ALS MatrixFactorizationModel into memory

2016-11-02 Thread Mikael Ståldal
import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel I build a MatrixFactorizationModel with ALS.trainImplicit(), then I save it with its save method. Later, in another process on another machine, I load the model with

Use a specific partition of dataframe

2016-11-02 Thread Yanwei Zhang
Is it possible to retrieve a specific partition (e.g., the first partition) of a DataFrame and apply some function there? My data is too large, and I just want to get some approximate measures using the first few partitions in the data. I'll illustrate what I want to accomplish using the
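One way to sketch this with the RDD API, keeping only partition 0; note the first partition is not a random sample, and df is assumed to be the DataFrame in question:

    val firstPartition = df.rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter else Iterator.empty
    }
    // back to a DataFrame if needed
    val firstPartitionDF = spark.createDataFrame(firstPartition, df.schema)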

Re: How to avoid unnecessary spark startups on every request?

2016-11-02 Thread Vadim Semenov
Take a look at https://github.com/spark-jobserver/spark-jobserver or https://github.com/cloudera/livy - you can launch a persistent Spark context and then submit your jobs using an already running context On Wed, Nov 2, 2016 at 3:34 AM, Fanjin Zeng wrote: > Hi, > > I

Unsubscribe

2016-11-02 Thread srikrishna chaitanya garimella
Unsubscribe

Re: Running Google Dataflow on Spark

2016-11-02 Thread Sean Owen
This is a Dataflow / Beam question, not a Spark question per se. On Wed, Nov 2, 2016 at 11:48 AM Ashutosh Kumar wrote: > I am trying to run Google Dataflow code on Spark. It works fine as google > dataflow on google cloud platform. But while running on Spark I am

Running Google Dataflow on Spark

2016-11-02 Thread Ashutosh Kumar
I am trying to run Google Dataflow code on Spark. It works fine as google dataflow on google cloud platform. But while running on Spark I am getting following error 16/11/02 11:14:32 INFO com.cloudera.dataflow.spark.SparkPipelineRunner: Evaluating ParDo(GroupByKeyHashAndSortByKeyAndWindow)

Need to know about GraphX and Streaming

2016-11-02 Thread Md. Mahedi Kaysar
Hi All, I am new to Spark GraphX. I am trying to understand it for analysing graph streaming data. I know Spark has streaming modules that work on both tabular and DStream mechanisms. I am wondering if it is possible to leverage streaming APIs in GraphX for analysing real-time graph streams.

unsubscribe

2016-11-02 Thread Kunal Gaikwad
unsubscribe Regards, Kunal Gaikwad

[Spark2] huge BloomFilters

2016-11-02 Thread ponkin
Hi, I need to build a huge BloomFilter with 150 million or even more insertions: import org.apache.spark.util.sketch.BloomFilter val bf = spark.read.avro("/hdfs/path").filter("some == 1").stat.bloomFilter("id", 15000, 0.01) if I use keys for serialization implicit val bfEncoder =
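Spelling out the stat call with the parameters named for readability; the avro source assumes the spark-avro package is on the classpath, and the expected-insertions value follows the 150 million figure above rather than the literal in the snippet:

    import org.apache.spark.util.sketch.BloomFilter

    val df = spark.read.format("com.databricks.spark.avro").load("/hdfs/path")
    val expectedInsertions = 150000000L   // ~150 million
    val bf: BloomFilter = df.filter("some == 1").stat.bloomFilter("id", expectedInsertions, 0.01)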

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-02 Thread Mich Talebzadeh
Well if you look at the spark-shell script this is what it says # SPARK-4161: scala does not assume use of the java classpath, # so we need to add the "-Dscala.usejavacp=true" flag manually. We # do this specifically for the Spark shell because the scala REPL # has its own class loader, and any

RE: Add jar files on classpath when submitting tasks to Spark

2016-11-02 Thread Jan Botorek
Thank you for the example. I am able to submit the task when using the --jars parameter as follows: spark-submit --class com.infor.skyvault.tests.LinearRegressionTest --master local --jars path/to/jar/one;path/to/jar/two C:\_resources\spark-1.0-SNAPSHOT.jar -DtrainDataPath="/path/to/model/data"

random idea

2016-11-02 Thread kant kodali
Hi Guys, I have a random idea and it would be great to receive some input. Can we have an HTTP/2-based receiver for Spark Streaming? I am wondering, why not build micro services using Spark when needed? I can see it is not meant for that, but I like to think it can be possible. To be more concrete,

Re: How to avoid unnecessary spark startups on every request?

2016-11-02 Thread vincent gromakowski
Hi, I am currently using akka-http, sending requests to multiple Spark actors that use a preloaded Spark context and the fair scheduler. It's only a prototype and I haven't tested the concurrency, but it seems like one of the right ways to do it. Complete processing time is around 600 ms. The other way would be

How to avoid unnecessary spark startups on every request?

2016-11-02 Thread Fanjin Zeng
Hi, I am working on a project that takes requests from an HTTP server and computes accordingly on Spark. But the problem is, when I receive many requests at the same time, users have to waste a lot of time on the unnecessary startups that occur on each request. Does Spark have built-in job

How to return a case class in map function?

2016-11-02 Thread Yan Facai
Hi, all. When I use a case class as the return value in a map function, Spark always raises a ClassCastException. I wrote a demo, like: scala> case class Record(key: Int, value: String) scala> case class ID(key: Int) scala> val df = Seq(Record(1, "a"), Record(2, "b")).toDF scala> df.map{x =>
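A rough reconstruction of the truncated demo, assuming a SparkSession named spark in the shell:

    case class Record(key: Int, value: String)
    case class ID(key: Int)

    import spark.implicits._
    val df = Seq(Record(1, "a"), Record(2, "b")).toDF()
    // returning a case class from map relies on the product encoder being in scope
    val ids = df.map(x => ID(x.getAs[Int]("key")))
    ids.show()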