Re: spark job automatically killed without rhyme or reason

2016-06-23 Thread Zhiliang Zhu
Thanks a lot for all the comments and the useful information. Yes, I have a lot of experience writing and running Spark jobs; something unstable tends to show up when a job runs on more data or for a longer time. Sometimes it is not okay after resetting some parameter on the command line, but will be okay while

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Jeff Zhang
You need to add spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro to the build path; this is the only thing you need to do manually, if I remember correctly. On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch wrote: > Hi Jeff, > I'd like to understand what

Re: Spark ml and PMML export

2016-06-23 Thread Jayant Shekhar
Thanks a lot, Nick! It's very helpful. On Wed, Jun 22, 2016 at 11:47 PM, Nick Pentreath wrote: > Currently there is no way within Spark itself. You may want to check out > this issue (https://issues.apache.org/jira/browse/SPARK-11171) and here > is an external project

categoricalFeaturesInfo

2016-06-23 Thread pseudo oduesp
Hi, I am a PySpark user and I want to test the RandomForest algorithms. I found the parameter categoricalFeaturesInfo; how can I use it given a list of categorical variables? Thanks.
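For what it's worth, categoricalFeaturesInfo is a map from the index of a categorical feature column to its number of categories; in PySpark it is just a plain dict, e.g. {0: 4, 3: 10}. A minimal Scala sketch with MLlib's RandomForest (the training RDD and the feature indices/cardinalities are hypothetical):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // assumed to exist: an RDD of labeled points whose features 0 and 3 are categorical
    val trainingData: RDD[LabeledPoint] = ???

    val model = RandomForest.trainClassifier(
      trainingData,
      2,                      // numClasses
      Map(0 -> 4, 3 -> 10),   // categoricalFeaturesInfo: feature index -> number of categories
      100,                    // numTrees
      "auto",                 // featureSubsetStrategy
      "gini",                 // impurity
      5,                      // maxDepth
      32)                     // maxBins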

Re: Spark ml and PMML export

2016-06-23 Thread Nick Pentreath
Currently there is no way within Spark itself. You may want to check out this issue (https://issues.apache.org/jira/browse/SPARK-11171) and here is an external project working on it (https://github.com/jpmml/jpmml-sparkml), that covers quite a number of transformers and models (but not all). On
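For reference, a rough sketch of the jpmml-sparkml usage at the time; the class and method names below come from that external project (not Spark) and may have changed, so treat them as assumptions:

    import org.dmg.pmml.PMML
    import org.jpmml.sparkml.ConverterUtil
    import org.jpmml.model.JAXBUtil
    import javax.xml.transform.stream.StreamResult
    import java.io.FileOutputStream

    // df is the training DataFrame and pipelineModel a fitted org.apache.spark.ml.PipelineModel
    val pmml: PMML = ConverterUtil.toPMML(df.schema, pipelineModel)

    // marshal the PMML document to XML
    val out = new FileOutputStream("model.pmml")
    JAXBUtil.marshalPMML(pmml, new StreamResult(out))
    out.close()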

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
Hi Jeff, I'd like to understand what may be different. I have rebuilt and reimported many times. Just now I blew away the .idea/* and *.iml to start from scratch. I just opened the $SPARK_HOME directory from intellij File | Open . After it finished the initial import I tried to run one of

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
Thanks Jeff - I remember that now from long time ago. After making that change the next errors are: Error:scalac: missing or invalid dependency detected while loading class file 'RDDOperationScope.class'. Could not access term fasterxml in package com, because it (or its dependencies) are

How does Spark Streaming updateStateByKey or mapWithState scale with state size?

2016-06-23 Thread Martin Eden
Hi all, It is currently difficult to understand from the Spark docs or the materials online that I came across, how the updateStateByKey and mapWithState operators in Spark Streaming scale with the size of the state and how to reason about sizing the cluster appropriately. According to this

Re: Spark ml and PMML export

2016-06-23 Thread Jayant Shekhar
Thanks Philippe! Looking forward to trying it out. I am on >= 1.6 Jayant On Thu, Jun 23, 2016 at 1:24 AM, philippe v wrote: > Hi, > > You can try this lib : https://github.com/jpmml/jpmml-sparkml > > I'll try it soon... you need to be in >=1.6 > > Philippe > > > > -- >

Re: Spark ml and PMML export

2016-06-23 Thread philippe v
Hi, You can try this lib : https://github.com/jpmml/jpmml-sparkml I'll try it soon... you need to be in >=1.6 Philippe

LogisticRegression.scala ERROR, require(Predef.scala)

2016-06-23 Thread Ascot Moss
Hi, My Spark is 1.5.2; when trying MLlib, I got the following error. Any idea how to fix it? Regards == 16/06/23 16:26:20 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) java.lang.IllegalArgumentException: requirement failed at

Re: Does saveAsHadoopFile depend on master?

2016-06-23 Thread Pierre Villard
Hi guys! I solved my issue, that was indeed a permission problem. Thanks for your help. 2016-06-22 9:27 GMT+02:00 Spico Florin : > Hi! > I had a similar issue when the user that submit the job to the spark > cluster didn't have permission to write into the hdfs. If you

Confusion regarding sc.accumulableCollection(mutable.ArrayBuffer[String]()) type

2016-06-23 Thread Daniel Haviv
Hi, I want to use an accumulableCollection of type mutable.ArrayBuffer[String] to return invalid records detected during transformations, but I don't quite get its type: val errorAccumulator: Accumulable[ArrayBuffer[String], String] = sc.accumulableCollection(mutable.ArrayBuffer[String]())
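A minimal sketch of how the two type parameters line up, assuming Spark 1.x (accumulableCollection was deprecated in 2.0): the first parameter is the accumulated collection (ArrayBuffer[String]) and the second is the element type you add with +=. The input path and validity check below are made up.

    import scala.collection.mutable
    import org.apache.spark.Accumulable

    val errorAccumulator: Accumulable[mutable.ArrayBuffer[String], String] =
      sc.accumulableCollection(mutable.ArrayBuffer[String]())

    // hypothetical input and validity check
    val valid = sc.textFile("input.txt").flatMap { line =>
      if (line.split(",").length == 3) Some(line)
      else { errorAccumulator += line; None }   // += appends one String to the buffer
    }

    valid.count()                     // accumulators only update once an action runs
    println(errorAccumulator.value)   // the full ArrayBuffer[String], readable on the driver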

Performance issue with spark ml model to make single predictions on server side

2016-06-23 Thread philippe v
Hello, I trained a linear regression model with spark-ml. I serialized the model pipeline with classical Java serialization. Then I loaded it in a webservice to compute predictions. For each request received by the webservice I create a one-row DataFrame to compute that prediction. Problem is
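To make the scenario concrete, the per-request path described above looks roughly like the sketch below (the case class and column names are made up); building and planning a fresh one-row DataFrame per request is typically where the latency goes.

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SQLContext

    case class Features(x1: Double, x2: Double, x3: Double)   // hypothetical input schema

    // called once per web request -- the slow pattern being described
    def predict(model: PipelineModel, sqlContext: SQLContext, f: Features): Double = {
      val df = sqlContext.createDataFrame(Seq(f))              // one-row DataFrame per request
      model.transform(df).select("prediction").head().getDouble(0)
    }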

Spark Thrift Server Concurrency

2016-06-23 Thread Prabhu Joseph
Hi All, On submitting the same SQL query 20 times in parallel to the Spark Thrift Server, the execution time for some queries is less than a second and for some is more than 2 seconds. The Spark Thrift Server logs show all 20 queries are submitted at the same time, 16/06/23 12:12:01, but the result schema are at

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Mich Talebzadeh
Which version of Spark, and are you using YARN in client mode or cluster mode? Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Change from distributed.MatrixEntry to Vector

2016-06-23 Thread Pasquinell Urbani
Hello all, I have to build an item-based recommendation system. First I obtained the similarity matrix with the DIMSUM cosine-similarity solution from Twitter (https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum). The similarity matrix is in the following format:
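A hedged sketch of one way to go from MatrixEntry records to one sparse Vector per row, assuming the similarity matrix is available as a CoordinateMatrix (variable names are placeholders):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD

    val similarityMatrix: CoordinateMatrix = ???              // assumed: the DIMSUM output
    val numCols = similarityMatrix.numCols().toInt

    // group (row, col, value) entries by row index and build a sparse vector per row
    val rowVectors: RDD[(Long, Vector)] = similarityMatrix.entries
      .map(e => (e.i, (e.j.toInt, e.value)))
      .groupByKey()
      .mapValues(cols => Vectors.sparse(numCols, cols.toSeq))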

Re: spark job automatically killed without rhyme or reason

2016-06-23 Thread Aakash Basu
Hey, I've come across this. There's a command called "yarn application -kill ", which kills the application with a one-liner 'Killed'. If it is a memory issue, the error shows up in the form of 'GC Overhead', or while forming up a tree, or something of the sort. So, I think someone killed your job by that

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, There are a lot of moving parts and a lot of unknowns from your description. Besides the version stuff. How many executors, how many cores? How much memory? Are you persisting (memory and disk) or just caching (memory) During the execution… same tables… are you seeing a lot of

Ideas to put a Spark ML model in production

2016-06-23 Thread Saurabh Sardeshpande
Hi all, How do you reliably deploy a spark model in production? Let's say I've done a lot of analysis and come up with a model that performs great. I have this "model file" and I'm not sure what to do with it. I want to build some kind of service around it that takes some inputs, converts them
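One common option, assuming Spark 1.6+ and that every stage in the pipeline supports ML persistence, is the built-in save/load instead of ad-hoc serialization; the variable names and paths below are placeholders.

    import org.apache.spark.ml.{Pipeline, PipelineModel}

    // training side: fit and persist the whole pipeline model
    val fitted: PipelineModel = pipeline.fit(trainingDF)      // pipeline and trainingDF assumed to exist
    fitted.write.overwrite().save("/models/my-model")

    // serving side: load it back and score new data with the same transformations
    val model = PipelineModel.load("/models/my-model")
    val scored = model.transform(newDataDF)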

Option Encoder

2016-06-23 Thread Richard Marscher
Is there a proper way to make or get an Encoder for Option in Spark 2.0? There isn't one by default, and while ExpressionEncoder from catalyst will work, it is private and unsupported.

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
I just checked out completely fresh directory and created new IJ project. Then followed your tip for adding the avro source. Here is an additional set of errors Error:(31, 12) object razorvine is not a member of package net import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler}

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-23 Thread Nirav Patel
So it was indeed my merge function. I created a new result object for every merge and it's working now. Thanks On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel wrote: > PS. In my reduceByKey operation I have two mutable objects. What I do is > merge mutable2 into mutable1 and

Data Frames Join by more than one column

2016-06-23 Thread Andrés Ivaldi
Hello, I've been trying to join (left_outer) DataFrames by two columns, and the result is not as expected. I'm doing this: dfo1.get.join(dfo2.get, dfo1.get.col("Col1Left").equalTo(dfo2.get.col("Col1Right")).and(dfo1.get.col("Col2Left").equalTo(dfo2.get.col("Col2Right"))) ,

Re: Option Encoder

2016-06-23 Thread Koert Kuipers
An implicit encoder for Option[X] given an implicit encoder for X would be nice; I run into this often too. I do not think it exists. Your best bet is to hope ExpressionEncoder will do... On Thu, Jun 23, 2016 at 2:16 PM, Richard Marscher wrote: > Is there a proper way to
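A sketch of that unsupported route, assuming Spark 2.0 and accepting that ExpressionEncoder is a private catalyst API with no compatibility guarantees (top-level Option support has been shaky, so it may still fail for some types):

    import scala.reflect.runtime.universe.TypeTag
    import org.apache.spark.sql.Encoder
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

    // derive an Encoder[Option[T]] reflectively for types catalyst can already handle
    implicit def optionEncoder[T: TypeTag]: Encoder[Option[T]] = ExpressionEncoder[Option[T]]()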

Re: Multiple compute nodes in standalone mode

2016-06-23 Thread Ted Yu
Have you looked at: https://spark.apache.org/docs/latest/spark-standalone.html On Thu, Jun 23, 2016 at 12:28 PM, avendaon wrote: > Hi all, > > I have a cluster that has multiple nodes, and the data partition is > unified, > therefore all my nodes in my computer can access

Re: Kryo ClassCastException during Serialization/deserialization in Spark Streaming

2016-06-23 Thread Ted Yu
Can you illustrate how sampleMap is populated ? Thanks On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote: > Hi, > > I keep getting the following error in my Spark Streaming every now and then > after the job runs for say around 10 hours. I have those 2 classes >

Confusion about spark.shuffle.memoryFraction and spark.storage.memoryFraction

2016-06-23 Thread Darshan Singh
These are 2 parameters and their default values are 0.6 and 0.2, which together is around 80%. I am wondering where the remaining 20% goes. Is it for the JVM's other memory requirements? If yes, then what is spark.memory.fraction used for? My understanding is that if we have 10GB of memory per executor

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
Also, just to keep it simple, I am trying to use 1.6.0-cdh5.7.0 in the pom.xml, as the cluster I am trying to run on is CDH 5.7.0 with Spark 1.6.0. Here is my pom setting: 1.6.0-cdh5.7.0 org.apache.spark spark-core_2.10 ${cdh.spark.version} compile org.apache.spark

Re: Data Frames Join by more than one column

2016-06-23 Thread Andrés Ivaldi
Ahh, my mistake I think. As the data came from SQL, one of the columns is char(7) but not all characters are occupied, and the join was with a varchar(30) column, so what happened was the comparison looked like "xx " == "xx"; applying trim to the char column works. On Thu, Jun 23, 2016 at 2:57 PM,
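For reference, a sketch of the trimmed two-column left_outer join (df1/df2 stand in for dfo1.get/dfo2.get from the earlier post; trim comes from org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.trim

    val joined = df1.join(
      df2,
      trim(df1("Col1Left")) === df2("Col1Right") &&
        df1("Col2Left") === df2("Col2Right"),
      "left_outer")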

Partitioning in spark

2016-06-23 Thread Darshan Singh
Hi, My default parallelism is 100. Now I join 2 DataFrames with 20 partitions each, and the joined DataFrame has 100 partitions. I want to know how to keep it at 20 (other than repartition and coalesce). Also, when I join these 2 DataFrames I am using 4 columns as join columns. The DataFrames

Re: Kryo ClassCastException during Serialization/deserialization in Spark Streaming

2016-06-23 Thread swetha kasireddy
sampleMap is populated from inside a method that is getting called from updateStateByKey On Thu, Jun 23, 2016 at 1:13 PM, Ted Yu wrote: > Can you illustrate how sampleMap is populated ? > > Thanks > > On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote:

destroyPythonWorker job in PySpark

2016-06-23 Thread Krishna
Hi, I am running a PySpark app with thousands of cores (the number of partitions is a small multiple of the number of cores) and the overall application performance is fine. However, I noticed that, at the end of the job, PySpark initiates job clean-up procedures, and as part of this procedure, PySpark executes a job shown

Custom Optimizer

2016-06-23 Thread Stephen Boesch
My team has a custom optimization routine that we would have wanted to plug in as a replacement for the default LBFGS / OWLQN for use by some of the ml/mllib algorithms. However it seems the choice of optimizer is hard-coded in every algorithm except LDA: and even in that one it is only a

RDD, Dataframe and Parquet order

2016-06-23 Thread tuxx
1. Does rdd.collect() return the lines in the same order as they are in the input file? 2. Does df1.collect() return the rows in the same order as they are in rdd.collect()? 3. Does df2.collect() return the rows in the same order as they are in df1.collect()? Please support your answers with

Re: LogisticRegression.scala ERROR, require(Predef.scala)

2016-06-23 Thread Bryan Cutler
The stack trace you provided seems to hint that you are calling "predict" on an RDD with Vectors that are not the same size as the number of features in your trained model, they should be equal. If that's not the issue, it would be easier to troubleshoot if you could share your code and possibly
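A quick, hedged way to check for that mismatch, assuming an MLlib LogisticRegressionModel named model and an RDD[LabeledPoint] named testData (if your version does not expose numFeatures, comparing against model.weights.size works for a binary model):

    // count rows whose feature vector length differs from what the model was trained on
    val expected = model.numFeatures
    val badRows = testData.filter(_.features.size != expected).count()
    println(s"model expects $expected features; $badRows rows have a different size")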

Databricks' 2016 Survey on Apache Spark

2016-06-23 Thread Jules Damji
Hi All, We at Databricks are running a short survey to understand users’ needs and usage of Apache Spark. Because we value community feedback, this survey will help us both to understand usage of Spark and to direct our future contributions to it. If you have a moment, please take some time

Spark SQL Hive Authorization

2016-06-23 Thread rmenon
Hello, We are trying to configure Spark SQL + Ranger for Hive authorization, with no progress. Spark SQL is communicating with the Hive metastore without authorization with no issues. However, all authorization setup is being ignored. We have currently tried configuring the following

Multiple compute nodes in standalone mode

2016-06-23 Thread avendaon
Hi all, I have a cluster that has multiple nodes, and the data partition is unified, so all the nodes in my cluster can access the data I am working on. Right now, I run Spark on a single node, and it works beautifully. My question is: is it possible to run Spark using multiple compute

Kryo ClassCastException during Serialization/deserialization in Spark Streaming

2016-06-23 Thread SRK
Hi, I keep getting the following error in my Spark Streaming every now and then after the job runs for say around 10 hours. I have those 2 classes registered in kryo as shown below. sampleMap is a field in SampleSession as shown below. Any suggestion as to how to avoid this would be of great
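For context, the registration described above usually looks like the sketch below; Sample and SampleSession are stand-ins for the poster's classes and are shown only to make the setup concrete.

    import org.apache.spark.SparkConf
    import scala.collection.mutable

    // stand-ins for the real classes (the real SampleSession holds the sampleMap field)
    case class Sample(id: String, value: Double)
    case class SampleSession(sessionId: String, sampleMap: mutable.Map[String, Sample])

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[SampleSession], classOf[Sample]))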

Re: Partitioning in spark

2016-06-23 Thread ayan guha
You can change parallelism like the following: conf = SparkConf() conf.set('spark.sql.shuffle.partitions', 10) sc = SparkContext(conf=conf) On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh wrote: > Hi, > > My default parallelism is 100. Now I join 2 dataframes with 20
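The same setting can also be changed at runtime on an existing SQLContext (a sketch, assuming Spark 1.x; the value is passed as a string):

    // keep the joined DataFrame at 20 partitions instead of the default
    sqlContext.setConf("spark.sql.shuffle.partitions", "20")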