Re: Is there synchronous way to predict against model for real time data

2016-12-15 Thread vincent gromakowski
Something like that? https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/ On 16 Dec 2016 1:08 AM, "suyogchoudh...@gmail.com" < suyogchoudh...@gmail.com> wrote: > Hi, > > I have a question about how I can make real-time decisions using a model I > have

Re: When will multiple aggregations be supported in Structured Streaming?

2016-12-15 Thread Lawrence Wagerfield
We have a stream of products, each with an ID, and each product has a price which may be updated. We want a running count of the number of products over £30. Schema: Product(productID: Int, price: Int) To handle these updates, we currently have… —— val products =

Re: When will multiple aggregations be supported in Structured Streaming?

2016-12-15 Thread Mattz
I had the same question too. My use case is to take a streaming source, perform a few steps (some aggregations and transformations), and send it to multiple output sources. On Fri, Dec 16, 2016 at 3:58 AM, Michael Armbrust wrote: > What is your use case?

Can't use RandomForestClassificationModel.predict(Vector v) in scala

2016-12-15 Thread Patrick Chen
Hi All, I'm writing Java to use Spark 2.0's RandomForestClassificationModel. After I trained the model, I can use the code below to predict: RandomForestClassificationModel rfModel = RandomForestClassificationModel.load(modelPath); Vector v = Vectors.sparse(FeatureIndex.TOTAL_INDEX.getIndex(),

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread KhajaAsmath Mohammed
I am trying to save the files as Parquet. On Thu, Dec 15, 2016 at 10:41 PM, Felix Cheung wrote: > What is the format? > > > -- > *From:* KhajaAsmath Mohammed > *Sent:* Thursday, December 15, 2016 7:54:27 PM

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread Felix Cheung
What is the format? From: KhajaAsmath Mohammed Sent: Thursday, December 15, 2016 7:54:27 PM To: user @spark Subject: Spark Dataframe: Save to hdfs is taking long time Hi, I am facing an issue while saving the dataframe back to HDFS. It's

Re: How to load edge with properties file using GraphX

2016-12-15 Thread Felix Cheung
Have you checked out https://github.com/graphframes/graphframes? It might be easier to work with DataFrames. From: zjp_j...@163.com Sent: Thursday, December 15, 2016 7:23:57 PM To: user Subject: How to load edge with properties file using

Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread KhajaAsmath Mohammed
Hi, I am facing an issue while saving the dataframe back to HDFS. It's taking a long time to run. val results_dataframe = sqlContext.sql("select gt.*,ct.* from PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where gt.vin=pt.vin and pt.cluster=ct.cluster")

How to load edge with properties file using GraphX

2016-12-15 Thread zjp_j...@163.com
Hi, I want to load an edge file and a vertex-attributes file as follows; how can I use these two files to create a Graph? edge file -> "SrcId,DestId,properties... " vertex attributes file -> "VID, properties..." I learned that there is a GraphLoader object that can load an edge file

How to dynamically register a UDF via reflection?

2016-12-15 Thread 李斌松
How can I dynamically register a UDF via reflection? java.lang.UnsupportedOperationException: Schema for type _$13 is not supported at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:153) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)

how to see cpu time of application on yarn mode

2016-12-15 Thread 黄晓萌
Hi, I run a Spark job in yarn mode and want to collect the total CPU time of the application. I can't find CPU time in the Spark history server. Is there a way to see the CPU time? -- Thanks, Xiaomeng

Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Ulanov, Alexander
Hi Bertrand, We only do inference. We do not do structure or parameter estimation (or learning) - that for the MRF would be estimation of the factors, and the structure of the graphical model. The parameters can be estimated using maximum likelihood if data is available for all the nodes, or

Is there synchronous way to predict against model for real time data

2016-12-15 Thread suyogchoudh...@gmail.com
Hi, I have a question about how I can make real-time decisions using a model I have created with Spark ML. 1. I have some data and created a model using it. // Train the model val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData) 2. I believe I can use spark
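[Editor's note: for a single incoming record there is no need for an RDD at all — the MLlib LogisticRegressionModel returned by run() exposes a local, synchronous model.predict(v: Vector) call. A Spark-free sketch of what that call computes for a binary model; the weights and intercept here are hypothetical stand-ins for the trained model's values:]

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, intercept, features, threshold=0.5):
    """Binary logistic prediction: dot(weights, features) + intercept,
    then threshold -- the local computation behind model.predict(v)."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 if sigmoid(margin) >= threshold else 0.0

# Hypothetical trained values; a real app would read them from the model.
label = predict([2.0, -1.0], 0.0, [3.0, 1.0])
```

Because this is a plain local computation, it can sit behind any synchronous request/response endpoint without touching the cluster.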

Why foreachPartition function make duplicate invocation to map function for every message ? (Spark 2.0.2)

2016-12-15 Thread Michael Nguyen
I have the following sequence of Spark Java API calls (Spark 2.0.2): 1. A Kafka stream that is processed via a map function, which returns the string value from tuple2._2() for JavaDStream, as in return tuple2._2(); 2. The returned JavaDStream is then processed by foreachPartition,

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
OK, thanks. So here is what I understood. 1) The input data to ALS.fit(implicitPrefs=True) is the actual strengths (count data). So if I have a matrix of (user, item, views/purchases) I pass that as the input, not the binarized one (preference). This signifies the strength. 2) Since we also pass the

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Sean Owen
No, the inputs are weights or strengths. The output is a factorization of the binarization of that to 0/1, not probabilities or a factorization of the input. This explains the range of the output. On Thu, Dec 15, 2016, 23:43 Manish Tripathi wrote: > when you say *implicit ALS *is*

Re: Dataset encoders for further types?

2016-12-15 Thread Michael Armbrust
I would have sworn there was a ticket, but I can't find it. So here you go: https://issues.apache.org/jira/browse/SPARK-18891 A workaround until that is fixed would be for you to manually specify the Kryo encoder

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
When you say *implicit ALS is factoring the 0/1 matrix*, are you saying that for the implicit-feedback algorithm we need to pass the input data as the preference matrix, i.e. a matrix of 0s and 1s? Then how will it calculate the confidence matrix, which is basically = 1 + alpha*count matrix? If we don't
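[Editor's note: in the Hu, Koren & Volinsky formulation that implicit ALS follows, both matrices are derived internally from the raw counts you pass in — the preference is the binarized count and the confidence is 1 + alpha * count — so you supply counts, not the 0/1 matrix. A small sketch; the alpha value is a hypothetical choice, not from the thread:]

```python
ALPHA = 40.0  # hypothetical confidence scaling factor (ALS's alpha parameter)

def preference(r):
    """Binarized preference p derived from the raw interaction count r."""
    return 1.0 if r > 0 else 0.0

def confidence(r):
    """Confidence c = 1 + alpha * r, also derived from the raw count."""
    return 1.0 + ALPHA * r
```

The factorization then approximates the preference matrix, with each cell weighted by its confidence, which is why the raw counts and not the binarized matrix must be the input.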

Re: spark reshape hive table and save to parquet

2016-12-15 Thread Anton Kravchenko
Hi Divya, Thanks, it is exactly what I am looking for! Anton On Wed, Dec 14, 2016 at 6:01 PM, Divya Gehlot wrote: > you can use udfs to do it > http://stackoverflow.com/questions/31615657/how-to-add- > a-new-struct-column-to-a-dataframe > > Hope it will help. > > >

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Sean Owen
No, you can't interpret the output as probabilities at all. In particular, they may be negative. It is not predicting a rating but an interaction. Negative means very strongly not predicted to interact. No, implicit ALS *is* factoring the 0/1 matrix. On Thu, Dec 15, 2016, 23:31 Manish Tripathi

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
OK. So we can kind of interpret the output as probabilities even though it is not modeling probabilities, to be able to use it with the binary classification evaluator. The way I understand it, as per the algorithm, the predicted matrix is basically a dot product of the user factors and item factors

Financial fraud detection using streaming RDBMS data into Spark & Hbase

2016-12-15 Thread Mich Talebzadeh
I am not talking about credit card fraud etc. In complex fraud cases like the one at UBS, the rogue trader manipulated the figures over a period of time. Although there is a lot of talk about using elaborate set-ups to predict

Re: How to control saveAsTable() warehouse path?

2016-12-15 Thread epettijohn
I don't profess to be an expert on this, but I did face the same problem. A couple of possibilities: 1. If your default Hive database is stored in "/tmp/hive/warehouse/...", then that could be the issue. I recommend creating a database on s3a and then storing the table there (

Re: spark on yarn can't load kafka dependency jar

2016-12-15 Thread Mich Talebzadeh
Try this, it should work, and yes, they are comma-separated: spark-streaming-kafka_2.10-1.5.1.jar Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Sean Owen
No, ALS is not modeling probabilities. The outputs are reconstructions of a 0/1 matrix. Most values will be in [0,1], but it's possible to get values outside that range. On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi wrote: > Hi > > ran the ALS model for implicit
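[Editor's note: the range behavior follows directly from the fact that the prediction is just the dot product of a user factor and an item factor, which nothing clamps to [0,1]. A tiny illustration with made-up factor values:]

```python
def predict_preference(user_factor, item_factor):
    """ALS prediction: dot product of latent factors. With implicit
    feedback it approximates a 0/1 preference but is not constrained
    to the [0, 1] interval."""
    return sum(u * i for u, i in zip(user_factor, item_factor))

# Hypothetical learned factors showing both overshoot and a negative value.
high = predict_preference([1.2, 0.5], [1.0, 0.4])    # above 1
neg = predict_preference([1.2, 0.5], [-0.5, 0.2])    # negative
```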

Re: spark on yarn can't load kafka dependency jar

2016-12-15 Thread neil90
Don't the jars need to be comma-separated when you pass them? i.e. --jars "hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar",/opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar
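[Editor's note: a hedged sketch of the submit command with the two jars from this thread joined into one comma-separated --jars argument; the --class name is a hypothetical placeholder, not from the thread:]

```shell
# --jars takes a single comma-separated list; the application jar is
# passed separately as the final positional argument.
spark-submit \
  --master yarn \
  --class com.example.StreamingApp \
  --jars "hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar,hdfs://zzz:8020/jars/spark-streaming-kafka_2.10-1.5.1.jar" \
  /opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar
```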

Re: Spark Batch checkpoint

2016-12-15 Thread Selvam Raman
I am using Java. I will try it and let you know. On Dec 15, 2016 8:45 PM, "Irving Duran" wrote: > Not sure what programming language you are using, but in python you can do > "sc.setCheckpointDir('~/apps/spark-2.0.1-bin-hadoop2.7/checkpoint/')". > This will store checkpoints

Re: Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Mich Talebzadeh
How many tables are involved in the SQL join and how do you cache them? If you do unpersist on the DF(s) and run the same SQL query (in the same session), what do you see with explain? HTH Dr Mich Talebzadeh

Re: When will multiple aggregations be supported in Structured Streaming?

2016-12-15 Thread Michael Armbrust
What is your use case? On Thu, Dec 15, 2016 at 10:43 AM, ljwagerfield wrote: > The current version of Spark (2.0.2) only supports one aggregation per > structured stream (and will throw an exception if multiple aggregations are > applied). > > Roughly when will

Re: [DataFrames] map function - 2.0

2016-12-15 Thread Michael Armbrust
Experimental in Spark really just means that we are not promising binary compatibility for those functions in the 2.x release line. For Datasets in particular, we want a few releases to make sure the APIs don't have any major gaps before removing the experimental tag. On Thu, Dec 15, 2016 at 1:17

Re: Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Michael Armbrust
It's hard to comment on performance without seeing the query plans. I'd suggest posting the result of an explain. On Thu, Dec 15, 2016 at 2:14 PM, Warren Kim wrote: > Playing with TPC-H and comparing performance between cached (serialized > in-memory tables) and

Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
Hi, I ran the ALS model for the implicit feedback setting, then used the model's .transform method to predict the ratings for the original dataset. My dataset is of the form (user, item, rating). I see something like below: predictions.show(5, truncate=False) Why is the last prediction value

Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Warren Kim
Playing with TPC-H and comparing performance between cached (serialized in-memory tables) and uncached (DF from parquet) results in various SQL queries performing much worse, duration-wise. I see some physical plans have an extra layer of shuffle/sort/merge under the cached scenario. I could do

PowerIterationClustering Benchmark

2016-12-15 Thread Lydia Ickler
Hi all, I have a question regarding the PowerIterationClusteringExample. I have adjusted the code so that it reads a file via sc.textFile("path/to/input"), which works fine. Now I wanted to benchmark the algorithm using different numbers of nodes to see how well the implementation scales. As a

Re: Is restarting of SparkContext allowed?

2016-12-15 Thread Marcelo Vanzin
(-dev, +user. dev is for Spark development, not for questions about using Spark.) You haven't posted code here or the actual error. But you might be running into SPARK-15754. Or into other issues with yarn-client mode and "--principal / --keytab" (those have known issues in client mode). If you

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Chawla,Sumit
I am already creating these files on the slaves. How can I create an RDD from these slaves? Regards Sumit Chawla On Thu, Dec 15, 2016 at 11:42 AM, Reynold Xin wrote: > You can just write some files out directly (and idempotently) in your > map/mapPartitions functions. It is

Is there synchronous way to predict against model for real time data

2016-12-15 Thread suyog choudhari
Hi, I have a question about how I can make real-time decisions using a model I have created with Spark ML. 1. I have some data and created a model using it. // Train the model val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run( trainingData) 2. I believe I can use spark

[DataFrames] map function - 2.0

2016-12-15 Thread Ninad Shringarpure
Hi Team, When going through the Dataset class for Spark 2.0, it comes across that both overloaded map functions, with encoder and without, are marked as experimental. Is there a reason, and are there issues that developers should be aware of when using this for production applications? Also, is there a

Re: Spark Batch checkpoint

2016-12-15 Thread Irving Duran
Not sure what programming language you are using, but in Python you can do "sc.setCheckpointDir('~/apps/spark-2.0.1-bin-hadoop2.7/checkpoint/')". This will store checkpoints in that directory, which I called checkpoint. Thank You, Irving Duran On Thu, Dec 15, 2016 at 10:33 AM, Selvam Raman

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Reynold Xin
You can just write some files out directly (and idempotently) in your map/mapPartitions functions. It is just a function that you can run arbitrary code in, after all. On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit wrote: > Any suggestions on this one? > > Regards > Sumit

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Chawla,Sumit
Any suggestions on this one? Regards Sumit Chawla On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit wrote: > Hi All > > I have a workflow with different steps in my program. Lets say these are > steps A, B, C, D. Step B produces some temp files on each executor node. >

When will multiple aggregations be supported in Structured Streaming?

2016-12-15 Thread ljwagerfield
The current version of Spark (2.0.2) only supports one aggregation per structured stream (and will throw an exception if multiple aggregations are applied). Roughly when will Spark support multiple aggregations?

Spark Batch checkpoint

2016-12-15 Thread Selvam Raman
Hi, is there any provision in Spark batch for checkpointing? I have a huge amount of data; it takes more than 3 hours to process all of it. I currently have 100 partitions. If the job fails after two hours, let's say it has processed 70 partitions. Should I start the Spark job from the beginning, or is

Dataset encoders for further types?

2016-12-15 Thread Jakub Dubovsky
Hey, I want to ask whether there is any roadmap/plan for adding Encoders for further types in next releases of Spark. Here is a list of currently supported types. We would like to use Datasets with our internally defined

Re: Few questions on reliability of accumulators value.

2016-12-15 Thread Steve Loughran
On 12 Dec 2016, at 19:57, Daniel Siegmann wrote: Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation:

spark use /tmp directory instead of directory from spark.local.dir

2016-12-15 Thread AlexModestov
Hello! I want to use another directory instead of the /tmp directory for everything... I set spark.local.dir and -Djava.io.tmpdir=/... but I see that Spark still uses /tmp for some data. What does Spark do, and what should I do so that Spark uses only my directories? Thank you!
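[Editor's note: two things worth checking, hedged and not from the thread itself: the -Djava.io.tmpdir option has to reach both the driver and the executor JVMs, and on YARN the node manager's local directories override spark.local.dir entirely. A sketch of spark-defaults.conf entries with hypothetical paths:]

```
spark.local.dir                  /data/spark-scratch
spark.driver.extraJavaOptions    -Djava.io.tmpdir=/data/tmp
spark.executor.extraJavaOptions  -Djava.io.tmpdir=/data/tmp
```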

"remember" vs "window" in Spark Streaming

2016-12-15 Thread Mattz
Hello, Can someone please help me understand the different scenarios when I could use "remember" vs "window" in Spark streaming? Thanks!

Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Bertrand Dechoux
Nice! I am especially interested in Bayesian networks, which are only one of the many models that can be expressed by a factor graph representation. Do you do Bayesian network learning at scale (parameters and structure) with latent variables? Are you using publicly available tools for that?