Re: Catalyst dependency on Spark Core

2014-07-14 Thread Yanbo Liang
Making Catalyst independent of Spark is the goal of Catalyst, but it may need time and evolution. I noticed that the package org.apache.spark.sql.catalyst.util imports org.apache.spark.util.{Utils => SparkUtils}, so Catalyst has a dependency on Spark core. I'm not sure whether it will be replaced by

Re: Dividing tasks among Spark workers

2014-07-18 Thread Yanbo Liang
23:00 GMT+08:00 Shannon Quinn squ...@gatech.edu: The default # of partitions is the # of cores, correct? On 7/18/14, 10:53 AM, Yanbo Liang wrote: Check how many partitions are in your program. If there is only one, changing it to more partitions will make the execution parallel. 2014-07-18 20:57 GMT

Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
If you want to connect to a DB in your program, you can use JdbcRDD ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ) 2014-07-24 18:32 GMT+08:00 Yosi Botzer yosi.bot...@gmail.com: Hi, I am using the Java API of Spark. I wanted to know if there
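A minimal sketch of reading from a database with JdbcRDD, assuming sc is an existing SparkContext and the JDBC driver is on the classpath; the connection URL, table and column names are hypothetical placeholders, and the SQL must contain the two "?" bound-parameter placeholders for the partition range:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val jdbcRdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass"),
      "SELECT id, value FROM my_table WHERE id >= ? AND id <= ?",
      1L, 1000L, 4,                                    // lower bound, upper bound, number of partitions
      rs => (rs.getInt("id"), rs.getString("value")))  // map each JDBC row to a pair
    jdbcRdd.collect().foreach(println)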

Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
record for further processing On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang yanboha...@gmail.com wrote: If you want to connect to a DB in your program, you can use JdbcRDD ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ) 2014-07-24 18:32 GMT+08

Re: Broadcast vs simple variable

2014-08-21 Thread Yanbo Liang
In Spark/MLlib, task serialization of large values such as the cluster centers of k-means was replaced by broadcast variables for performance reasons. You can refer to this PR https://github.com/apache/spark/pull/1427 The current k-means implementation in MLlib also benefits from sparse vector computation.
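As an illustration of the broadcast approach (a minimal sketch, not the actual k-means code), the large read-only value is shipped to each executor once instead of being captured in every task closure; sc is assumed to be an existing SparkContext:

    val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0))
    val bcCenters = sc.broadcast(centers)              // sent to each executor once

    val points = sc.parallelize(Seq(Array(0.1, 0.2), Array(4.9, 5.1)))
    val assignments = points.map { p =>
      // index of the nearest broadcast center (squared Euclidean distance)
      bcCenters.value.zipWithIndex.minBy { case (c, _) =>
        c.zip(p).map { case (a, b) => (a - b) * (a - b) }.sum
      }._2
    }
    assignments.collect().foreach(println)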

Re: Pair RDD

2014-08-26 Thread Yanbo Liang
val node = textFile.map(line => { val fields = line.split("\\s+"); (fields(1), fields(2)) }), then you can manipulate the node RDD with PairRDD functions. 2014-08-26 12:55 GMT+08:00 Deep Pradhan pradhandeep1...@gmail.com: Hi, I have an input file of a graph in the format source_node

Re: How to join two PairRDD together?

2014-08-28 Thread Yanbo Liang
Maybe you can refer to the sliding method of RDD, but right now it's an mllib-private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com: Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the

Re: how to specify columns in groupby

2014-08-28 Thread Yanbo Liang
For your reference: val d1 = textFile.map(line => { val fields = line.split(","); ((fields(0), fields(1)), fields(2).toDouble) }) val d2 = d1.reduceByKey(_+_) d2.foreach(println) 2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, I have an RDD

Re: How to debug this error?

2014-08-29 Thread Yanbo Liang
It's not allowed to use an RDD inside a map function; RDDs can only be operated on at the driver of a Spark program. In your case, the group RDD can't be found on every executor. I guess you want to implement a subquery-like operation; try RDD.intersection() or join() 2014-08-29 12:43 GMT+08:00 Gary Zhao
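A minimal sketch of the point above, with hypothetical RDDs: referencing one RDD inside another RDD's transformation is not allowed, while a driver-side set operation expresses the same intent:

    val left  = sc.parallelize(Seq(1, 2, 3, 4))
    val right = sc.parallelize(Seq(3, 4, 5))

    // Not allowed: `right` would be used inside a transformation of `left`,
    // which runs on executors where `right` is not available.
    // val bad = left.map(x => right.filter(_ == x).count())

    // Allowed: express the subquery-like operation between the two RDDs directly.
    val common = left.intersection(right)   // contains 3 and 4
    common.collect().foreach(println)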

Re: How to save mllib model to hdfs and reload it

2014-09-13 Thread Yanbo Liang
Shixiong, These two snippets behave differently in Scala. In the second snippet, you define a variable named m and evaluate the right-hand side as part of the definition. In other words, the variable is replaced by the pre-computed value of Array(1.0) in the subsequent code. So in the second

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
I also found that https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0 had resolved this issue, but it seems that the corrected code snippet does not appear in master or the 1.1 release. 2014-09-13 17:12 GMT+08:00 Yanbo Liang yanboha...@gmail.com: Hi All, I found

[mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
Hi All, I found that the LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD in master and the 1.1 release. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199 In the above code snippet,

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yanbo Liang
Spark SQL supports SQL and HiveQL, which use SQLContext and HiveContext respectively. As far as I know, the SQLContext of Spark SQL 1.1.0 cannot support a three-table join directly. However, you can rewrite your query with a subquery such as SELECT * FROM (SELECT * FROM youhao_data left join youhao_age on

Re: Recommended ways to pass functions

2014-09-23 Thread Yanbo Liang
Both of these two kinds of functions are OK, but you need to make your class extend Serializable. However, neither way of passing functions reduces the data that will be sent. If you define a function that does not use member fields of a class or object, you can use a val-style definition. For
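A small sketch of the Serializable requirement mentioned above (the class and method names are made up for illustration; sc is an existing SparkContext):

    // The class whose method is shipped inside the closure must be serializable,
    // otherwise task serialization fails at runtime.
    class Helper extends Serializable {
      def double(x: Int): Int = x * 2
    }

    val helper = new Helper
    val doubled = sc.parallelize(1 to 5).map(helper.double)   // `helper` is captured and serialized
    doubled.collect().foreach(println)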

Re: Spark SQL use of alias in where clause

2014-09-24 Thread Yanbo Liang
Maybe it's just the way SQL works. The select part is executed after the where filter is applied, so you cannot use an alias declared in the select part in the where clause. Hive and Oracle behave the same as Spark SQL. 2014-09-25 8:58 GMT+08:00 Du Li l...@yahoo-inc.com.invalid: Hi, The following query
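A sketch of the two usual workarounds, with hypothetical table and column names and assuming sqlContext is an existing SQLContext: repeat the expression in the WHERE clause, or wrap the SELECT in a subquery so the alias becomes visible to the outer filter:

    // repeat the expression instead of using the alias
    sqlContext.sql("SELECT a + b AS total FROM t WHERE a + b > 10")

    // or make the alias visible via a subquery
    sqlContext.sql("SELECT total FROM (SELECT a + b AS total FROM t) tmp WHERE total > 10")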

[MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-28 Thread Yanbo Liang
Hi, We have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Yanbo Liang
by multiplying the weights by a constant. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used

Re: spark sql query optimization , and decision tree building

2014-10-27 Thread Yanbo Liang
If you want to calculate the mean, variance, minimum, maximum and total count for each column, especially for machine learning features, you can try MultivariateOnlineSummarizer. MultivariateOnlineSummarizer implements a numerically stable algorithm to compute sample mean and variance by column in
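A minimal sketch of computing column statistics with MultivariateOnlineSummarizer over an RDD of vectors, assuming sc is an existing SparkContext:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))

    // add() folds one vector into a summarizer, merge() combines partial summaries
    val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
      (s, v) => s.add(v),
      (s1, s2) => s1.merge(s2))

    println(summary.mean)      // per-column mean
    println(summary.variance)  // per-column variance
    println(summary.count)     // total row count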

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Because org.apache.spark.mllib.rdd.RDDFunctions is an mllib-private class, it can only be called by functions within mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch java...@gmail.com: I seem to recall there were some specific requirements on how to import the implicits. Here is the issue:

Re: Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Yanbo Liang
You can refer to comparisons between the different SQL-on-Hadoop solutions such as Hive, Spark SQL, Shark, Impala and so on. There are two main works, which may not be entirely objective, for your reference: Cloudera benchmark:

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Yanbo Liang
Maybe you have a wrong sbt proxy configuration. 2014-10-28 18:27 GMT+08:00 nl19856 hanspeter.sl...@gmail.com: Hi, I have downloaded the binary spark distribution. When building the package with sbt package I get the following: [root@nlvora157 ~]# sbt package [info] Set current project to

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
The number of tasks is decided by the number of input partitions. If you want only one map or flatMap task at once, just call coalesce() or repartition() to bring the data into one partition. However, this is not recommended because it cannot be executed efficiently in parallel. 2014-10-28 17:27 GMT+08:00
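A small sketch of the idea, with a hypothetical input path:

    val rdd = sc.textFile("hdfs:///path/to/input")
    // collapse everything into a single partition so only one map task runs at a time;
    // this trades away parallelism
    val single = rdd.coalesce(1)
    val lengths = single.map(_.length)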

Re: SparkSql OutOfMemoryError

2014-10-28 Thread Yanbo Liang
Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo huozhanf...@gmail.com: Hi,friends: I use spark(spark 1.1) sql operate data in hive-0.12, and the job fails when data is large. So how to tune it ? spark-defaults.conf: spark.shuffle.consolidateFiles true

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
It's not very difficult to implement by properly setting the application's parameters. Some basic knowledge you should know: an application can have only one executor on each machine or container (YARN). So if you just set executor-cores to 1, each executor will run only one task at once. 2014-10-28
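For example, a spark-submit invocation along these lines (the class name, jar and executor count are placeholders) limits each executor to a single concurrent task:

    spark-submit --class com.example.MyApp \
      --master yarn \
      --num-executors 4 \
      --executor-cores 1 \
      myapp.jar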

Re: How many executor process does an application receives?

2014-10-28 Thread Yanbo Liang
An application can have only one executor on each machine or container (YARN). How many threads each executor has is determined by the parameter executor-cores. There is also another parameter-setting method where you can specify total-executor-cores, and each executor's cores will be determined

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.mllib.rdd.RDDFunctions._ It has to do with the implicits. 2014-10-28 2:25 GMT-07:00 Yanbo Liang yanboha...@gmail.com: Because that org.apache.spark.mllib.rdd.RDDFunctions._ is mllib

Re: Use RDD like a Iterator

2014-10-29 Thread Yanbo Liang
RDD.toLocalIterator() is the suitable solution. But I doubt whether it conforms with the design principle of Spark and RDD: all RDD transformations are lazily computed until they end with some action. 2014-10-29 15:28 GMT+08:00 Sean Owen so...@cloudera.com: Call RDD.toLocalIterator()?
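A minimal sketch of iterating an RDD on the driver one partition at a time:

    val rdd = sc.parallelize(1 to 1000, 10)   // 10 partitions
    val it: Iterator[Int] = rdd.toLocalIterator
    // partitions are fetched lazily as the iterator advances,
    // so the whole RDD never has to fit on the driver at once
    it.take(5).foreach(println)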

Re: Using a Database to persist and load data from

2014-10-30 Thread Yanbo Liang
AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD has some restrictions, such as the SQL must contain a where clause. For writing to a DB, you can implement it with mapPartitions or foreachPartition. You can refer to this example:
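A sketch of the foreachPartition approach (the JDBC URL, credentials and table are hypothetical); one connection is opened per partition rather than per record:

    import java.sql.DriverManager

    val rdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    rdd.foreachPartition { records =>
      val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO my_table (id, value) VALUES (?, ?)")
      records.foreach { case (id, value) =>
        stmt.setInt(1, id)
        stmt.setString(2, value)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }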

Re: how to know the Spark worker Mechanism

2014-11-18 Thread Yanbo Liang
Did you set spark.executor.extraLibraryPath to the directory where your native library exists? 2014-11-18 16:13 GMT+08:00 tangweihan tangwei...@360.cn: I'm a newbie in Spark. I know that what the work should do is written in RDD. But I want to make the worker load a native lib and I can do

Re: ClassNotFoundException in standalone mode

2014-11-20 Thread Yanbo Liang
Looks like it cannot find the class or jar on your driver machine. Are you sure that the corresponding jar file exists on the driver machine rather than on your development machine? 2014-11-21 11:16 GMT+08:00 angel2014 angel.alvarez.pas...@gmail.com: Can you make sure the class SimpleApp$$anonfun$1 is

Re: beeline via spark thrift doesn't retain cache

2014-11-21 Thread Yanbo Liang
1) Make sure your beeline client is connected to the HiveServer2 of Spark SQL. You can find the execution logs of HiveServer2 in the environment where start-thriftserver.sh was run. 2) What is the scale of your data? If you cache a small dataset, it will take more time to schedule the workload between different executors. Look

Re: Determine number of running executors

2014-11-21 Thread Yanbo Liang
You can get parameters such as spark.executor.memory, but you cannot get executor or core numbers, because executors and cores are parameters of the Spark deploy environment, not the Spark context. val conf = new SparkConf().set("spark.executor.memory", "2G") val sc = new SparkContext(conf)

Re: Spark saveAsText file size

2014-11-24 Thread Yanbo Liang
The in-memory cache may blow up the size of an RDD; it is a general condition that an RDD will take more space in memory than on disk. There are options to configure and optimize storage space efficiency in Spark; take a look at https://spark.apache.org/docs/latest/tuning.html 2014-11-25 10:38 GMT+08:00

Re: streaming linear regression is not building the model

2014-11-25 Thread Yanbo Liang
Computation is triggered by new files added to the directory. If you place new files in the directory, it will start training the model. 2014-11-11 5:03 GMT+08:00 Bui, Tri tri@verizonwireless.com.invalid: Hi, The model weight is not updating for streaming linear regression. The

Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-25 Thread Yanbo Liang
The case runs correctly in my environment. 14/11/25 17:48:20 INFO regression.StreamingLinearRegressionWithSGD: Model updated at time 141690890 ms 14/11/25 17:48:20 INFO regression.StreamingLinearRegressionWithSGD: Current model: weights, [0.8588] Can you provide more detail

Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-25 Thread Yanbo Liang
) But still get compilation error. Thanks Tri *From:* Yanbo Liang [mailto:yanboha...@gmail.com] *Sent:* Tuesday, November 25, 2014 4:08 AM *To:* Bui, Tri *Cc:* user@spark.apache.org *Subject:* Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

Re: IDF model error

2014-11-25 Thread Yanbo Liang
Hi Shivani, You misunderstand the parameters of SparseVector. class SparseVector( override val size: Int, val indices: Array[Int], val values: Array[Double]) extends Vector { } The first parameter is the total length of the Vector rather than the number of non-zero elements. So it
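For illustration, a length-10 sparse vector with two non-zero entries:

    import org.apache.spark.mllib.linalg.Vectors

    // the first argument is the total vector length, not the number of non-zeros
    val v = Vectors.sparse(10, Array(2, 7), Array(0.5, 1.5))
    println(v.size)   // 10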

Re: k-means clustering

2014-11-25 Thread Yanbo Liang
Pre-processing is a major part of the workload before training a model. MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are essential for preprocessing and are a great help to model training. Take a look at http://spark.apache.org/docs/latest/mllib-feature-extraction.html
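As one example of those preprocessing steps, a minimal StandardScaler sketch, assuming sc is an existing SparkContext:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 100.0),
      Vectors.dense(2.0, 200.0),
      Vectors.dense(3.0, 300.0)))

    // standardize each feature to zero mean and unit variance before training
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
    val scaled = scaler.transform(data)
    scaled.collect().foreach(println)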

Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-26 Thread Yanbo Liang
() [error] ^ [error] two errors found [error] (compile:compile) Compilation failed Thanks Tri *From:* Yanbo Liang [mailto:yanboha...@gmail.com] *Sent:* Tuesday, November 25, 2014 8:57 PM *To:* Bui, Tri *Cc:* user@spark.apache.org *Subject:* Re: Inaccurate Estimate of weights

Re: Is it possible to just change the value of the items in RDD without making a full copy?

2014-12-02 Thread Yanbo Liang
You cannot modify an RDD in mapPartitions because RDDs are immutable. Once you apply transformation functions on RDDs, they produce new RDDs. If you just want to modify only a fraction of the total RDD, try to collect the new value list to the driver or use a broadcast variable after each iteration, not

Re: MLLib in Production

2014-12-10 Thread Yanbo Liang
Hi Klaus, There is no ideal method, but there are some workarounds. Train the model in a Spark or YARN cluster, then use RDD.saveAsTextFile to store the model, which includes the weights and intercept, to HDFS. Load the weights file and intercept file from HDFS, construct a GLM model, and then run model.predict()
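A rough sketch of that workaround, assuming `model` is an already trained mllib LogisticRegressionModel and the HDFS path is hypothetical:

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors

    // persist weights and intercept as a single text record
    sc.parallelize(Seq(model.weights.toArray.mkString(",") + ";" + model.intercept))
      .saveAsTextFile("hdfs:///models/lr")

    // later: load the values and rebuild an equivalent model for prediction
    val Array(w, b) = sc.textFile("hdfs:///models/lr").first().split(";")
    val restored = new LogisticRegressionModel(
      Vectors.dense(w.split(",").map(_.toDouble)), b.toDouble)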

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Yanbo Liang
RDD is immutable so you cannot modify it. If you want to modify some value or the schema in an RDD, use map to generate a new RDD. The following code is for your reference: def add(a: Int, b: Int): Int = { a + b } val d1 = sc.parallelize(1 to 10).map { i => (i, i+1, i+2) } val d2 = d1.map { i => (i._1,

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Yanbo Liang
to do 3, but 1 and 2 elude me. Is there more complete documentation somewhere for the DSL portion? Anyone have a clue about any of the above? On Fri, Dec 12, 2014 at 6:01 AM, Yanbo Liang yanboha...@gmail.com wrote: RDD is immutable so you can not modify it. If you want to modify some value

Re: spark-sql problem with textfile separator

2015-02-19 Thread Yanbo Liang
This is because each line will be separated into 4 columns instead of 3 columns. If you want to use a comma to separate columns, no column is allowed to include commas. 2015-02-19 18:12 GMT+08:00 sparkino francescoboname...@gmail.com: Hello everybody, I'm quite new to

Re: spark-sql problem with textfile separator

2015-02-19 Thread Yanbo Liang
PM, Francesco Bonamente francescoboname...@gmail.com wrote: Hi Yanbo, unfortunately all csv files contain comma inside some columns and I can't change the structure. How can I work with this kind of textfile and spark-sql? Thank you again 2015-02-19 14:38 GMT+01:00 Yanbo Liang

Re: Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread Yanbo Liang
If you use the latest version, Spark 1.3, you can use the DataFrame API like: val results = sqlContext.sql("SELECT name FROM people") results.select("name").show() 2015-03-22 15:40 GMT+08:00 amghost zhengweita...@gmail.com: I would like to retrieve column value from Spark SQL query result. But

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Yanbo Liang
If you run Spark on YARN, the simplest way is to replace the $SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run your application. The spark-submit script will upload this jar to the YARN cluster automatically and then you can run your application as usual. It does not care about

Re: Can it works in load the MatrixFactorizationModel and predict product with Spark Streaming?

2015-06-17 Thread Yanbo Liang
The logs have told you what caused the error: you cannot invoke RDD transformations and actions inside other transformations. You have not done this explicitly, but the implementation of MatrixFactorizationModel.recommendProducts does; you can refer

Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang
The ML package aims to provide Machine Learning pipelines so that users can do machine learning more efficiently. It's more general to chain StringIndexer with any kind of Estimator. I think we can make StringIndexer and the reverse process automatic in the future. If you want to know your original labels, you

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Yanbo Liang
You can use Matrix.toBreeze() https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L56 . 2015-08-20 18:24 GMT+08:00 Naveen nav...@formcept.com: Hi All, Is there anyway to convert a mllib matrix to a Dense Matrix of Breeze? Any leads

Re: TFIDF Transformation

2015-08-04 Thread Yanbo Liang
It cannot translate the number back to the word unless you store the mapping yourself. 2015-07-31 1:45 GMT+08:00 hans ziqiu li thenewh...@gmail.com: Hello spark users! I am having some troubles with the TFIDF in MLlib and was wondering if anyone can point me to the right direction. The

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Yanbo Liang
It looks like the predicted result is just the opposite of what is expected, so could you check whether the labels are right? Or could you share some data that can help reproduce this output? 2015-08-03 19:36 GMT+08:00 Barak Gitsis bar...@similarweb.com: hi, I've run into some poor RF behavior,

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-08-04 Thread Yanbo Liang
The old mllib API will use RandomForest.trainClassifier() to train a RandomForestModel; the new mllib API (AKA ML) will use RandomForestClassifier.train() to train a RandomForestClassificationModel. They will produce the same result for a given dataset. 2015-07-31 1:34 GMT+08:00 Bryan Cutler

Re: Extremely poor predictive performance with RF in mllib

2015-08-06 Thread Yanbo Liang
unless rf somehow switches the labels, it should be correct. I have posted a sample dataset and sample code to reproduce what I'm getting here: https://github.com/pkphlam/spark_rfpredict On Tue, Aug 4, 2015 at 6:42 AM, Yanbo Liang yblia...@gmail.com wrote: It looks like the predicted result

Re: Retrieving Spark Configuration properties

2015-07-16 Thread Yanbo Liang
This is because you did not set the parameter spark.sql.hive.metastore.version. If you check other parameters that you have set, it will work well. Or you can first set this parameter, and then get it. 2015-07-17 11:53 GMT+08:00 RajG rjk...@gmail.com: I am using this version of Spark :

Re: Getting info from DecisionTreeClassificationModel

2015-10-28 Thread Yanbo Liang
AFAIK, you cannot traverse the tree from the rootNode of DecisionTreeClassificationModel, because the type Node does not have information about its children. The type InternalNode has children information, but it's private so users cannot access it. I think the best way to get the probability of each

Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Yanbo Liang
Spark ML/MLlib has provided featureImportances to estimate the importance of each feature. 2015-10-28 18:29 GMT+08:00 Eugen Cepoi : >

Re: spark not launching in yarn-cluster mode

2015-08-25 Thread Yanbo Liang
spark-shell and spark-sql cannot be deployed in yarn-cluster mode, because the spark-shell or spark-sql scripts need to run on your local machine rather than in a container of the YARN cluster. 2015-08-25 16:19 GMT+08:00 Jeetendra Gangele gangele...@gmail.com: Hi All i am trying to launch the

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Yanbo Liang
Actually, brzPi + brzTheta * testData.toBreeze gives the probabilities of the input Vector for each class; however, it's a Breeze Vector. Pay attention that the indices of this Vector need to be mapped to the corresponding label indices. 2015-08-28 20:38 GMT+08:00 Adamantios Corais : >

Re: Problem with repartition/OOM

2015-09-05 Thread Yanbo Liang
The Parquet output writer allocates one block for each table partition it is processing and writes partitions in parallel. It will run out of memory if (number of partitions) times (Parquet block size) is greater than the available memory. You can try to decrease the number of partitions. And

Re: Multilabel classification support

2015-09-11 Thread Yanbo Liang
LogisticRegression in the MLlib (not ML) package supports both multiclass and multilabel classification. 2015-09-11 16:21 GMT+08:00 Alexis Gillain : > You can try these packages for adaboost.mh : > > https://github.com/BaiGang/spark_multiboost (scala) > or >

Re: Stopping criteria for gradient descent

2015-09-29 Thread Yanbo Liang
Hi Nishanth, The diff of solution vectors is compared against a relative or absolute tolerance; you can set convergenceTol, which affects the convergence criterion of SGD. 2015-09-17 8:31 GMT+08:00 Nishanth P S : > Hi, > > I am running LogisticRegressionWithSGD in

Re: RandomForestClassifer does not recognize number of classes, nor can number of classes be set

2015-09-30 Thread Yanbo Liang
Hi Kristina, Currently StringIndexer is a required step before training DecisionTree, RandomForest and GBT related models. Though it is not necessary for other models such as LogisticRegression and NaiveBayes, it is also strongly recommended to perform this preprocessing step, otherwise it may lead

Re: Stopping criteria for gradient descent

2015-09-22 Thread Yanbo Liang
Hi Nishanth, The convergence tolerance is a condition that decides iteration termination. In LogisticRegression with SGD optimization, it depends on the difference of weight vectors; but in GBT it depends on the validation error on the held-out test set. 2015-09-18 4:09 GMT+08:00 nishanthps

Re: Creating BlockMatrix with java API

2015-09-22 Thread Yanbo Liang
This is because the distributed matrices like BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix do not provide Java-friendly constructors. I have filed SPARK-10757 to track this issue. 2015-09-18 3:36 GMT+08:00 Pulasthi Supun

Re: Python API Documentation Mismatch

2015-12-03 Thread Yanbo Liang
Hi Roberto, There are two ALS implementations available: ml.recommendation.ALS and mllib.recommendation.ALS.

Re: Sparse Vector ArrayIndexOutOfBoundsException

2015-12-04 Thread Yanbo Liang
Could you also print the length of featureSet? I suspect it is less than 62. The first argument of Vectors.sparse() is the length of the sparse vector, not the number of non-null elements. Yanbo 2015-12-03 22:30 GMT+08:00 nabegh : > I'm trying to run a SVM classifier on unlabeled

Re: SparkR in Spark 1.5.2 jsonFile Bug Found

2015-12-04 Thread Yanbo Liang
I have created SPARK-12146 to track this issue. 2015-12-04 9:16 GMT+08:00 Felix Cheung : > It looks like this has been broken around Spark 1.5. > > Please see JIRA SPARK-10185. This has been fixed in pyspark but > unfortunately SparkR was missed. I have confirmed this

Re: MLlib training time question

2015-12-05 Thread Yanbo Liang
Hi Haoyue, Could you find the time spent on each stage of the LinearRegression model training at the Spark UI? It can tell us which stage is the most time-consuming and help us to analyze the cause. Yanbo 2015-12-05 15:14 GMT+08:00 Haoyue Wang : > Hi all, > I'm doing some

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Yanbo Liang
in 1.6 version only. > Can you tell me how/when can I download version 1.6? > > Thanks and Regards, > Vishnu Viswanath, > > On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <yblia...@gmail.com> wrote: > >> You can set "handleInvalid" to "skip" w

Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
Hi Asim, The "featureImportances" is only exposed in ML, not MLlib. You need to update your code to use the ML RandomForestClassifier to train and get a RandomForestClassificationModel. Then you can call RandomForestClassificationModel.featureImportances
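A minimal sketch of that flow, assuming `training` is a DataFrame with "label" and "features" columns:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val model = rf.fit(training)
    // a Vector with one importance weight per feature
    println(model.featureImportances)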

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-17 Thread Yanbo Liang
Spark 1.5 officially uses Parquet 1.7.0, but Spark 1.3 uses Parquet 1.6.0. It's better to check which version of Parquet is used in your environment. 2015-12-17 10:26 GMT+08:00 Joseph Bradley : > This method is tested in the Spark 1.5 unit tests, so I'd guess it's a >

Re: Need clarifications in Regression

2015-12-17 Thread Yanbo Liang
Hi Arunkumar, There are two implementations of LinearRegression, one under the ml package and another one

Re: Linear Regression with OLS

2015-12-17 Thread Yanbo Liang
Hi Arunkumar, You can refer to the official example of LinearRegression under the ML package ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala ). If you want to train this LinearRegressionModel with OLS, you

Re: Are there some solution to complete the transform category variables into dummy variable in scala or spark ?

2015-12-17 Thread Yanbo Liang
Hi Minglei, Spark ML provides a transformer named "OneHotEncoder" to map a column of category indices to a column of binary vectors. It's similar to pandas.get_dummies and sklearn's OneHotEncoder, but the output will be a single column of vector type rather than multiple columns. You can refer to the
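A minimal sketch of that encoding, assuming `df` is a DataFrame with a string column named "category":

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // map the string categories to numeric indices first
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val indexed = indexer.fit(df).transform(df)

    // then expand each index into a binary (dummy) vector column
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)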

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-13 Thread Yanbo Liang
Sorry, it was added in 1.5.0. 2015-12-13 2:07 GMT+08:00 Satish <jsatishchan...@gmail.com>: > Hi, > Will the below mentioned snippet work for Spark 1.4.0 > > Thanks for your inputs > > Regards, > Satish > ------ > From: Yanbo Liang <

Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. It's a feature under development; please refer to SPARK-10413. 2015-12-10 4:37 GMT+08:00 Eugene Morozov : > Hello, > > I'm using RandomForest pipeline (ml package). Everything is working fine >

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-12 Thread Yanbo Liang
Hi Satish, You can refer to the following code snippet: df.select(concat(col("String_Column"), lit("00:00:000"))) Yanbo 2015-12-12 16:01 GMT+08:00 satish chandra j : > HI, > I am trying to update a column value in DataFrame, incrementing a column > of integer data type

Re: GLM in apache spark in MLlib

2015-12-10 Thread Yanbo Liang
Hi Arunkumar, LinearRegression, LogisticRegression and AFTSurvivalRegression are kinds of GLMs, and they are already part of MLlib. Actually, GLM in SparkR calls MLlib as the backend execution engine, but only the "gaussian" and "binomial" families are supported currently. MLlib will continue to improve

Re: How to save Multilayer Perceptron Classifier model.

2015-12-13 Thread Yanbo Liang
Hi Vadim, Save/load is not supported for the Multilayer Perceptron model currently; you can track the issue at SPARK-11871. Yanbo 2015-12-14 2:31 GMT+08:00 Vadim Gribanov : > Hey everyone! I’m new with spark and

Re: GLM I'm ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve and support most GLM functions in the next release (Spark 2.0). 2016-01-03 23:02 GMT+08:00 : > keyStoneML could be an alternative. > > Ardo. > > On 03 Jan 2016, at 15:50, Arunkumar Pillai > wrote: > > Is there any road

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
can handle large models. (master should > have more memory because it runs LBFGS) In my experiments, I’ve trained the > models 12M and 32M parameters without issues. > > > > Best regards, Alexander > > > > *From:* Yanbo Liang [mailto:yblia...@gmail.com] > *Sent:* Sunda

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
thanks for info. Is it likely to change in (near :) ) future? Ability to > call this function only on local data (ie not in rdd) seems to be rather > serious limitation. > > cheers, > Tomasz > > On 02.01.2016 09:45, Yanbo Liang wrote: > >> Hi Tomasz, >> >> The GMM is

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any functions to find distinct count of all the variables in > dataframe.

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Yanbo Liang
Hi Jia, You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it produces stable performance. The MEMORY_AND_DISK storage level stores the partitions that don't fit in memory on disk and reads them from there when they are needed. Actually, it's not necessary to set so large

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
Hi Tomasz, The GMM is bound to its peer Java GMM object, so it needs a reference to the SparkContext. Some MLlib (not ML) models are simple objects, such as KMeansModel, LinearRegressionModel etc., but others refer to the SparkContext. The latter ones and their corresponding member functions should not be called

Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto, Could you share your code snippet so that others can help diagnose your problem? 2016-01-02 7:51 GMT+08:00 Roberto Pagliari : > When using the frequent itemsets APIs, I’m running into stackOverflow > exception whenever there are too many combinations to

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is HiveContext. sc <- sparkR.init() sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave below error. I am going to > try the option suggested by Prem .

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
First, extract year, month, day and time from the datetime. Then decide which variables can be treated as category features, such as year/month/day, and encode them into boolean form using OneHotEncoder. Finally, use VectorAssembler to assemble the encoded output vector and the other raw

Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali, The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It has k elements, where k is the number of clusters, and each element is the center of one cluster. Yanbo 2015-12-31 12:52 GMT+08:00 : > Hi, > > I am trying to use kmeans
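For example (a small sketch, assuming sc is an existing SparkContext):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    val model = KMeans.train(data, 2, 20)   // k = 2 clusters, 20 iterations
    // one center per cluster
    model.clusterCenters.foreach(println)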

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
JavaRDD points = lines.map(new ParsePoint()); > > points.persist(StorageLevel.MEMORY_AND_DISK()); > > KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, > KMeans.K_MEANS_PARALLEL()); > > > Thank you very much! > > Best Regards, > Jia > >

Re: does HashingTF maintain a inverse index?

2016-01-01 Thread Yanbo Liang
Hi Andy, Spark ML/MLlib does not currently provide a transformer to map HashingTF-generated features back to words. 2016-01-01 8:37 GMT+08:00 Hayri Volkan Agun : > Hi, > > If you are using pipeline api, you do not need to map features back to > documents. > Your input

Re: NotSerializableException exception while using TypeTag in Scala 2.10

2016-01-01 Thread Yanbo Liang
I also hit this bug, have you resolved this issue? Or could you give some suggestions? 2014-07-28 18:33 GMT+08:00 Aniket Bhatnagar : > I am trying to serialize objects contained in RDDs using runtime > relfection via TypeTag. However, the Spark job keeps > failing

Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer the following code snippet to set numFeatures for HashingTF: val hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("features") .setNumFeatures(n) 2015-10-16 0:17 GMT+08:00 Nick Pentreath : > Setting the numfeatures higher

Re: How to handle categorical variables in Spark MLlib?

2015-12-25 Thread Yanbo Liang
Hi Hokam, You can use OneHotEncoder to encode category variables into a feature vector; Spark ML provides this transformer. To weight an individual category there is no existing method, but you can implement a UDF that multiplies a factor into the specified column of a vector. Yanbo

Re: How to ignore case in dataframe groupby?

2015-12-24 Thread Yanbo Liang
You can use DF.groupBy(upper(col("a"))).agg(sum(col("b"))). DataFrame provides the function "upper" to convert a column to uppercase. 2015-12-24 20:47 GMT+08:00 Eran Witkon : > Use DF.withColumn("upper-code",df("countrycode).toUpper)) > or just run a map function that does the same

Re: Retrieving the PCA parameters in pyspark

2015-12-25 Thread Yanbo Liang
Hi Rohit, This is a known bug, but you can get these parameters if you use the Scala version. Yanbo 2015-12-03 0:36 GMT+08:00 Rohit Girdhar : > Hi > > I'm using PCA through the python interface for spark, as per the > instructions on this page: >

Re: SparkML algos limitations question.

2015-12-27 Thread Yanbo Liang
Hi Eugene, AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as >10M parameters), although I think it already covers many use cases within this limitation. Yanbo 2015-12-16 6:00 GMT+08:00 Joseph Bradley : > Hi

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Yanbo Liang
Load csv file: df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true") Calculate covariance: cov <- cov(df, "col1", "col2") Cheers Yanbo 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>: > hi all, > I want to use sparkR or spark MLlib load csv

Re: Creating vectors from a dataframe

2015-12-20 Thread Yanbo Liang
Hi Arunkumar, If you want to create a vector from multiple columns of a DataFrame, Spark ML provides VectorAssembler to help. Yanbo 2015-12-21 13:44 GMT+08:00 Arunkumar Pillai : > Hi > > > I'm trying to use Linear Regression from ml library > > but the problem is the
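A short sketch, assuming `df` has numeric columns "x1", "x2" and "x3":

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("features")

    // adds a "features" vector column assembled from the three inputs
    val assembled = assembler.transform(df)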
