Re: how to retain part of the features in LogisticRegressionModel (spark2.0)
Thanks Dhanesh. And how about the features question?

> On 19 Mar 2017, at 19:08, Dhanesh Padmanabhan wrote:
>
> Dhanesh

Thanks,
lujinhong
Re: how to retain part of the features in LogisticRegressionModel (spark2.0)
By the way, I found that in Spark 2.1 I can use setFamily() to choose between binomial and multinomial, but how can I do the same thing in Spark 2.0.2? If it is not supported, which one is used in Spark 2.0.2, binomial or multinomial?

> On 19 Mar 2017, at 18:12, jinhong lu wrote:
>
> I train my LogisticRegressionModel like this, and I want my model to retain only some of the features (e.g. 500 of them), not all of them. What should I do? I use .setElasticNetParam(1.0), but all the features are still in lrModel.coefficients.
>
> import org.apache.spark.ml.classification.LogisticRegression
>
> val data = spark.read.format("libsvm").option("numFeatures","").load("/tmp/data/training_data3")
> val Array(trainingData, testData) = data.randomSplit(Array(0.5, 0.5), seed = 1234L)
>
> val lr = new LogisticRegression()
> val lrModel = lr.fit(trainingData)
> println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
>
> val predictions = lrModel.transform(testData)
> predictions.show()
>
> Thanks,
> lujinhong

Thanks,
lujinhong
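For what it's worth, a minimal sketch of the two APIs side by side. The 2.0.x fallback below assumes the RDD-based spark.mllib API, which is the usual route to multinomial models before 2.1; spark.ml's LogisticRegression in 2.0.x trains binomial models only, and setFamily() was added in 2.1:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Spark 2.1+: pick the family explicitly (this setter does not exist in 2.0.x).
val lr = new LogisticRegression()
  .setFamily("multinomial") // or "binomial", or the default "auto"

// Spark 2.0.x: multinomial logistic regression via the RDD-based API.
val mlr = new LogisticRegressionWithLBFGS()
  .setNumClasses(3) // hypothetical class count; 2 gives the binomial case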
how to retain part of the features in LogisticRegressionModel (spark2.0)
I train my LogisticRegressionModel like this, and I want my model to retain only some of the features (e.g. 500 of them), not all of them. What should I do? I use .setElasticNetParam(1.0), but all the features are still in lrModel.coefficients.

import org.apache.spark.ml.classification.LogisticRegression

val data = spark.read.format("libsvm").option("numFeatures","").load("/tmp/data/training_data3")
val Array(trainingData, testData) = data.randomSplit(Array(0.5, 0.5), seed = 1234L)

val lr = new LogisticRegression()
val lrModel = lr.fit(trainingData)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

val predictions = lrModel.transform(testData)
predictions.show()

Thanks,
lujinhong
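A minimal sketch of the usual way to get actual sparsity here, assuming the spark.ml API from the message above: .setElasticNetParam(1.0) only selects a pure L1 penalty, and with the default regParam of 0.0 no penalty is applied at all, so every coefficient stays nonzero. A positive regParam yields a sparse coefficient vector; the 0.1 below is an illustrative value, to be tuned until roughly the desired number of features survives:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setElasticNetParam(1.0) // pure L1 (lasso) penalty
  .setRegParam(0.1)        // illustrative strength; larger values zero out more coefficients
val lrModel = lr.fit(trainingData)

// The features the model "retains" are the ones with nonzero coefficients.
val sparseCoef = lrModel.coefficients.toSparse
println(s"retained ${sparseCoef.indices.length} of ${sparseCoef.size} features")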
Re: how to construct parameter for model.transform() from datafile
Anyone help?

> On 13 Mar 2017, at 19:38, jinhong lu wrote:
> […]
Re: how to construct parameter for model.transform() from datafile
After training the model, I got a result that looks like this:

scala> predictionResult.show()
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
|  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
+-----+--------------------+--------------------+--------------------+----------+

And then I transform() the data with this code:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import scala.collection.mutable

def lineToVector(line: String): Vector = {
  val seq = new mutable.Queue[(Int, Double)]
  val content = line.split(" ")
  for (s <- content) {
    val index = s.split(":")(0).toInt
    val value = s.split(":")(1).toDouble
    seq += ((index, value))
  }
  return Vectors.sparse(144109, seq)
}

val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
  .map(line => line._2)
  .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
  .toDF("udid", "features")
val predictionResult = model.transform(df)
predictionResult.show()

But I got an error that looks like this:

Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
  at lineToVector(:55)
  at $anonfun$4.apply(:50)
  at $anonfun$4.apply(:50)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)

So I changed

  return Vectors.sparse(144109, seq)

to

  return Vectors.sparse(804202, seq)

and another error occurs:

Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
  at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
  at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)

What should I do?
> On 13 Mar 2017, at 16:31, jinhong lu wrote:
>
> Hi, all:
>
> I got these training data:
>
> 0 31607:17
> 0 111905:36
> 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> 0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> 0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> 0 19109:7 29705:4 123305:32
> 0 15309:1 43005:1 108509:1
> 1 604:1 6401:1 6503:1 15207:4 31607:40
> 0 1807:19
> 0 301:14 501:1 1502:14 2507:12 123305:4
> 0 607:14 19109:460 123305:448
> 0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> 1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
>
> And then I train the model with Spark:
>
> import org.apache.spark.ml.classification.NaiveBayes
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder.appName("NaiveBayesExample") […]
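The quoted message breaks off above; a sketch of how such a training run typically continues, with all paths and split ratios illustrative rather than taken from the original thread:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()

// Load the libsvm-style data shown above; the path is hypothetical.
val data = spark.read.format("libsvm").load("/tmp/data/training_data")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

val model = new NaiveBayes().fit(trainingData)
val predictions = model.transform(testData)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test set accuracy = $accuracy")

As for the two errors in the message before this one: the model's feature space is fixed at training time (144109 columns in the coefficient matrix), so every vector passed to transform() must have exactly that size; enlarging the vector to 804202 just moves the failure from vector construction into the matrix-vector multiply. A sketch of one way to build conforming vectors, assuming that features the model never saw can safely be dropped (they have no learned weights), and that the training data was loaded with Spark's libsvm reader, which converts the file's 1-based indices to 0-based:

import org.apache.spark.ml.linalg.{Vector, Vectors}

val numModelFeatures = 144109 // must equal the dimension the model was trained with

def lineToVector(line: String): Vector = {
  val pairs = line.split(" ").map { s =>
    val Array(i, v) = s.split(":")
    (i.toInt - 1, v.toDouble) // shift 1-based libsvm indices to 0-based
  }
  // Drop indices outside the model's feature space instead of failing on them.
  val inRange = pairs.filter { case (i, _) => i >= 0 && i < numModelFeatures }
  Vectors.sparse(numModelFeatures, inRange)
}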
mllib based on dataset or dataframe
Hi,

Since Dataset will be the major API in Spark 2.0, why will MLlib be DataFrame-based, with "future development will focus on the DataFrame-based API"? Is there any plan to change MLlib from DataFrame-based to Dataset-based?

Thanks,
lujinhong

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
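One fact that partly answers this, independent of any roadmap: in the Scala API of Spark 2.x, DataFrame is just a type alias for Dataset[Row], so the DataFrame-based spark.ml API already sits on top of Dataset. A tiny sketch to make the aliasing concrete:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In org.apache.spark.sql (Spark 2.x): type DataFrame = Dataset[Row]
def describe(df: DataFrame): Unit = {
  val ds: Dataset[Row] = df // compiles because the two types are identical
  println(ds.schema.treeString)
}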