Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-19 Thread jinhong lu
Thanks, Dhanesh. And how about the features question?

> On 19 Mar 2017, at 19:08, Dhanesh Padmanabhan wrote:
> 
> Dhanesh

Thanks,
lujinhong



Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-19 Thread jinhong lu
By the way, I found that in Spark 2.1 I can use setFamily() to choose between binomial and
multinomial, but how can I do the same thing in Spark 2.0.2?
If it is not supported, which one is used in Spark 2.0.2, binomial or multinomial?
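
For reference, a minimal sketch of the Spark 2.1 API mentioned above (the surrounding setup is assumed; as far as I know, ml.LogisticRegression in Spark 2.0.x only supports the binomial case and has no setFamily()):

  import org.apache.spark.ml.classification.LogisticRegression

  // Spark 2.1+: pick the family explicitly; valid values are "auto", "binomial", "multinomial".
  val lr = new LogisticRegression()
    .setFamily("multinomial")

  // Spark 2.0.x has no setFamily(); its ml.LogisticRegression is binary (binomial) only.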

> On 19 Mar 2017, at 18:12, jinhong lu wrote:
> 
> 
> I train my LogisticRegressionModel like this, and I want the model to retain only
> some of the features (e.g. 500 of them), not all of them. What should I do?
> I use .setElasticNetParam(1.0), but all the features are still in
> lrModel.coefficients.
> 
> import org.apache.spark.ml.classification.LogisticRegression
> val data = spark.read.format("libsvm").option("numFeatures","").load("/tmp/data/training_data3")
>
> val Array(trainingData, testData) = data.randomSplit(Array(0.5, 0.5), seed = 1234L)
>
> val lr = new LogisticRegression()
> val lrModel = lr.fit(trainingData)
> println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
> 
> val predictions = lrModel.transform(testData)
> predictions.show()
> 
> 
> Thanks, 
> lujinhong
> 

Thanks,
lujinhong



how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-19 Thread jinhong lu

I train my LogisticRegressionModel like this, and I want the model to retain only
some of the features (e.g. 500 of them), not all of them. What should I do?
I use .setElasticNetParam(1.0), but all the features are still in
lrModel.coefficients.

  import org.apache.spark.ml.classification.LogisticRegression
  val data = spark.read.format("libsvm").option("numFeatures","").load("/tmp/data/training_data3")

  val Array(trainingData, testData) = data.randomSplit(Array(0.5, 0.5), seed = 1234L)

  val lr = new LogisticRegression()
  val lrModel = lr.fit(trainingData)
  println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

  val predictions = lrModel.transform(testData)
  predictions.show()
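
As far as I can tell, setElasticNetParam(1.0) only selects the L1 penalty; with the default regParam of 0.0 there is no regularization at all, so nothing is shrunk to zero. A minimal sketch (the regParam value below is a made-up starting point to tune, not a recommendation):

  import org.apache.spark.ml.classification.LogisticRegression

  // elasticNetParam = 1.0 -> pure L1 (lasso); regParam sets the penalty strength.
  // Larger regParam drives more coefficients to exactly 0.
  val lr = new LogisticRegression()
    .setElasticNetParam(1.0)
    .setRegParam(0.1)  // hypothetical value; increase it until roughly 500 features remain

  val lrModel = lr.fit(trainingData)

  // Count the features that survived the penalty.
  val nonZero = lrModel.coefficients.toArray.count(_ != 0.0)
  println(s"non-zero coefficients: $nonZero")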


Thanks, 
lujinhong



Re: how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
Can anyone help?

> On 13 Mar 2017, at 19:38, jinhong lu wrote:
> 
> After training the model, I got results that look like this:
> 
> 
>   scala> predictionResult.show()
>
>   +-----+--------------------+--------------------+--------------------+----------+
>   |label|            features|       rawPrediction|         probability|prediction|
>   +-----+--------------------+--------------------+--------------------+----------+
>   |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
>   |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
>   |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
> 
> And then I transform() the data with this code:
> 
>   import org.apache.spark.ml.linalg.Vectors
>   import org.apache.spark.ml.linalg.Vector
>   import scala.collection.mutable
> 
> def lineToVector(line: String): Vector = {
>   val seq = new mutable.Queue[(Int, Double)]
>   val content = line.split(" ")
>   for (s <- content) {
>     val index = s.split(":")(0).toInt
>     val value = s.split(":")(1).toDouble
>     seq += ((index, value))
>   }
>   return Vectors.sparse(144109, seq)
> }
> 
> import spark.implicits._  // needed for .toDF on an RDD
> val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
>   .map(line => line._2)
>   .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
>   .toDF("udid", "features")
> val predictionResult = model.transform(df)
>predictionResult.show()
> 
> 
> But I got an error that looks like this:
> 
> Caused by: java.lang.IllegalArgumentException: requirement failed: You may 
> not write an element to index 804201 because the declared size of your vector 
> is 144109
>  at scala.Predef$.require(Predef.scala:224)
>  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
>  at lineToVector(:55)
>  at $anonfun$4.apply(:50)
>  at $anonfun$4.apply(:50)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> 
> So I change
> 
>   return Vectors.sparse(144109, seq)
> 
> to 
> 
>   return Vectors.sparse(804202, seq)
> 
> Another error occurs:
> 
>   Caused by: java.lang.IllegalArgumentException: requirement failed: The 
> columns of A don't match the number of elements of x. A: 144109, x: 804202
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> at 
> org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
> 
> what should I do?
>> On 13 Mar 2017, at 16:31, jinhong lu wrote:
>> 
>> Hi, all:
>> 
>> I got these training data:
>> 
>>  0 31607:17
>>  0 111905:36
>>  0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 
>> 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 
>> 112109:4 123305:48 142509:1
>>  0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>>  0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 
>> 31607:19
>>  0 19109:7 29705:4 123305:32
>>  0 15309:1 43005:1 108509:1
>>  1 604:1 6401:1 6503:1 15207:4 31607:40
>>  0 1807:19
>>  0 301:14 501:1 1502:14 2507:12 123305:4
>>  0 607:14 19109:460 123305:448
>>  0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 
>> 128209:1
>>  1 1606:1 2306:3 3905:19 4408:3

Re: how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
After training the model, I got results that look like this:


scala> predictionResult.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
|  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|

And then I transform() the data with this code:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import scala.collection.mutable

def lineToVector(line: String): Vector = {
  val seq = new mutable.Queue[(Int, Double)]
  val content = line.split(" ")
  for (s <- content) {
    val index = s.split(":")(0).toInt
    val value = s.split(":")(1).toDouble
    seq += ((index, value))
  }
  return Vectors.sparse(144109, seq)
}

import spark.implicits._  // needed for .toDF on an RDD
val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
  .map(line => line._2)
  .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
  .toDF("udid", "features")
val predictionResult = model.transform(df)
predictionResult.show()


But I got an error that looks like this:

 Caused by: java.lang.IllegalArgumentException: requirement failed: You may not 
write an element to index 804201 because the declared size of your vector is 
144109
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
  at lineToVector(:55)
  at $anonfun$4.apply(:50)
  at $anonfun$4.apply(:50)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)

So I change

return Vectors.sparse(144109, seq)

to 

return Vectors.sparse(804202, seq)

Another error occurs:

Caused by: java.lang.IllegalArgumentException: requirement failed: The 
columns of A don't match the number of elements of x. A: 144109, x: 804202
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
  at 
org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
  at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)

what should I do?
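
Both failures come from the same mismatch: the model was trained on vectors of size 144109, so transform() expects exactly that size, and feature index 804201 cannot appear in either case. A defensive variant of lineToVector is sketched below, under the assumption that out-of-range indices can simply be dropped; the cleaner fix is to index the scoring data with the same feature space (same numFeatures) used at training time.

  import org.apache.spark.ml.linalg.{Vector, Vectors}

  // Must match the dimensionality the model was trained with.
  val numFeatures = 144109

  def lineToVector(line: String): Vector = {
    val pairs = line.split(" ").map { s =>
      val Array(index, value) = s.split(":")  // input format is "index:value"
      (index.toInt, value.toDouble)
    }
    // Drop indices outside the trained feature space instead of failing;
    // ideally the data would be re-indexed consistently with the training set.
    Vectors.sparse(numFeatures, pairs.filter(_._1 < numFeatures).toSeq)
  }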
> On 13 Mar 2017, at 16:31, jinhong lu wrote:
> 
> Hi, all:
> 
> I got these training data:
> 
>   0 31607:17
>   0 111905:36
>   0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 
> 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 
> 112109:4 123305:48 142509:1
>   0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>   0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 
> 31607:19
>   0 19109:7 29705:4 123305:32
>   0 15309:1 43005:1 108509:1
>   1 604:1 6401:1 6503:1 15207:4 31607:40
>   0 1807:19
>   0 301:14 501:1 1502:14 2507:12 123305:4
>   0 607:14 19109:460 123305:448
>   0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 
> 128209:1
>   1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 
> 27709:2 56509:8 122705:62 123305:31 124005:2
> 
> And then I train the model with Spark:
> 
>   import org.apache.spark.ml.classification.NaiveBayes
>   import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>   import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>   import org.apache.spark.sql.SparkSession
> 
>   val spark = 
> SparkSession.builder.appName("NaiveBayesExampl
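
(The quoted message is cut off above. For reference, a minimal sketch of how the imports listed there are typically wired together; the application name, input path, and 50/50 split below are assumptions mirroring the earlier snippets, not the original code.)

  import org.apache.spark.ml.classification.NaiveBayes
  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  import org.apache.spark.sql.SparkSession

  // Hypothetical app name and path; adjust to the real job.
  val spark = SparkSession.builder.appName("naive-bayes-sketch").getOrCreate()
  val data = spark.read.format("libsvm").load("/tmp/data/training_data3")
  val Array(trainingData, testData) = data.randomSplit(Array(0.5, 0.5), seed = 1234L)

  val model = new NaiveBayes().fit(trainingData)
  val predictions = model.transform(testData)

  // Overall accuracy on the held-out half.
  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("accuracy")
  println(s"accuracy = ${evaluator.evaluate(predictions)}")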

mllib based on dataset or dataframe

2016-07-10 Thread jinhong lu
Hi,
Since Dataset will be the major API in Spark 2.0, why will MLlib be DataFrame-based, with
'future development will focus on the DataFrame-based API'?

   Is there any plan to change MLlib from DataFrame-based to Dataset-based?
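
For what it's worth, in Spark 2.0 DataFrame is defined as a type alias for Dataset[Row], so the DataFrame-based ML API is already built on top of Dataset. A small illustration (the helper below is just an example, not Spark API):

  import org.apache.spark.sql.{DataFrame, Dataset, Row}

  // In Spark 2.x the alias is roughly: type DataFrame = Dataset[Row],
  // so anything that takes a Dataset[Row] accepts a DataFrame and vice versa.
  def printColumns(ds: Dataset[Row]): Unit = ds.columns.foreach(println)

  // A DataFrame can be passed where Dataset[Row] is expected:
  // val df: DataFrame = spark.read.format("libsvm").load("/tmp/data/training_data3")
  // printColumns(df)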


=
Thanks,
lujinhong

