It was a bug in the code; fixing that and adding the step parameter got the results
to work.  Mean Squared Error = 2.610379825794694E-5

I've also opened a JIRA to add the step parameter to the examples, so that
people new to MLlib have a way to improve the MSE.

https://issues.apache.org/jira/browse/SPARK-5273
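For anyone hitting the same NaN: the divergence is easy to reproduce outside Spark. Below is a minimal NumPy sketch (not MLlib code; it uses full-batch gradient descent rather than SGD, and all names are illustrative) of one-feature linear regression on the same y = x data, showing why a step size of 1.0 blows up on unscaled features while a tiny step converges:

```python
import numpy as np

# Gradient descent for one-feature linear regression without an intercept.
def gd_linreg(x, y, step, iters=100):
    w = 0.0
    n = len(x)
    for _ in range(iters):
        # gradient of mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w
        grad = (2.0 / n) * np.sum((w * x - y) * x)
        w -= step * grad
    return w

# The same perfectly linear data as in the thread: y = x, x = 1..10000.
x = np.arange(1.0, 10001.0)
y = x.copy()

# With unscaled features the gradient is on the order of 1e7, so a step
# of 1.0 overshoots further on every iteration and the weight diverges
# to NaN -- the same NaN MSE seen in the thread.
w_big = gd_linreg(x, y, step=1.0)      # diverges (NaN)

# A tiny step converges toward the true weight w = 1.
w_small = gd_linreg(x, y, step=1e-8)
```

Scaling the features has the same effect as shrinking the step: it brings the gradient back down to a magnitude that a step size near 1.0 can handle.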

On Thu, Jan 15, 2015 at 8:23 PM, Joseph Bradley <jos...@databricks.com>
wrote:

> It looks like you're training on the non-scaled data but testing on the
> scaled data.  Have you tried this training & testing on only the scaled
> data?
>
> On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel <devl.developm...@gmail.com>
> wrote:
>
>> Thanks, that helps a bit, at least with the NaN, but the MSE is still very
>> high even with that step size and 10k iterations:
>>
>> training Mean Squared Error = 3.3322561285919316E7
>>
>> Does this method need, say, 100k iterations?
>>
>> On Thu, Jan 15, 2015 at 5:42 PM, Robin East <robin.e...@xense.co.uk>
>> wrote:
>>
>> > -dev, +user
>> >
>> > You’ll need to set the gradient descent step size to something small - a
>> > bit of trial and error shows that 0.00000001 works.
>> >
>> > You’ll need to create a LinearRegressionWithSGD instance and set the
>> > step size explicitly:
>> >
>> > val lr = new LinearRegressionWithSGD()
>> > lr.optimizer.setStepSize(0.00000001)
>> > lr.optimizer.setNumIterations(100)
>> > val model = lr.run(parsedData)
>> >
>> > On 15 Jan 2015, at 16:46, devl.development <devl.developm...@gmail.com>
>> > wrote:
>> >
>> > From what I gather, you use LinearRegressionWithSGD to predict y or the
>> > response variable given a feature vector x.
>> >
>> > In a simple example I used a perfectly linear dataset such that x=y
>> > y,x
>> > 1,1
>> > 2,2
>> > ...
>> >
>> > 10000,10000
>> >
>> > Using the out-of-the-box example from the website (with and without
>> > scaling):
>> >
>> > val data = sc.textFile(file)
>> >
>> >    val parsedData = data.map { line =>
>> >      val parts = line.split(',')
>> >      LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
>> >    }
>> >    val scaler = new StandardScaler(withMean = true, withStd = true)
>> >      .fit(parsedData.map(x => x.features))
>> >    val scaledData = parsedData
>> >      .map(x =>
>> >      LabeledPoint(x.label,
>> >        scaler.transform(Vectors.dense(x.features.toArray))))
>> >
>> >    // Building the model
>> >    val numIterations = 100
>> >    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>> >
>> >    // Evaluate model on training examples and compute training error
>> >    // (tried using both scaledData and parsedData)
>> >    val valuesAndPreds = scaledData.map { point =>
>> >      val prediction = model.predict(point.features)
>> >      (point.label, prediction)
>> >    }
>> >    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
>> >    println("training Mean Squared Error = " + MSE)
>> >
>> > Both scaled and unscaled attempts give:
>> >
>> > training Mean Squared Error = NaN
>> >
>> > I've even tried x, y + (sample noise from a normal with mean 0 and stddev
>> > 1); it still comes up with the same thing.
>> >
>> > Is this not supposed to work for x and y or 2-dimensional plots? Is there
>> > something I'm missing or wrong in the code above? Or is there a limitation
>> > in the method?
>> >
>> > Thanks for any advice.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
