[jira] [Created] (FLINK-9664) FlinkML Quickstart Loading Data section example doesn't work as described

2018-06-26 Thread Mano Swerts (JIRA)
Mano Swerts created FLINK-9664:
--

 Summary: FlinkML Quickstart Loading Data section example doesn't 
work as described
 Key: FLINK-9664
 URL: https://issues.apache.org/jira/browse/FLINK-9664
 Project: Flink
  Issue Type: Bug
  Components: Documentation, Machine Learning Library
Affects Versions: 1.5.0
Reporter: Mano Swerts


The ML documentation example isn't complete: 
[https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data]

The referenced section loads data from an astroparticle binary classification 
dataset to showcase SVM. The dataset uses 0 and 1 as labels, which doesn't 
produce correct results: the SVM predictor expects labels of -1 and 1 to 
predict correctly. The documentation, however, doesn't mention that, so the 
example doesn't work and gives no clue as to why.

The documentation should be updated with an explicit mention of the -1 and 1 
labels and a mapping function that shows the conversion of the labels.
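A minimal sketch of what such a mapping function could look like (the helper 
name `toSvmLabel` is hypothetical, and the commented FlinkML usage assumes the 
`LabeledVector(label, vector)` case class used in the quickstart):

```scala
// Hypothetical helper: remap the dataset's 0/1 labels to the -1/+1
// convention that FlinkML's SVM expects for binary classification.
def toSvmLabel(label: Double): Double =
  if (label == 0.0) -1.0 else 1.0

// In the quickstart this would be applied right after loading, e.g.:
//   val training = MLUtils.readLibSVM(env, trainingDataPath)
//     .map(lv => LabeledVector(toSvmLabel(lv.label), lv.vector))
```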



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: FlinkML SVM Predictions are always 1.0

2018-06-25 Thread Mano Swerts
Hi all,

This is just getting stranger… After playing around for a while, it seems that if 
I have a vector whose values are all 0 (i.e. all zeros) it is classified as -1.0. 
Any other value in the vector causes it to be classified as 1.0:


=== Predictions

(DenseVector(0.0, 0.0, 0.0),-1.0)
(DenseVector(0.0, 0.5, 0.0),1.0)
(DenseVector(1.0, 1.0, 1.0),1.0)
(DenseVector(0.0, 0.0, 0.0),-1.0)
(DenseVector(0.0, 0.5, 1.0),1.0)

So it seems that my values need to be binary for this prediction to work, which 
of course does not make sense and doesn’t match the data from the example on 
the Flink website. It gives me the impression that it is using the vector as 
the label instead of the value…
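One plausible reading of this behaviour, assuming the predictor thresholds a raw 
linear score w·x at zero (an assumption about the decision rule, not something 
confirmed here): the all-zero vector always scores w·0 = 0, which lands on the 
-1.0 side regardless of the learned weights, while any vector with a positive 
score gives 1.0. A sketch of that assumed rule:

```scala
// Sketch of a sign-threshold decision rule (assumed, not taken from
// FlinkML's source): the raw score w·x is mapped to a ±1 label.
def dot(w: Seq[Double], x: Seq[Double]): Double =
  w.zip(x).map { case (wi, xi) => wi * xi }.sum

def predictLabel(w: Seq[Double], x: Seq[Double]): Double =
  if (dot(w, x) > 0) 1.0 else -1.0

// The all-zero vector scores exactly 0, so it is always labeled -1.0;
// with mostly positive weights, any other non-negative vector scores > 0.
```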

Any insights?

— Mano


On 23 Jun 2018, at 20:50, Rong Rong wrote:

Hi Mano,

For the always positive prediction result. I think the standard svmguide
data [1] is labeling data as 0.0 and 1.0 instead of -1.0 and +1.0. Maybe
correcting that should work for your case.
For the change of eval pairs, I think SVM in FlinkML will always return
a +1.0 or -1.0 when you use it this way as a binary classification.

Thanks,
Rong

[1] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1
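Rong's point about the labels can be illustrated with the hinge loss a linear 
SVM minimizes, max(0, 1 − y·f(x)): with y = 0.0 the loss is a constant 1 for 
every example, so 0-labeled examples contribute no useful gradient and the model 
never learns a negative class. A minimal sketch (standard hinge loss as a stand-in, 
not FlinkML's actual implementation):

```scala
// Standard hinge loss for a linear SVM: max(0, 1 - y * margin), where
// margin = w·x. With y = 0.0 the loss is 1.0 regardless of the margin,
// so 0-labeled examples never push the decision boundary anywhere.
def hingeLoss(y: Double, margin: Double): Double =
  math.max(0.0, 1.0 - y * margin)
```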


Re: FlinkML SVM Predictions are always 1.0

2018-06-25 Thread Mano Swerts
Hi Rong,

As you can see in my test data example, I did change the labeling data to 8 and 
16 instead of 1 and 0.

If SVM always returns +1.0 or -1.0, that would then indeed explain where the 
1.0 is coming from. But, it never gives me -1.0, so there is still something 
wrong as it classifies everything under the same label.

Thanks.

— Mano



FlinkML SVM Predictions are always 1.0

2018-06-22 Thread Mano Swerts
Hi guys,

Here I am again. I am playing with Flink ML and was trying to get the example 
from the documentation to work: 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data
 (the one using the astroparticle LibSVM data).

My code is basically what you see in the example, with some more output for 
verification:


object LearnDocumentEntityRelationship {

  val trainingDataPath = "/data/svmguide1.training.txt"
  val testDataPath = "/data/svmguide1.test.txt"

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, trainingDataPath)

    println("")
    println("=== Training Data")
    println("")
    trainingData.print()

    val testData = MLUtils.readLibSVM(env, testDataPath).map(x => (x.vector, x.label))

    println("")
    println("=== Test Data")
    println("")
    testData.print()

    val svm = SVM()
      .setBlocks(env.getParallelism)
      .setIterations(100)
      .setRegularization(0.001)
      .setStepsize(0.1)
      .setSeed(42)

    svm.fit(trainingData)

    val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testData)

    println("")
    println("=== Evaluation Pairs")
    println("")
    evaluationPairs.print()

    val realData = MLUtils.readLibSVM(env, testDataPath).map(x => x.vector)

    var predictionDS = svm.predict(realData)

    println("")
    println("=== Predictions")
    println("")
    predictionDS.print()

    println("=== End")

    env.execute("Learn Document Entity Relationship Job")
  }
}


The issue is that the predictions (from both the evaluation pairs and the 
prediction dataset) are always equal to “1.0”. When I changed the labels in the 
data files to 16 and 8 (so 1 is no longer a valid label) it still keeps 
predicting “1.0” for every single record. I also tried some other custom 
datasets, but I always get the same result.

This is a concise part of the output (as the data contains too many records to 
put here):


=== Test Data

(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),8.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),8.0)


=== Evaluation Pairs

(16.0,1.0)
(16.0,1.0)
(8.0,1.0)
(8.0,1.0)


=== Predictions

(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),1.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),1.0)


Am I doing something wrong?

Any pointers are greatly appreciated. Thanks!

— Mano


Re: Running a Scala Job doesn't produce print output

2018-06-21 Thread Mano Swerts
Hi guys,

I am going to answer my own question ;) I looked at a Scala example in the 
Flink GitHub repo, which uses ExecutionEnvironment.getExecutionEnvironment to 
obtain the environment. That apparently doesn’t work.

When I change this to StreamExecutionEnvironment.getExecutionEnvironment, as 
used in the Flink Maven archetype, it works fine.

I don’t know whether this is a bug or the example needs updating. At least now 
this has been recorded for others struggling with the same issue in the future.

— Mano




Running a Scala Job doesn't produce print output

2018-06-21 Thread Mano Swerts
Hi guys,

I have a question. I have been playing around with Flink this week and created 
some basic Java jobs that work fine. Now I am trying to run one in Scala.

Running this code in the Scala REPL prints the expected output:

env.fromElements(1, 2, 3).map(i => " Integer: " + i).print()

However, having it packaged in a JAR which I then deploy through the user 
interface doesn’t give me any output at all. I can start the job and it 
finishes without exceptions, but I don’t see the result of the print() 
statement in the log. The class looks like this:


package com.ixxus.playground.fmk.flink

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._

object LearnDocumentEntityRelationship {

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val params: ParameterTool = ParameterTool.fromArgs(args)

    env.fromElements(1, 2, 3).map(i => " Integer: " + i).print()

    env.execute("Scala example")
  }
}


I did notice that the job name isn’t what I pass to env.execute. It is named 
“Flink Java Job”:



I can’t find anything online however about this phenomenon. Does anyone have 
any idea?

Thanks.

— Mano