Hi guys,
Here I am again. I am playing with Flink ML and am trying to get the example
used in the documentation to work:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data
(the one using the astroparticle LibSVM data).
My code is basically what you see in the example, with some more output for
verification:
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector

object LearnDocumentEntityRelationship {

  val trainingDataPath = "/data/svmguide1.training.txt"
  val testDataPath = "/data/svmguide1.test.txt"

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Training set in LibSVM format
    val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, trainingDataPath)
    println("============================")
    println("=== Training Data")
    println("============================")
    trainingData.print()

    // Test set as (features, label) pairs, the input passed to evaluate()
    val testData = MLUtils.readLibSVM(env, testDataPath).map(x => (x.vector, x.label))
    println("============================")
    println("=== Test Data")
    println("============================")
    testData.print()

    val svm = SVM()
      .setBlocks(env.getParallelism)
      .setIterations(100)
      .setRegularization(0.001)
      .setStepsize(0.1)
      .setSeed(42)
    svm.fit(trainingData)

    // Pairs of (true label, predicted label)
    val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testData)
    println("============================")
    println("=== Evaluation Pairs")
    println("============================")
    evaluationPairs.print()

    // Features only, the input passed to predict()
    val realData = MLUtils.readLibSVM(env, testDataPath).map(x => x.vector)
    val predictionDS = svm.predict(realData)
    println("============================")
    println("=== Predictions")
    println("============================")
    predictionDS.print()

    println("=== End")
    env.execute("Learn Document Entity Relationship Job")
  }
}
The issue is that the predictions (from both the evaluation pairs and the
prediction dataset) are always equal to “1.0”. When I changed the labels in the
data files to 16 and 8 (so 1 is not a valid label anymore) it still keeps
predicting “1.0” for every single record. I also tried with some other custom
datasets, but I always get that same result.
This is an excerpt of the output (the data contains too many records to
include in full):
============================
=== Test Data
============================
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),8.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.166669), (3,179.9808)),8.0)
============================
=== Evaluation Pairs
============================
(16.0,1.0)
(16.0,1.0)
(8.0,1.0)
(8.0,1.0)
============================
=== Predictions
============================
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),1.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.166669), (3,179.9808)),1.0)
Am I doing something wrong?
Any pointers are greatly appreciated. Thanks!
— Mano