[jira] [Created] (FLINK-9664) FlinkML Quickstart Loading Data section example doesn't work as described
Mano Swerts created FLINK-9664:
--
Summary: FlinkML Quickstart Loading Data section example doesn't work as described
Key: FLINK-9664
URL: https://issues.apache.org/jira/browse/FLINK-9664
Project: Flink
Issue Type: Bug
Components: Documentation, Machine Learning Library
Affects Versions: 1.5.0
Reporter: Mano Swerts

The ML documentation example isn't complete: https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data

The referred section loads data from an astroparticle binary classification dataset to showcase SVM. The dataset uses 0 and 1 as labels, which doesn't produce correct results: the SVM predictor expects -1 and 1 labels to predict correctly. The documentation, however, doesn't mention that, so the example doesn't work and gives no clue why. The documentation should be updated with an explicit mention of the -1 and 1 labels and a mapping function that shows the conversion of the labels.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
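The mapping function the ticket asks the documentation to show could be sketched as follows. This is plain, self-contained Scala: `LabeledVector` is modeled here as a simple case class (FlinkML's real type lives in `org.apache.flink.ml.common` and wraps a `Vector` rather than an `Array[Double]`); in an actual job the same `map` would run on the `DataSet[LabeledVector]` returned by `MLUtils.readLibSVM`.

```scala
// Sketch of the label conversion the ticket asks the docs to show.
// LabeledVector is modeled as a plain case class so the snippet is runnable
// without Flink on the classpath.
case class LabeledVector(label: Double, vector: Array[Double])

object LabelFix {
  // svmguide1 labels examples 0.0/1.0; FlinkML's SVM expects -1.0/+1.0.
  def toSvmLabel(lv: LabeledVector): LabeledVector =
    lv.copy(label = if (lv.label > 0.0) 1.0 else -1.0)

  def main(args: Array[String]): Unit = {
    val raw = Seq(LabeledVector(0.0, Array(1.0, 2.0)),
                  LabeledVector(1.0, Array(3.0, 4.0)))
    println(raw.map(toSvmLabel).map(_.label).mkString(","))  // prints -1.0,1.0
  }
}
```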
Re: FlinkML SVM Predictions are always 1.0
Hi all,

This is just getting stranger… After playing a while, it seems that if I have a vector whose values are all zero, it classifies it as -1.0. Any other value in the vector causes it to classify as 1.0:

=== Predictions
(DenseVector(0.0, 0.0, 0.0),-1.0)
(DenseVector(0.0, 0.5, 0.0),1.0)
(DenseVector(1.0, 1.0, 1.0),1.0)
(DenseVector(0.0, 0.0, 0.0),-1.0)
(DenseVector(0.0, 0.5, 1.0),1.0)

So it seems that my values need to be binary for this prediction to work, which of course does not make sense and doesn't match the data from the example on the Flink website. It gives me the impression that it is using the vector as the label instead of the value… Any insights?

— Mano

On 25 Jun 2018, at 11:40, Mano Swerts <mano.swe...@ixxus.com> wrote:

Hi Rong,

As you can see in my test data example, I did change the labeling data to 8 and 16 instead of 1 and 0. If SVM always returns +1.0 or -1.0, that would indeed explain where the 1.0 is coming from. But it never gives me -1.0, so there is still something wrong, as it classifies everything under the same label.

Thanks.

— Mano

On 23 Jun 2018, at 20:50, Rong Rong <walter...@gmail.com> wrote:

Hi Mano,

For the always-positive prediction result: I think the standard svmguide data [1] labels data as 0.0 and 1.0 instead of -1.0 and +1.0. Maybe correcting that should work for your case. As for the eval pairs, I think SVM in FlinkML will always return a +1.0 or -1.0 when you use it this way as a binary classification.

Thanks,
Rong

[1] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1

On Fri, Jun 22, 2018 at 6:49 AM Mano Swerts <mano.swe...@ixxus.com> wrote:

Hi guys,

Here I am again. I am playing with Flink ML and was just trying to get the example used in the documentation to work: https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data (the one using the astroparticle LibSVM data).
My code is basically what you see in the example, with some more output for verification:

object LearnDocumentEntityRelationship {

  val trainingDataPath = "/data/svmguide1.training.txt"
  val testDataPath = "/data/svmguide1.test.txt"

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, trainingDataPath)

    println("")
    println("=== Training Data")
    println("")
    trainingData.print()

    val testData = MLUtils.readLibSVM(env, testDataPath).map(x => (x.vector, x.label))

    println("")
    println("=== Test Data")
    println("")
    testData.print()

    val svm = SVM()
      .setBlocks(env.getParallelism)
      .setIterations(100)
      .setRegularization(0.001)
      .setStepsize(0.1)
      .setSeed(42)

    svm.fit(trainingData)

    val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testData)

    println("")
    println("=== Evaluation Pairs")
    println("")
    evaluationPairs.print()

    val realData = MLUtils.readLibSVM(env, testDataPath).map(x => x.vector)

    val predictionDS = svm.predict(realData)

    println("")
    println("=== Predictions")
    println("")
    predictionDS.print()

    println("=== End")

    env.execute("Learn Document Entity Relationship Job")
  }
}

The issue is that the predictions (from both the evaluation pairs and the prediction dataset) are always equal to "1.0". When I changed the labels in the data files to 16 and 8 (so 1 is no longer a valid label) it still keeps predicting "1.0" for every single record. I also tried with some other custom datasets, but I always get that same result.
This is a concise part of the output (as the data contains too many records to put here):

=== Test Data
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),8.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),8.0)

=== Evaluation Pairs
(16.0,1.0)
(16.0,1.0)
(8.0,1.0)
(8.0,1.0)

=== Predictions
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,4.236298), (1,21.9821), (
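One plausible reading of the zero-vector behaviour above (an assumption, not confirmed against the FlinkML source): a linear SVM predicts the sign of the raw margin w · x (+ bias), so if the learned weights all contribute positively for this data and the bias is near zero, only the all-zeros vector lands on the non-positive side. A plain-Scala sketch of that decision rule, with made-up weights:

```scala
object SvmSign {
  // Linear SVM decision rule: predict the sign of the margin w . x + b.
  def margin(w: Array[Double], x: Array[Double], b: Double): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

  def predict(w: Array[Double], x: Array[Double], b: Double = 0.0): Double =
    if (margin(w, x, b) > 0.0) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    val w = Array(0.4, 0.7, 0.2)  // hypothetical all-positive weights, zero bias
    println(predict(w, Array(0.0, 0.0, 0.0)))  // margin 0    -> -1.0
    println(predict(w, Array(0.0, 0.5, 0.0)))  // margin 0.35 -> 1.0
  }
}
```

Under those assumptions, the model reproduces exactly the pattern in the predictions above: -1.0 for the zero vector, 1.0 for everything else.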
Re: FlinkML SVM Predictions are always 1.0
Hi Rong,

As you can see in my test data example, I did change the labeling data to 8 and 16 instead of 1 and 0. If SVM always returns +1.0 or -1.0, that would indeed explain where the 1.0 is coming from. But it never gives me -1.0, so there is still something wrong, as it classifies everything under the same label.

Thanks.

— Mano

> On 23 Jun 2018, at 20:50, Rong Rong wrote:
>
> Hi Mano,
>
> For the always positive prediction result. I think the standard svmguide
> data [1] is labeling data as 0.0 and 1.0 instead of -1.0 and +1.0. Maybe
> correcting that should work for your case.
> For the change of eval pairs, I think SVM in FlinkML will always return
> a +1.0 or -1.0 when you use it this way as a binary classification.
>
> Thanks,
> Rong
>
> [1] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1
>
> On Fri, Jun 22, 2018 at 6:49 AM Mano Swerts wrote:
>
>> Hi guys,
>>
>> Here I am again. I am playing with Flink ML and was just trying to get the
>> example used in the documentation to work:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data
>> (the one using the astroparticle LibSVM data).
>>
>> My code is basically what you see in the example, with some more output
>> for verification:
>>
>> object LearnDocumentEntityRelationship {
>>
>>   val trainingDataPath = "/data/svmguide1.training.txt"
>>   val testDataPath = "/data/svmguide1.test.txt"
>>
>>   def main(args: Array[String]) {
>>     val env = ExecutionEnvironment.getExecutionEnvironment
>>
>>     val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, trainingDataPath)
>>
>>     println("")
>>     println("=== Training Data")
>>     println("")
>>     trainingData.print()
>>
>>     val testData = MLUtils.readLibSVM(env, testDataPath).map(x => (x.vector, x.label))
>>
>>     println("")
>>     println("=== Test Data")
>>     println("")
>>     testData.print()
>>
>>     val svm = SVM()
>>       .setBlocks(env.getParallelism)
>>       .setIterations(100)
>>       .setRegularization(0.001)
>>       .setStepsize(0.1)
>>       .setSeed(42)
>>
>>     svm.fit(trainingData)
>>
>>     val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testData)
>>
>>     println("")
>>     println("=== Evaluation Pairs")
>>     println("")
>>     evaluationPairs.print()
>>
>>     val realData = MLUtils.readLibSVM(env, testDataPath).map(x => x.vector)
>>
>>     val predictionDS = svm.predict(realData)
>>
>>     println("")
>>     println("=== Predictions")
>>     println("")
>>     predictionDS.print()
>>
>>     println("=== End")
>>
>>     env.execute("Learn Document Entity Relationship Job")
>>   }
>> }
>>
>> The issue is that the predictions (from both the evaluation pairs and the
>> prediction dataset) are always equal to "1.0". When I changed the labels in
>> the data files to 16 and 8 (so 1 is no longer a valid label) it still
>> keeps predicting "1.0" for every single record. I also tried with some
>> other custom datasets, but I always get that same result.
>>
>> This is a concise part of the output (as the data contains too many records
>> to put here):
>>
>> === Test Data
>>
>> (SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
>> (SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
>> (SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),8.0)
>> (SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),8.0)
>>
>> === Evaluation Pairs
>>
>> (16.0,1.0)
>> (16.0,1.0)
>> (8.0,1.0)
>> (8.0,1.0)
>>
>> === Predictions
>>
>> (SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
>> (SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
>> (SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),1.0)
>> (SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),1.0)
>>
>> Am I doing something wrong?
>>
>> Any pointers are greatly appreciated. Thanks!
>>
>> — Mano
>>
FlinkML SVM Predictions are always 1.0
Hi guys,

Here I am again. I am playing with Flink ML and was just trying to get the example used in the documentation to work: https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data (the one using the astroparticle LibSVM data).

My code is basically what you see in the example, with some more output for verification:

object LearnDocumentEntityRelationship {

  val trainingDataPath = "/data/svmguide1.training.txt"
  val testDataPath = "/data/svmguide1.test.txt"

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, trainingDataPath)

    println("")
    println("=== Training Data")
    println("")
    trainingData.print()

    val testData = MLUtils.readLibSVM(env, testDataPath).map(x => (x.vector, x.label))

    println("")
    println("=== Test Data")
    println("")
    testData.print()

    val svm = SVM()
      .setBlocks(env.getParallelism)
      .setIterations(100)
      .setRegularization(0.001)
      .setStepsize(0.1)
      .setSeed(42)

    svm.fit(trainingData)

    val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testData)

    println("")
    println("=== Evaluation Pairs")
    println("")
    evaluationPairs.print()

    val realData = MLUtils.readLibSVM(env, testDataPath).map(x => x.vector)

    val predictionDS = svm.predict(realData)

    println("")
    println("=== Predictions")
    println("")
    predictionDS.print()

    println("=== End")

    env.execute("Learn Document Entity Relationship Job")
  }
}

The issue is that the predictions (from both the evaluation pairs and the prediction dataset) are always equal to "1.0". When I changed the labels in the data files to 16 and 8 (so 1 is no longer a valid label) it still keeps predicting "1.0" for every single record. I also tried with some other custom datasets, but I always get that same result.
This is a concise part of the output (as the data contains too many records to put here):

=== Test Data
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),8.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),8.0)

=== Evaluation Pairs
(16.0,1.0)
(16.0,1.0)
(8.0,1.0)
(8.0,1.0)

=== Predictions
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),1.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.19), (3,179.9808)),1.0)

Am I doing something wrong?

Any pointers are greatly appreciated. Thanks!

— Mano
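Whatever the two raw labels are (0/1 in svmguide1, or the 8/16 used above), they have to be remapped to -1.0/+1.0 before fit and evaluate, because the predictor only ever emits those two values and the evaluation pairs can never match otherwise. A minimal plain-Scala sketch of that remapping, assuming 8.0 is the negative class (the choice of negative class is an illustration, not taken from the thread):

```scala
object Relabel {
  // Map an arbitrary binary label pair onto the -1.0/+1.0 that FlinkML's
  // SVM emits, so that evaluation pairs like (16.0,1.0) can actually match.
  def toSvmLabel(label: Double, negativeClass: Double): Double =
    if (label == negativeClass) -1.0 else 1.0

  def main(args: Array[String]): Unit = {
    val labels = Seq(16.0, 16.0, 8.0, 8.0)  // labels from the test output above
    println(labels.map(toSvmLabel(_, 8.0)).mkString(","))  // prints 1.0,1.0,-1.0,-1.0
  }
}
```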
Re: Running a Scala Job doesn't produce print output
Hi guys,

I am going to answer my own question ;) I looked at a Scala example in the Flink GitHub repo, which uses ExecutionEnvironment.getExecutionEnvironment to obtain the environment. That apparently doesn't work. When I change this to StreamExecutionEnvironment.getExecutionEnvironment, as used in the Flink Maven archetype, it works fine.

I don't know whether this is a bug or the example needs updating. At least now this has been recorded for others struggling with the same issue in the future.

— Mano

On 21 Jun 2018, at 11:27, Mano Swerts <mano.swe...@ixxus.com> wrote:

Hi guys,

I have a question. I have been playing around with Flink this week and created some basic Java jobs that work fine. Now I am trying to run one in Scala. Running this code in the Scala REPL prints the expected output:

env.fromElements(1, 2, 3).map(i => " Integer: " + i).print()

However, when it is packaged in a JAR which I then deploy through the user interface, I don't get any output at all. I can start the job and it finishes without exceptions, but I don't see the result of the print() statement in the log. The class looks like this:

package com.ixxus.playground.fmk.flink

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._

object LearnDocumentEntityRelationship {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val params: ParameterTool = ParameterTool.fromArgs(args)

    env.fromElements(1, 2, 3).map(i => " Integer: " + i).print()

    env.execute("Scala example")
  }
}

I did notice that the job name isn't what I pass to env.execute. It is named "Flink Java Job". I can't find anything online about this phenomenon, however. Does anyone have any idea?

Thanks.

— Mano
Running a Scala Job doesn't produce print output
Hi guys,

I have a question. I have been playing around with Flink this week and created some basic Java jobs that work fine. Now I am trying to run one in Scala. Running this code in the Scala REPL prints the expected output:

env.fromElements(1, 2, 3).map(i => " Integer: " + i).print()

However, when it is packaged in a JAR which I then deploy through the user interface, I don't get any output at all. I can start the job and it finishes without exceptions, but I don't see the result of the print() statement in the log. The class looks like this:

package com.ixxus.playground.fmk.flink

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._

object LearnDocumentEntityRelationship {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val params: ParameterTool = ParameterTool.fromArgs(args)

    env.fromElements(1, 2, 3).map(i => " Integer: " + i).print()

    env.execute("Scala example")
  }
}

I did notice that the job name isn't what I pass to env.execute. It is named "Flink Java Job" (see the attached screenshot). I can't find anything online about this phenomenon, however. Does anyone have any idea?

Thanks.

— Mano