Hi All,
I'm trying a simple K-Means example as per the website:
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
but I'm trying to write a Java-based validation method first, so that
lines with missing values are omitted or replaced with 0:
    public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
        JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
            public Iterable<Vector> call(String s) {
                String[] split = s.split(",");
                ArrayList<Vector> add = new ArrayList<Vector>();
                if (split.length != 2) {
                    add.add(Vectors.dense(0, 0));
                } else {
                    add.add(Vectors.dense(Double.parseDouble(split[0]),
                                          Double.parseDouble(split[1])));
                }
                return add;
            }
        });
        return words.rdd();
    }
When I then call it from Scala:
    val parsedData = dc.prepareKMeans(data)
    val p = parsedData.collect()
I get:
    Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
Why is the class tag Object rather than Vector?
1) How do I get this working correctly using the Java validation example
above? or
2) How can I modify
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
so that lines where s.split yields fewer than 2 fields are ignored? or
3) Is there a better way to do input validation first?
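To make question 3 concrete, here is a rough sketch of the per-line check I have in mind, written in plain Java with no Spark dependency. LineValidator and parseLine are names I made up for illustration; they are not part of Spark or MLlib. It assumes every valid line has exactly two numeric fields, and it omits bad lines rather than replacing them with 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Spark class): validates one "x,y" input line.
final class LineValidator {

    // Returns a one-element list holding the parsed pair, or an empty list
    // when the line is malformed, so a flatMap over the lines drops bad rows.
    static List<double[]> parseLine(String line) {
        if (line == null) {
            return Collections.emptyList();
        }
        // A limit of -1 keeps trailing empty fields, so "1.0," has 2 fields
        // and is rejected by the number parsing below, not by the length check.
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            return Collections.emptyList();
        }
        try {
            double x = Double.parseDouble(fields[0].trim());
            double y = Double.parseDouble(fields[1].trim());
            List<double[]> out = new ArrayList<double[]>();
            out.add(new double[] {x, y});
            return out;
        } catch (NumberFormatException e) {
            // Empty or non-numeric field: treat as a missing value and skip.
            return Collections.emptyList();
        }
    }
}
```

Inside the call method above, each returned pair could presumably be wrapped with Vectors.dense(pair[0], pair[1]). This only covers the validation step; I don't expect it to change the class tag problem.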
Using Spark and MLlib:
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
Many thanks in advance
Dev