Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586

Hi @cloud-fan, @jerryshao. The `writeClass`/`readClass` overhead can be avoided by registering the classes `Vector`, `DenseVector`, and `SparseVector` with Kryo. The following are the test results:

```scala
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}
import org.apache.spark.storage.StorageLevel
// VectorWritable comes from whatever produced the sequence file (e.g. Mahout)

val conf = new SparkConf().setAppName("Vector Register Test")
conf.registerKryoClasses(Array(classOf[Vector], classOf[DenseVector], classOf[SparseVector]))
val sc = new SparkContext(conf)

val sourceData = sc.sequenceFile[LongWritable, VectorWritable](args(0))
  .map { case (k, v) =>
    val vector = v.get()
    val tmpVector = new Array[Double](vector.size())
    for (i <- 0 until vector.size()) {
      tmpVector(i) = vector.get(i)
    }
    Vectors.dense(tmpVector)
  }
sourceData.persist(StorageLevel.OFF_HEAP)

var start = System.currentTimeMillis()
sourceData.count()
println("First: " + (System.currentTimeMillis() - start))

start = System.currentTimeMillis()
sourceData.count()
println("Second: " + (System.currentTimeMillis() - start))

sc.stop()
```

Results:
- serialized size: before 38.4 GB, after 30.5 GB
- first count: before 93318 ms, after 80708 ms
- second count: before 5870 ms, after 3382 ms

These classes are very common in ML, and the same applies to `Matrix`, `DenseMatrix`, and `SparseMatrix`. I'm not sure whether we should register these classes in core directly, because that could introduce an extra jar dependency. Could you give some advice? Or should we just mention this in the ML docs?

The root cause should be Kryo's behavior: it writes the full class name instead of the class ID when the class is not registered.
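A rough back-of-the-envelope way to see where the savings come from: an unregistered class can cost on the order of its fully qualified name in the serialized stream, while a registered one costs a small varint ID. The sketch below (plain Python, purely illustrative — it mimics the idea behind registration, not Kryo's actual wire format, and the ID `42` is hypothetical) compares the two encodings:

```python
# Illustrative only: per-record cost of writing a fully qualified class
# name (unregistered, worst case) vs a small integer ID (registered).
# This is NOT Kryo's real format; it just shows why registration helps.

def varint_size(n: int) -> int:
    """Number of bytes an unsigned varint encoding of n would take."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

class_name = "org.apache.spark.ml.linalg.DenseVector"
records = 1_000_000

# Unregistered: roughly the UTF-8 class name per record (worst case).
unregistered_bytes = len(class_name.encode("utf-8")) * records

# Registered: a small varint class ID per record (42 is made up here).
registered_bytes = varint_size(42) * records

print(unregistered_bytes)                      # 38 bytes per record
print(registered_bytes)                        # 1 byte per record
print(unregistered_bytes - registered_bytes)   # bytes saved
```

The real savings per record are smaller than this worst case (Kryo caches class names within a stream), but the direction matches the ~8 GB reduction measured above.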