[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268662#comment-15268662 ]
Nick Pentreath commented on SPARK-15027:
----------------------------------------

I've managed to get it working for the following signature:

{code}
def train[ID: ClassTag: TypeTag]( // scalastyle:ignore
    ratings: Dataset[Rating[ID]],
    rank: Int = 10,
    numUserBlocks: Int = 10,
    numItemBlocks: Int = 10,
    maxIter: Int = 10,
    regParam: Double = 1.0,
    implicitPrefs: Boolean = false,
    alpha: Double = 1.0,
    nonnegative: Boolean = false,
    intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
    finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
    checkpointInterval: Int = 10,
    seed: Long = 0L)(
    implicit ord: Ordering[ID]): (Dataset[(ID, Array[Float])], Dataset[(ID, Array[Float])]) = {
{code}

Currently there are two issues:

# {{partitioner in returned factors *** FAILED ***}} - {{dataset.rdd}} does not appear to preserve the partitioner. I'm not sure whether (a) this is a non-issue because the data is now a Dataset and Catalyst should take care of partitioning, or (b) the plan will do a scan of the underlying RDD, so the partitioner still matters and we just need to update the test somehow to check the underlying RDD.
# Passing in a {{Dataset[Rating[ID]]}} generated from {{Dataset[Int].map(r => Rating(r.user.toLong, r.item.toLong, r.rating))}} leads to an exception related to codegen. But e.g. {{RDD[Rating[ID]].map(r => ...).toDS}} works fine for generic non-Int ids. I'm not sure if this is a bug or an issue with generics and Datasets.

Actually, issue #2 above occurs even with {{Dataset[Int].map(r => Rating(r.user, r.item, r.rating))}}.
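To illustrate issue #1, here is a minimal sketch (assuming an available {{SparkSession}} named {{spark}}, not part of the reported test) of why the partitioner check fails: an RDD partitioned by key carries a {{partitioner}}, but round-tripping through a Dataset goes via serialization and a fresh deserializing RDD, so {{Dataset.rdd}} reports no partitioner:

{code}
import org.apache.spark.HashPartitioner
import spark.implicits._

// A key-value RDD explicitly partitioned by key: partitioner is set.
val keyed = spark.sparkContext
  .parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(4))
println(keyed.partitioner)          // Some(org.apache.spark.HashPartitioner@...)

// Converting to a Dataset and back produces a new RDD of deserialized
// objects; the original partitioner metadata does not survive the trip.
val roundTripped = keyed.toDS().rdd
println(roundTripped.partitioner)   // None
{code}

This is why the existing test would need to either drop the partitioner assertion (option (a) above) or find a way to assert on the physical distribution of the underlying RDD (option (b)).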
> ALS.train should use DataFrame instead of RDD
> ---------------------------------------------
>
>                 Key: SPARK-15027
>                 URL: https://issues.apache.org/jira/browse/SPARK-15027
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD`
> to be consistent with other APIs under spark.ml and it also leaves space for
> Tungsten-based optimization.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)