Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6948#discussion_r33330342
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
    @@ -184,6 +199,82 @@ class LocalLDAModel private[clustering] (
     
     }
     
    +@Experimental
    +object LocalLDAModel extends Loader[LocalLDAModel]{
    +
    +  private object SaveLoadV1_0 {
    +
    +    val formatVersionV1_0 = "1.0"
    +
    +    val classNameV1_0 = "org.apache.spark.mllib.clustering.LocalLDAModel"
    +
    +    // Store the distribution of terms of each topic as a Row in data.
    +    case class Data(termDistributions: Vector)
    +
    +    def save(sc: SparkContext, path: String, topicsMatrix: Matrix): Unit = 
{
    +
    +      val sqlContext = new SQLContext(sc)
    +      import sqlContext.implicits._
    +
    +      val k = topicsMatrix.numCols
    +      val metadata = compact(render
    +        (("class" -> classNameV1_0) ~ ("version" -> formatVersionV1_0) ~
    +         ("k" -> k) ~ ("vocabSize" -> topicsMatrix.numRows)))
    +      sc.parallelize(Seq(metadata), 
1).saveAsTextFile(Loader.metadataPath(path))
    +
    +      val topicsDenseMatrix = topicsMatrix.toBreeze.toDenseMatrix
    +
    +      val termDistributions = Range(0, k).map { topicInd =>
    +        Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)))
    +      }.toSeq
    +      sc.parallelize(termDistributions, 
1).toDF().write.parquet(Loader.dataPath(path))
    +    }
    +
    +    def load(sc: SparkContext, path: String): Matrix = {
    +
    +      val dataPath = Loader.dataPath(path)
    +      val sqlContext = new SQLContext(sc)
    +      val dataFrame = sqlContext.read.parquet(dataPath)
    +
    +      Loader.checkSchema[Data](dataFrame.schema)
    +      val termDistributions = dataFrame.collect()
    +      val vocabSize = termDistributions(0)(0).asInstanceOf[Vector].size
    +      val k = termDistributions.size
    +
    +      val brzTopics = BDM.zeros[Double](vocabSize, k)
    +      termDistributions.zipWithIndex.foreach { case (Row(vec: Vector), 
ind: Int) =>
    +        brzTopics(::, ind) := vec.toBreeze
    +      }
    +      Matrices.fromBreeze(brzTopics)
    +    }
    --- End diff --
    
    Is this necessary for the private method in `SaveLoadV1.load` as well?
    
    The thing is, for the `DistributedLDAModel` there is no way I can 
construct the model from that data alone (e.g., information regarding 
iterationTimes and docConcentration is missing). Hence, I construct the graph, 
return it from `SaveLoadV1.load`, and build the model from this information 
together with the metadata.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to