Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18034#discussion_r141016974
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
    @@ -468,7 +469,16 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
           val topics = Range(0, k).map { topicInd =>
             Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd)
           }
    -      spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
    +
    +      val bufferSize = Utils.byteStringAsBytes(
    +        spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
    +      // Estimate the approximate size of the serialized model.
    +      // Only the array size is counted; assuming an average
    +      // term-string size of 15 bytes, the formula is:
    +      // (floatSize * vectorSize + 15) * numWords
    +      val approxSize = (4L * k + 15) * topicsMatrix.numRows
    +      val nPartitions = ((approxSize / bufferSize) + 1).toInt
    +      spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
    --- End diff --
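    
    For reference, the heuristic in this diff can be sketched as a standalone function (names and example values are illustrative, not from the patch):
    
    ```scala
    object PartitionEstimate {
      // Approximate serialized size per the diff's comment:
      // (floatSize * k + avgStringBytes) * numWords, with floatSize = 4
      // and avgStringBytes = 15; divide by the kryo buffer size and add 1
      // so small models still land in a single partition.
      def estimatePartitions(k: Int, numWords: Long, bufferSizeBytes: Long): Int = {
        val approxSize = (4L * k + 15) * numWords
        ((approxSize / bufferSizeBytes) + 1).toInt
      }
    
      def main(args: Array[String]): Unit = {
        // e.g. k = 100 topics, a vocabulary of 1,000,000 words,
        // and the default 64 MiB spark.kryoserializer.buffer.max
        val n = estimatePartitions(100, 1000000L, 64L * 1024 * 1024)
        println(n) // prints 7
      }
    }
    ```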
    
    Can you add a test case that saves a model into multiple files, loads it back, and checks correctness?
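
    A minimal sketch of such a round-trip test, assuming the usual MLlibTestSparkContext harness (`sc`, `test`) and a pre-built model fixture (`toyModel` is a hypothetical name, not from this PR):

    ```scala
    test("save/load a model written across multiple partition files") {
      val tempDir = Utils.createTempDir()
      val path = tempDir.toURI.toString
      try {
        // A model large enough (relative to the configured buffer) that
        // nPartitions > 1, so the parquet output spans several files.
        val model = toyModel // hypothetical LocalLDAModel fixture
        model.save(sc, path)
        val sameModel = LocalLDAModel.load(sc, path)
        assert(model.topicsMatrix === sameModel.topicsMatrix)
        assert(model.docConcentration === sameModel.docConcentration)
        assert(model.topicConcentration === sameModel.topicConcentration)
      } finally {
        Utils.deleteRecursively(tempDir)
      }
    }
    ```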

