Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18034#discussion_r141016974
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -468,7 +469,16 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
     val topics = Range(0, k).map { topicInd =>
       Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd)
     }
-    spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
+
+    val bufferSize = Utils.byteStringAsBytes(
+      spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
+    // We calculate the approximate size of the model
+    // We only calculate the array size, considering an
+    // average string size of 15 bytes, the formula is:
+    // (floatSize * vectorSize + 15) * numWords
+    val approxSize = (4L * k + 15) * topicsMatrix.numRows
+    val nPartitions = ((approxSize / bufferSize) + 1).toInt
+
+    spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
--- End diff --
Could you add a test case that saves the model into multiple files, loads it back, and checks correctness?
---
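For reference, the partition-count heuristic in the diff can be sketched standalone as below. This is an illustrative extraction, not Spark code: the object and method names are invented here, and it hard-codes the patch's assumptions (4 bytes per vector entry and ~15 bytes of average string overhead per word).

```scala
// Hypothetical standalone sketch of the diff's partition estimate.
object PartitionEstimate {
  // (floatSize * vectorSize + 15) * numWords, as in the patch's comment;
  // k is the vector size, numWords the number of rows in topicsMatrix.
  def approxModelSizeBytes(k: Int, numWords: Long): Long =
    (4L * k + 15) * numWords

  // One extra partition beyond the integer quotient, so the per-partition
  // payload stays below the kryo serializer buffer cap.
  def numPartitions(approxSize: Long, bufferSize: Long): Int =
    ((approxSize / bufferSize) + 1).toInt
}
```

For example, with k = 1000 topics, one million words, and the default 64m buffer, the estimate is about 4.015 GB, yielding 60 partitions, so no single parquet task serializes more than the buffer allows.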
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]