d0evi1 created SPARK-20797:
------------------------------
Summary: MLlib LDA load and save run out of memory
Key: SPARK-20797
URL: https://issues.apache.org/jira/browse/SPARK-20797
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.1.1, 2.0.2, 2.0.0, 1.6.3, 1.6.1
Reporter: d0evi1
When I train an online LDA model on large text data, the training step goes well,
but the save step fails with errors like the following (e.g. on 1.6.1):
1. The serialized data is bigger than spark.kryoserializer.buffer.max (increasing
that parameter works around it).
2. The message exceeds spark.akka.frameSize (increasing this parameter too far also
fails; for versions > 2.0.0 the error is "exceeds max allowed: spark.rpc.message.maxSize").
The problem appears when both the number of topics and the vocabulary size are large.
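As a stopgap the limits can be raised in the Spark configuration. This is only a
sketch of the workaround, not the fix, and the values below are examples rather
than anything from this report:

import org.apache.spark.SparkConf

// Workaround sketch: raise the serializer buffer and message-size limits so the
// single-partition save has room. The values here are examples only.
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.max", "1g")  // default is 64m
  .set("spark.rpc.message.maxSize", "512")       // MiB, Spark 2.0.0+
  // On 1.6.x the corresponding limit is spark.akka.frameSize (also in MiB).

This only delays the failure: with a big enough topics matrix the limits are hit
again, so the real fix is to repartition the data before writing, as Word2Vec now does.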
I then found https://github.com/apache/spark/pull/9989: Word2Vec had the same
problem and it has been fixed. This is Word2Vec's save method from the latest code
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala):
// bufferSize is defined a few lines earlier in the same method:
val bufferSize = Utils.byteStringAsBytes(
  spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
val approxSize = (4L * vectorSize + 15) * numWords
val nPartitions = ((approxSize / bufferSize) + 1).toInt
val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
But in mllib.clustering.LDAModel's save
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala)
you'll see:
val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
val topics = Range(0, k).map { topicInd =>
  Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
}
spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
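For a rough sense of why repartition(1) breaks here (the numbers are my own
illustration, not from this report):

// Illustrative size estimate only; k and vocabSize are made-up values.
// With k topics and vocabSize terms, the dense topics matrix holds
// k * vocabSize Doubles, and repartition(1) writes it all from one task.
val k = 200
val vocabSize = 1000000L
val approxBytes = 8L * k * vocabSize                // ~1.6 GB for these values
println(s"topics data in one partition: ${approxBytes / (1L << 20)} MiB")
// far beyond the 64m default of spark.kryoserializer.buffer.max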
I tried Word2Vec's approach in LDA's save: replaced numWords with the topic count k,
used repartition(nPartitions), recompiled the code, and deployed the patched LDA job
with large data on our cluster; it works. A sketch of the change is below.
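This is only a sketch of what I changed, modeled on the Word2Vec fix; the exact size
formula and the use of Utils.byteStringAsBytes are my adaptation, not code that
exists in LDAModel today:

// Sketch inside LDAModel's save, assuming spark, sc, k, vocabSize, topicsMatrix,
// Data, Loader and path are available as in the existing method.
import org.apache.spark.util.Utils

val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
val topics = Range(0, k).map { topicInd =>
  Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
}

// Keep each partition under spark.kryoserializer.buffer.max, like Word2Vec does.
val bufferSize = Utils.byteStringAsBytes(
  spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
// One row per topic: vocabSize Doubles (8 bytes each) plus a small overhead.
val approxSize = (8L * vocabSize + 15) * k
val nPartitions = ((approxSize / bufferSize) + 1).toInt

spark.createDataFrame(topics)
  .repartition(nPartitions)
  .write.parquet(Loader.dataPath(path))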
I hope this can be fixed in a future version.