d0evi1 created SPARK-20797:
------------------------------

             Summary: mllib lda load and save out of memory.
                 Key: SPARK-20797
                 URL: https://issues.apache.org/jira/browse/SPARK-20797
             Project: Spark
          Issue Type: Bug
  Affects Versions: 2.1.1, 2.0.2, 2.0.0, 1.6.3, 1.6.1
          Components: MLlib
            Reporter: d0evi1
When I train an online LDA model on large text data, the training step goes well, but the save step fails. When both the topic count and the vocabulary size are large, errors like the following appear (e.g. on 1.6.1):

1. The serialized result is bigger than spark.kryoserializer.buffer.max (raising that parameter works around it).
2. The result exceeds spark.akka.frameSize (raising this parameter too far also fails; on versions > 2.0.0 the error is "exceeds max allowed: spark.rpc.message.maxSize").

I found that the same problem in Word2Vec was already fixed by https://github.com/apache/spark/pull/9989. This is Word2Vec's save method in the latest code (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala); it estimates the data size and repartitions accordingly:

    val approxSize = (4L * vectorSize + 15) * numWords
    val nPartitions = ((approxSize / bufferSize) + 1).toInt
    val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
    spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))

But in mllib.clustering.LDAModel's save (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala) you'll see everything is written into a single partition:

    val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
    val topics = Range(0, k).map { topicInd =>
      Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd)
    }
    spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))

I adapted Word2Vec's save approach: replaced numWords with the topic count k, repartitioned into nPartitions, recompiled the code, and deployed the patched LDA project with large data on our cluster. It works. I hope this will be fixed in the next version.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
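For reference, a minimal sketch of the workaround described above, modeled on Word2VecModel's save path (the bufferSize lookup mirrors the Word2Vec code; the per-element size of 8 bytes per double and the surrounding names such as vocabSize, k, topicsMatrix, Data, and Loader are assumed to match LDAModel's existing SaveLoadV1_0 context, not a definitive patch):

```scala
// Sketch: apply Word2Vec's size-based repartitioning to LDA's topics matrix.
// Assumed to run inside LDAModel's save, where vocabSize, k, topicsMatrix,
// Data, Loader, and path are already in scope.
val bufferSize = Utils.byteStringAsBytes(
  spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
// Each topic column holds vocabSize doubles (8 bytes each), plus overhead.
val approxSize = (8L * vocabSize + 15) * k
val nPartitions = ((approxSize / bufferSize) + 1).toInt

val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
val topics = Range(0, k).map { topicInd =>
  Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
}
// repartition(nPartitions) instead of repartition(1) keeps each serialized
// partition under the kryo buffer / RPC message size limits.
spark.createDataFrame(topics)
  .repartition(nPartitions)
  .write.parquet(Loader.dataPath(path))
```

The idea is the same for load: spreading the topics matrix across nPartitions partitions keeps any single task result below spark.kryoserializer.buffer.max and spark.rpc.message.maxSize.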