d0evi1 created SPARK-20797:
------------------------------

             Summary: mllib lda load and save out of memory. 
                 Key: SPARK-20797
                 URL: https://issues.apache.org/jira/browse/SPARK-20797
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.1.1, 2.0.2, 2.0.0, 1.6.3, 1.6.1
            Reporter: d0evi1


When I train an online LDA model on a large text corpus, the training step goes 
well, but the save step fails. Errors like the following occur (e.g. on 1.6.1):

1. The serialized data is bigger than spark.kryoserializer.buffer.max (raising 
this parameter fixes it).
2. The message exceeds spark.akka.frameSize (setting this parameter too high 
also fails; on versions > 2.0.0 the error is "exceeds max allowed: 
spark.rpc.message.maxSize").

The problem appears when both the number of topics and the vocabulary size are 
large.
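
To give a sense of scale, here is a rough estimate of how much data the topics 
matrix holds. This is only a sketch: it assumes a dense matrix of 8-byte Doubles 
with vocabSize rows and k columns, and the object name and the example sizes are 
hypothetical, not taken from Spark or from my job.

```scala
object TopicsMatrixSize {
  // Approximate bytes held by a dense vocabSize x k matrix of Doubles.
  // Assumption: 8 bytes per entry, no object or serialization overhead.
  def approxBytes(vocabSize: Long, k: Long): Long = 8L * vocabSize * k

  def main(args: Array[String]): Unit = {
    val vocabSize = 1000000L // hypothetical vocabulary size
    val k = 1000L            // hypothetical number of topics
    val bytes = approxBytes(vocabSize, k)
    println(s"approx topics matrix size: ${bytes / (1024L * 1024L)} MiB")
  }
}
```

With sizes in this range, writing everything from a single partition means one 
task has to serialize the whole matrix.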


So I found this:

https://github.com/apache/spark/pull/9989, where the same problem was fixed for 
word2vec.

This is word2vec's save method from the latest code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:

      val approxSize = (4L * vectorSize + 15) * numWords
      val nPartitions = ((approxSize / bufferSize) + 1).toInt
      val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
      spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))


But the save method in mllib.clustering.LDAModel:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala

writes everything into a single partition:

      val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
      val topics = Range(0, k).map { topicInd =>
        Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd)
      }
      spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))


I tried applying word2vec's approach to LDA's save: I replaced numWords with 
the number of topics k, used repartition(nPartitions), recompiled the code, and 
deployed the patched LDA build with large data on our cluster. It works.

I hope this will be fixed in the next version.
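
The partition count for such a change could be computed roughly as below. This 
is a sketch under assumptions, not the exact patch: it sizes the data as 8 bytes 
per Double times vocabSize times k, and takes the buffer limit (the value of 
spark.kryoserializer.buffer.max in bytes) as a plain argument. The object and 
function names are hypothetical.

```scala
object LdaSavePartitions {
  // Number of partitions so that each partition stays below bufferSizeBytes,
  // following the same "approxSize / bufferSize + 1" pattern as word2vec's save.
  // Assumption: the model is a dense vocabSize x k matrix of 8-byte Doubles.
  def numPartitions(vocabSize: Long, k: Long, bufferSizeBytes: Long): Int = {
    require(bufferSizeBytes > 0, "bufferSizeBytes must be positive")
    val approxSize = 8L * vocabSize * k
    ((approxSize / bufferSizeBytes) + 1).toInt
  }

  def main(args: Array[String]): Unit = {
    val bufferSize = 64L * 1024L * 1024L // 64m, used here as an example limit
    // Hypothetical model: 1,000,000 vocabulary terms and 1,000 topics.
    println(numPartitions(1000000L, 1000L, bufferSize))
  }
}
```

The result would then replace the hard-coded 1 in repartition(1). A small model 
still ends up with a single partition, so existing behaviour is unchanged there.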



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
