[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]Collapsed Gibbs sampli...

allwefantasy Wed, 10 Sep 2014 02:09:42 -0700

Github user allwefantasy commented on the pull request:

    https://github.com/apache/spark/pull/1983#issuecomment-55089256
  
    @witgo I looked at your performance test, but it doesn't mention the number of iterations. How many iterations did you run? It finished in just one hour.
    
    I also re-ran the test on a dataset on my side:
    
    Cluster resources       60 executors (60 cores, 220g memory)
    Corpus size             240000 documents
    Number of iterations    100
    Number of terms         80000
    Number of topics        500
    alpha                   0.1
    beta                    0.01
    
    Basically, a single iteration takes 40-60 minutes, which is extremely long. The test code is as follows:
    
    
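        // `sc` and `wordInfo` are defined elsewhere; `wordInfo.value.size` is used as the vocabulary size.
        // Read the training corpus and keep a 10% sample (without replacement).
        val data = sc.textFile("/output/william/spark-lda-data/trainings").sample(false, 0.1)
        // Each line is parsed as "docId,termIndex:count termIndex:count ...".
        val parsedData = data.map { line =>
          val parts = line.split(',')
          val values = parts(1).split(' ').map { k =>
            val Array(pos, v) = k.split(":")
            (pos.toInt, v.toInt)
          }.toMap[Int, Int]
          // Expand the sparse counts into a dense vector over the whole vocabulary.
          Document(parts(0).toInt, (0 until wordInfo.value.size).map(k => values.getOrElse(k, 0)).toArray)
        }
        val (topicModel, documents) =
          org.apache.spark.mllib.clustering.LDA.train(parsedData, wordInfo.value.size, 500, 30, (0.7 * 30).toInt, 0.1, 0.01)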
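
The per-line parsing in the test code can be tried without a cluster. The following is only a minimal sketch: the input format ("docId,termIndex:count ...") is inferred from the parsing logic above, and `ParseLineSketch`, `parseLine`, `vocabSize`, and the sample line are hypothetical names and data introduced for illustration. They are not part of the pull request.

```scala
object ParseLineSketch {
  // Parse "docId,termIndex:count termIndex:count ..." into (docId, dense counts).
  // vocabSize stands in for wordInfo.value.size from the test code (assumption).
  def parseLine(line: String, vocabSize: Int): (Int, Array[Int]) = {
    val parts = line.split(',')
    val counts = parts(1).split(' ').map { kv =>
      val Array(pos, v) = kv.split(":")
      (pos.toInt, v.toInt)
    }.toMap
    // Terms absent from the line get a count of 0.
    (parts(0).toInt, Array.tabulate(vocabSize)(k => counts.getOrElse(k, 0)))
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line: document 7 with term 0 twice and term 3 once.
    val (docId, vec) = parseLine("7,0:2 3:1", vocabSize = 5)
    println(s"$docId -> ${vec.mkString(" ")}") // prints "7 -> 2 0 0 1 0"
  }
}
```

Note that the dense representation allocates one slot per vocabulary term for every document, so with the 80000 terms reported above each document carries an 80000-element array regardless of how few terms it contains.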




