Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51391185
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    I finally decided to run a quick, hacky perf test as a proof of concept. The performance difference between the implementations, measured with 5 fits per run, is negligible: it falls within the run-to-run variance of a single version.
    Some details:
    I prepared a 32k document from two arbitrarily picked Wikipedia pages ("Machine learning" and "Adversarial machine learning", with the reference sections excluded). It contains 341 lines and can be split into 442 sentences by simply using dot+space as the sentence boundary. I injected the following test case into the Word2VecSuite class and ran it against three different implementations of the fit function in the mllib.feature.Word2Vec class: the old one in the master branch, and my final two versions from before and after adopting Sean's suggestion.
    
    <pre>
    test("testSpeed") {
      import java.io.File
      import scala.io.Source
      // One RDD element per line of the input file.
      val lines = sc.parallelize(Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
      // Split lines into sentences on ". ", then each sentence into words on " ".
      val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
      println(s"read file into rdd, sentences=${sentences.count()}")
      var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
      var duration = 0L
      // Fit the model 5 times and accumulate the elapsed wall-clock time in milliseconds.
      for (i <- 1 to 5) {
        val start = System.currentTimeMillis()
        val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42L)
        builtModel = model.fit(sentences)
        duration += (System.currentTimeMillis() - start)
      }
      val synonyms = builtModel.findSynonyms("learning", 4).mkString("\n")
      println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, learning's synonyms:$synonyms")
    }
    </pre>
    
    The vocabulary size of the model is 155. Below are the times (in milliseconds, as measured with System.currentTimeMillis) for three runs of each version, plus the average of the final two runs. As you can see from the code, each run actually builds the model 5 times to magnify any potential difference.
    <pre>
                           masterVersion   PR-useIter   PR-useCollection
    run1                   2232            2107         1933
    run2                   2085            1986         1987
    run3                   2005            2123         2004
    average(run2, run3)    2045            2054.5       1995.5
    </pre>
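    As a rough check of the "within variance" claim, here is a minimal sketch that recomputes the averages from the table and compares them with each version's own spread across runs. The object and helper names (TimingSummary, mean, spread) are made up for illustration, and using max minus min as the spread is my own simplification, not part of the original test.

```scala
// Timings in milliseconds, copied from the table above.
// Each value is the total wall-clock time for 5 consecutive fits.
object TimingSummary {
  val timings: Map[String, Seq[Double]] = Map(
    "masterVersion"    -> Seq(2232.0, 2085.0, 2005.0),
    "PR-useIter"       -> Seq(2107.0, 1986.0, 2123.0),
    "PR-useCollection" -> Seq(1933.0, 1987.0, 2004.0)
  )

  // Arithmetic mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Hypothetical spread measure: slowest run minus fastest run of one version.
  def spread(xs: Seq[Double]): Double = xs.max - xs.min

  def main(args: Array[String]): Unit = {
    timings.foreach { case (name, runs) =>
      // Average of run2 and run3 only, matching the last row of the table.
      val lastTwoAvg = mean(runs.drop(1))
      println(f"$name%-18s avg(run2,run3)=$lastTwoAvg%.1f spread=${spread(runs)}%.0f")
    }
  }
}
```

    On these numbers the largest gap between the run2/run3 averages is 59 ms (2054.5 vs 1995.5), while each version's own three runs differ by 71 ms or more, which is consistent with the difference being noise.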
    BTW, the following is not relevant to the perf test, just FYI. The two versions in this pull request produce exactly the same result, which supports the correctness of both. The result is interesting as well: although the dataset is quite tiny, the new (unmerged) versions look better than the old version (the one currently in the master branch). Of course, you can degrade the new version with a bad sentence splitter (a hard-cut splitter would make it do exactly the same thing as the old version). The simple splitter used in the test case can't deal with abbreviations, which is why I removed the reference sections from the text.
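    To make the two splitting strategies concrete, here is a small sketch in plain Scala (no Spark needed). splitOnDotSpace mirrors the dot+space splitting used in the test case; hardCut is a hypothetical helper that ignores sentence boundaries and cuts the word stream into fixed-size chunks, roughly what the old code does with MAX_SENTENCE_LENGTH. The names and the maxLen parameter are assumptions for illustration only.

```scala
object SentenceSplitters {
  // Naive splitter from the test case: sentences end at ". ", words are
  // separated by a single space. Abbreviations such as "e.g. " are
  // wrongly treated as sentence ends.
  def splitOnDotSpace(text: String): Seq[Seq[String]] =
    text.split("\\. ").toSeq.map(_.split(" ").toSeq)

  // Hypothetical hard-cut splitter: every chunk has maxLen words,
  // except possibly the last one.
  def hardCut(words: Seq[String], maxLen: Int): Seq[Seq[String]] =
    words.grouped(maxLen).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    val text = "Spam filters learn. See e.g. the survey. Done"
    // The abbreviation "e.g. " produces a spurious sentence break.
    splitOnDotSpace(text).foreach(s => println(s.mkString("[", ", ", "]")))
    // Fixed-size chunks that ignore punctuation entirely.
    hardCut(text.split(" ").toSeq, 3).foreach(s => println(s.mkString("[", ", ", "]")))
  }
}
```

    With the new fit signature, either function's output could be parallelized and passed in as the dataset, so the choice of splitter is left to the caller.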
    Here are the top synonyms of "learning" on the tiny dataset. Please keep in mind that this is just for fun, not solid proof of which version is better, since the dataset is tiny.
    New versions, learning's synonyms:
    (network,0.9990321742854605)
    (related,0.9966140511173031)
    (sparse,0.9965729586431097)
    (algorithms,0.99376379497485)
    
    Old version, learning's synonyms:
    (against,0.9895162633562077)
    (Support,0.9547255372896342)
    (Association,0.9499811242788365)
    (Attacks,0.9321700815006693)
    