[GitHub] [spark] zhengruifeng commented on pull request #31276: [SPARK-34189][ML] w2v findSynonyms optimization

GitBox Thu, 21 Jan 2021 22:23:50 -0800


zhengruifeng commented on pull request #31276:
URL: https://github.com/apache/spark/pull/31276#issuecomment-764468057



   Train a model with https://en.wikipedia.org/wiki/Word2vec as the training data:
   ```
   import org.apache.spark.ml.feature._
   
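   // spark.read.text yields one row per line of the input file
   val df = spark.read.text("/d0/Dev/PRs/Word2vec")
   // Tokenize each line by splitting on single spaces
   val df2 = df.as[String].map(_.split(" ")).toDF("words")
   
   // Train for a single iteration and persist the model for the timing run
   val w2v = new Word2Vec().setInputCol("words").setMaxIter(1)
   val w2vm = w2v.fit(df2)
   w2vm.save("/tmp/w2vm")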
   
   ```
   
   Performance test:
   ```
   import org.apache.spark.ml.feature._
   
   val w2vm = Word2VecModel.load("/tmp/w2vm")
   val words = w2vm.getVectors.select("word").as[String].collect
   
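   // Kept as a single block so the REPL compiles all of it before the timer starts.
   val duration = {
     val start = System.currentTimeMillis
     // Look up the 10 nearest synonyms of every vocabulary word, 10000 times over.
     Seq.range(0, 10000).foreach { _ =>
       words.foreach(word => w2vm.findSynonymsArray(word, 10))
     }
     System.currentTimeMillis - start  // elapsed milliseconds
   }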
   ```
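   
   The timing pattern above can be factored into a small reusable helper. This is only a sketch: `timeMillis` is a hypothetical name and not part of Spark, and it uses `System.nanoTime` rather than `System.currentTimeMillis` because the former is intended for measuring elapsed time. The workload below is a stand-in; in spark-shell the body would be the `findSynonymsArray` loop.
   ```scala
   // Hypothetical helper (not part of Spark): evaluates `body` once and returns
   // its result together with the elapsed wall-clock time in milliseconds.
   def timeMillis[A](body: => A): (A, Long) = {
     val start = System.nanoTime()
     val result = body
     val elapsedMs = (System.nanoTime() - start) / 1000000L
     (result, elapsedMs)
   }
   
   // Stand-in workload: sum the integers 1 to 1000 as Longs.
   val (sum, ms) = timeMillis { (1 to 1000).map(_.toLong).sum }
   ```
   Running the timed body once before measuring (a warm-up pass) would reduce the influence of JIT compilation on the reported number.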
   
   
   Results (elapsed time in ms):
   master: 8978
   this PR: 6419, about 28.5% less time than the existing implementation
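   
   The two timings can be compared either as a reduction in elapsed time or as a speedup factor, and the two give different percentages. A minimal sketch of the arithmetic, assuming both numbers are milliseconds as returned by `System.currentTimeMillis` (the variable names are illustrative only):
   ```scala
   val masterMs = 8978L  // timing reported for master
   val prMs = 6419L      // timing reported for this PR
   
   // Fraction of elapsed time saved relative to master.
   val timeReduction = (masterMs - prMs).toDouble / masterMs
   // How many times faster the PR runs the same workload.
   val speedup = masterMs.toDouble / prMs
   
   println(s"time reduction: $timeReduction, speedup: $speedup")
   ```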


