zhengruifeng commented on pull request #31276:
URL: https://github.com/apache/spark/pull/31276#issuecomment-764468057


   Train a model using https://en.wikipedia.org/wiki/Word2vec as the training data:
   ```
   import org.apache.spark.ml.feature._
   
   val df = spark.read.text("/d0/Dev/PRs/Word2vec")
   val df2 = df.as[String].map(_.split(" ")).toDF("words")
   
   val w2v = new Word2Vec().setInputCol("words").setMaxIter(1)
   val w2vm = w2v.fit(df2)
   w2vm.save("/tmp/w2vm")
   
   ```
   
   Performance test:
   ```
   import org.apache.spark.ml.feature._
   
   val w2vm = Word2VecModel.load("/tmp/w2vm")
   val words = w2vm.getVectors.select("word").as[String].collect
   
   val start = System.currentTimeMillis
   Seq.range(0, 10000).foreach { i =>
     words.foreach(word => w2vm.findSynonymsArray(word, 10))
   }
   val end = System.currentTimeMillis
   val duration = end - start
   ```
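
   The timing pattern above (read the clock, run the workload, read the clock again) can be wrapped in a small helper so both branches are measured the same way. This is only a sketch: `timeMs` is a hypothetical name and not part of Spark or this PR, and the workload shown is a stand-in for the `findSynonymsArray` loop.
   ```scala
   // Hypothetical helper (not part of Spark or this PR): runs `body` once
   // and returns its result together with the elapsed wall-clock time in ms.
   def timeMs[A](body: => A): (A, Long) = {
     val start = System.nanoTime()
     val result = body
     val elapsedMs = (System.nanoTime() - start) / 1000000L
     (result, elapsedMs)
   }

   // Stand-in workload; in the test above this would be the nested
   // foreach loop calling w2vm.findSynonymsArray(word, 10).
   val (sum, elapsed) = timeMs { (1 to 1000).sum }
   println(s"result=$sum, elapsed=${elapsed} ms")
   ```
   Running the workload once untimed before measuring may reduce JIT warm-up noise; that is a general JVM benchmarking habit, not something this test did.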
   
   
   Results (duration in ms):
   master: 8978
   this PR: 6419, about 28.5% less time than the existing implementation
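
   For reference, the comparison can be recomputed from the two durations reported above. A minimal sketch (the variable names are illustrative only): the PR takes roughly 28.5% less wall-clock time, which corresponds to a speedup of about 1.40x.
   ```scala
   // Durations in ms, copied from the results above.
   val masterMs = 8978L
   val prMs = 6419L

   // Fraction of wall-clock time saved relative to master.
   val timeReduction = (masterMs - prMs).toDouble / masterMs
   // How many times faster the PR runs than master.
   val speedup = masterMs.toDouble / prMs

   println(f"time reduction: ${timeReduction * 100}%.1f%%")
   println(f"speedup: $speedup%.2fx")
   ```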




