zhengruifeng commented on pull request #31276: URL: https://github.com/apache/spark/pull/31276#issuecomment-764468057
train a model with https://en.wikipedia.org/wiki/Word2vec as the training data:

```
import org.apache.spark.ml.feature._

// Read the saved article text, one line per row, and split each line into words.
val df = spark.read.text("/d0/Dev/PRs/Word2vec")
val df2 = df.as[String].map(_.split(" ")).toDF("words")

// Fit a Word2Vec model with a single iteration and save it for the benchmark.
val w2v = new Word2Vec().setInputCol("words").setMaxIter(1)
val w2vm = w2v.fit(df2)
w2vm.save("/tmp/w2vm")
```

performance test:

```
import org.apache.spark.ml.feature._

val w2vm = Word2VecModel.load("/tmp/w2vm")
// Collect every word in the vocabulary to the driver.
val words = w2vm.getVectors.select("word").as[String].collect

// Look up the top-10 synonyms of every word, 10000 times over, and time the whole loop.
val start = System.currentTimeMillis
Seq.range(0, 10000).foreach { i =>
  words.foreach(word => w2vm.findSynonymsArray(word, 10))
}
val end = System.currentTimeMillis
val duration = end - start
```

results (`duration`, in milliseconds):

master: 8978
this PR: 6419, about 28.5% less time than the existing impl (roughly a 1.4x speedup)
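The timing loop above is measured once, with no warm-up, so JIT compilation and cache effects may be part of the reported number. A minimal sketch of a more repeatable measurement follows. The `Bench` object and its `timeMs`, `median`, and `timeSaved` helpers are hypothetical names introduced here for illustration; they are not part of Spark or of this PR.

```scala
// Hypothetical helper, not part of Spark or this PR.
object Bench {
  /** Runs `body` `warmup` times untimed, then `runs` times timed.
    * Returns the wall-clock duration of each timed run in milliseconds. */
  def timeMs(warmup: Int, runs: Int)(body: => Unit): Seq[Double] = {
    (0 until warmup).foreach(_ => body)
    (0 until runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
  }

  /** Median of a non-empty sequence. */
  def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  /** Fraction of time saved when going from `before` to `after`. */
  def timeSaved(before: Double, after: Double): Double =
    (before - after) / before
}
```

In a spark-shell session, the benchmark body could be wrapped as `Bench.median(Bench.timeMs(warmup = 1, runs = 5) { words.foreach(word => w2vm.findSynonymsArray(word, 10)) })` on each branch. The warm-up and run counts here are arbitrary choices. Applied to the numbers reported above, `Bench.timeSaved(8978, 6419)` is approximately 0.285.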
