Github user shubhamchopra commented on a diff in the pull request:
https://github.com/apache/spark/pull/17673#discussion_r143569334
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2VecCBOWSolver.scala ---
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.mllib.feature
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+object Word2VecCBOWSolver extends Logging {
+ // learning rate is updated for every batch of size batchSize
+ private val batchSize = 10000
+
+ // power to raise the unigram distribution with
+ private val power = 0.75
+
+ private val MAX_EXP = 6
+
+ case class Vocabulary(
+ totalWordCount: Long,
+ vocabMap: Map[String, Int],
+ unigramTable: Array[Int],
+ samplingTable: Array[Float])
+
+ /**
+ * This method implements the continuous bag-of-words (CBOW) variant of Word2Vec
+ * with the negative sampling optimization, using BLAS for vectorized operations
+ * where applicable.
+ * The algorithm is parallelized in the same way as the skip-gram based estimation.
+ * We divide the input data into N equally sized random partitions.
+ * We then generate initial weights and broadcast them to the N partitions, so that
+ * all the partitions start with the same initial weights. We then run N independent
+ * estimations that each estimate a model on a partition. The weights learned
+ * from each of the N models are averaged and the averaged weights are rebroadcast.
+ * This process is repeated `maxIter` number of times.
+ *
+ * @param input An RDD of strings. Each string would be considered a sentence.
+ * @return Estimated word2vec model
+ */
+ def fitCBOW[S <: Iterable[String]](
+ word2Vec: Word2Vec,
+ input: RDD[S]): feature.Word2VecModel = {
+
+ val negativeSamples = word2Vec.getNegativeSamples
+ val sample = word2Vec.getSample
+
+ val Vocabulary(totalWordCount, vocabMap, uniTable, sampleTable) =
+ generateVocab(input, word2Vec.getMinCount, sample, word2Vec.getUnigramTableSize)
+ val vocabSize = vocabMap.size
+
+ assert(negativeSamples < vocabSize, s"Vocab size ($vocabSize) cannot be smaller" +
+ s" than negative samples ($negativeSamples)")
+
+ val seed = word2Vec.getSeed
+ val initRandom = new XORShiftRandom(seed)
+
+ val vectorSize = word2Vec.getVectorSize
+ val syn0Global = Array.fill(vocabSize * vectorSize)(initRandom.nextFloat - 0.5f)
--- End diff --
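To make the parallelization scheme in the scaladoc above concrete, here is a rough sketch of the broadcast / train-per-partition / average loop it describes. This is only an illustration under my own naming, not the PR's code; `trainPartition` is a hypothetical stand-in for the per-partition CBOW update loop.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def averageIterations(
    sc: SparkContext,
    data: RDD[Array[Int]],  // sentences already mapped to vocabulary indices
    initialWeights: Array[Float],
    maxIter: Int)(
    trainPartition: (Iterator[Array[Int]], Array[Float]) => Array[Float]): Array[Float] = {
  var weights = initialWeights
  val numPartitions = data.getNumPartitions
  for (_ <- 0 until maxIter) {
    // Broadcast the current weights so every partition starts from the same values.
    val bcWeights = sc.broadcast(weights)
    // Run one independent estimation per partition, then sum the resulting weight arrays.
    val summed = data
      .mapPartitions(it => Iterator.single(trainPartition(it, bcWeights.value.clone())))
      .reduce { (a, b) =>
        var i = 0
        while (i < a.length) { a(i) += b(i); i += 1 }
        a
      }
    // Average the N per-partition models; the average is rebroadcast on the next iteration.
    weights = summed.map(_ / numPartitions)
    bcWeights.destroy()
  }
  weights
}
```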
There are a few places where the implementation seems counter-intuitive and doesn't
map directly onto the papers it's based on. I thought keeping these names might help
if someone wants to cross-check this implementation against the C implementation.
Other than that, I do agree with you, and have tried to name things appropriately,
apart from these key variables on which the implementation is built.
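For example, the unigram table with the 0.75 power corresponds to the `InitUnigramTable` routine in the reference C code. A rough Scala sketch of that routine (my own naming, not the code in this PR) looks like the following, in case it helps the cross-check:

```scala
// Sketch of the classic negative-sampling unigram table: each word gets a share of the
// table proportional to count^power, and a negative sample is drawn by indexing the
// table with a uniform random integer. wordCounts is assumed to be ordered by vocab index.
def buildUnigramTable(wordCounts: Array[Long], tableSize: Int, power: Double = 0.75): Array[Int] = {
  val table = new Array[Int](tableSize)
  val totalPow = wordCounts.map(c => math.pow(c.toDouble, power)).sum
  var word = 0
  // Cumulative probability mass of words [0..word] under the count^power distribution.
  var cumulative = math.pow(wordCounts(word).toDouble, power) / totalPow
  var i = 0
  while (i < tableSize) {
    table(i) = word
    if (i.toDouble / tableSize > cumulative && word < wordCounts.length - 1) {
      word += 1
      cumulative += math.pow(wordCounts(word).toDouble, power) / totalPow
    }
    i += 1
  }
  table
}
```

A negative sample is then drawn as `table(random.nextInt(table.length))`, re-drawing whenever it collides with the target word.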