Github user ygcao commented on a diff in the pull request:
https://github.com/apache/spark/pull/10152#discussion_r51391185
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
val expTable = sc.broadcast(createExpTable())
val bcVocab = sc.broadcast(vocab)
val bcVocabHash = sc.broadcast(vocabHash)
-
- val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
+ // each partition is a collection of sentences, will be translated into arrays of Index integer
+ val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
new Iterator[Array[Int]] {
- def hasNext: Boolean = iter.hasNext
+ var wordIter: Iterator[String] = null
+
+ def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
def next(): Array[Int] = {
val sentence = ArrayBuilder.make[Int]
var sentenceLength = 0
- while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
- val word = bcVocabHash.value.get(iter.next())
- word match {
+ // do translation of each word into its index in the vocabulary,
--- End diff --
I finally made up my mind to run a quick, hacky perf test as a proof of
concept. Across runs of 5 fits each, the performance difference between the
implementations is negligible: it is within the run-to-run variance of a
single version.
Some details:
I prepared a 32k document from two arbitrarily picked Wikipedia pages
("Machine learning" and "Adversarial machine learning", without the reference
sections). It contains 341 lines, which split into 442 sentences when simply
using dot+space as the sentence boundary. I injected the following test case
into the Word2VecSuite class and ran it against three different
implementations of the fit function in the mllib.feature.Word2Vec class: the
old one in the master branch, and my final two versions (before and after
adopting Sean's suggestion).
<pre>
// needs these imports at the top of the suite: java.io.File, scala.io.Source
test("testSpeed") {
  val lines = sc.parallelize(
    Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
  // split on dot+space into sentences, then on space into words
  val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
  println(s"read file into rdd, sentences=${sentences.count()}")
  var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
  var duration = 0L
  // fit the model 5 times and accumulate wall-clock time to magnify any difference
  for (i <- 1 to 5) {
    val start = System.currentTimeMillis()
    val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42L)
    builtModel = model.fit(sentences)
    duration += (System.currentTimeMillis() - start)
  }
  val synonyms = builtModel.findSynonyms("learning", 4).mkString("\n")
  println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, " +
    s"learning's synonyms:${synonyms}")
}
</pre>
The vocabulary size from the model is 155. Here are the times (in
milliseconds, from System.currentTimeMillis) for three runs of each version,
plus the average of the final two runs. As you can see from the code, each run
actually builds the model 5 times to magnify any potential difference.
<pre>
                      masterVersion   PR-useIter   PR-useCollection
run1                  2232            2107         1933
run2                  2085            1986         1987
run3                  2005            2123         2004
average(run2, run3)   2045            2054.5       1995.5
</pre>
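To make the "within variance" reading concrete, here is a minimal sketch that recomputes the averages and compares the gap between versions with the run-to-run spread of each version. The object and method names (`RunSpread`, `warmMean`, `spread`) are made up for illustration; the numbers are copied from the table. With only three runs per version this is a sanity check, not a statistical test.

```scala
object RunSpread {
  // Total milliseconds for 5 fits per run, copied from the table.
  val timings: Map[String, List[Double]] = Map(
    "masterVersion"    -> List(2232.0, 2085.0, 2005.0),
    "PR-useIter"       -> List(2107.0, 1986.0, 2123.0),
    "PR-useCollection" -> List(1933.0, 1987.0, 2004.0)
  )

  // Mean of all runs except the first, matching the average(run2, run3) row.
  def warmMean(runs: List[Double]): Double = runs.drop(1).sum / (runs.size - 1)

  // Max minus min across all runs of one version.
  def spread(runs: List[Double]): Double = runs.max - runs.min

  // Largest difference between the per-version warm means.
  def gapBetweenVersions: Double = {
    val means = timings.values.map(warmMean)
    means.max - means.min
  }

  def main(args: Array[String]): Unit = {
    timings.foreach { case (name, runs) =>
      println(f"$name%-17s mean(run2,run3)=${warmMean(runs)}%.1f spread=${spread(runs)}%.1f")
    }
    println(f"largest gap between version means: $gapBetweenVersions%.1f")
  }
}
```

The gap between the version means comes out smaller than the spread within any single version, which is what "within variance" means here.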
BTW, the following is not relevant to the perf test, just FYI. The two
versions in this pull request produce exactly the same result, which is
evidence for the correctness of both. The result is interesting as well:
although the dataset is quite tiny, the new (unmerged) versions look better
than the old version (the one in the master branch right now). Of course, you
can screw up the new version with a bad sentence splitter (we could write a
hard-cut splitter that does exactly the same thing as the old version). The
simple splitter used in the test case can't deal with abbreviations, which is
why I removed the references section from the text.
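A small sketch of both splitter behaviors mentioned above, in plain Scala without Spark. `naiveSplit` applies the same dot+space rule as the test case. `hardCut` is a hypothetical helper that assumes the old behavior amounts to cutting the word stream at a fixed length (as MAX_SENTENCE_LENGTH in the diff suggests); it is not the actual code from master. The sample text is invented to show the abbreviation problem.

```scala
object SplitterSketch {
  // Same rule as the test case: a period followed by a space ends a sentence.
  def naiveSplit(text: String): List[List[String]] =
    text.split("\\. ").toList.map(_.split(" ").toList)

  // Hypothetical hard cut: ignore punctuation and chunk the word stream
  // into groups of at most maxLen words.
  def hardCut(text: String, maxLen: Int): List[List[String]] =
    text.split(" ").toList.grouped(maxLen).toList

  // Invented sample: two real sentences, the first containing abbreviations.
  val sample = "See e.g. Smith et al. for details. Learning is fun."

  def main(args: Array[String]): Unit = {
    // "e.g. " and "al. " are wrongly treated as sentence boundaries.
    naiveSplit(sample).foreach(s => println(s.mkString("[", " ", "]")))
    // Fixed-size chunks that can straddle the real sentence boundary.
    hardCut(sample, 4).foreach(s => println(s.mkString("[", " ", "]")))
  }
}
```

On the sample, the naive splitter produces four "sentences" instead of two, which is the kind of noise a references section full of abbreviations would introduce.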
Here are the top synonyms of "learning" using the tiny dataset. Please keep
in mind that this is just for fun, not solid proof of which version is
better, since the dataset is tiny.
New versions: learning's synonyms:
(network,0.9990321742854605)
(related,0.9966140511173031)
(sparse,0.9965729586431097)
(algorithms,0.99376379497485)
Old version, learning's synonyms:
(against,0.9895162633562077)
(Support,0.9547255372896342)
(Association,0.9499811242788365)
(Attacks,0.9321700815006693)