[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

MLnick Mon, 14 Dec 2015 11:35:53 -0800

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47545680
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    @ygcao I'm still not certain what the benefit of exposing these is. `getVectors` already returns the map of vectors for every word in the vocabulary, so you only need to call it once. You could then either broadcast that map or parallelize it and join it against a set of words. How is that different from calling a combination of `getVocabulary` and `getWordVectors`?
    
    If `getVectors` can be made substantially faster, then that is something we should do (in which case it would be great to have some relative timings for different vocabulary sizes and vector dimensions).
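
For reference, the two options mentioned above (broadcast the map, or parallelize it and join) could look roughly like the sketch below. This is only an illustration, not code from the pull request: the `VectorLookup` object and its method names are hypothetical, and `vectors` is assumed to be the `Map[String, Array[Float]]` returned by a single call to `getVectors`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark or of this pull request.
object VectorLookup {

  // Option 1: broadcast the map once and look each word up on the executors.
  // Words that are missing from the vocabulary are dropped.
  def viaBroadcast(
      sc: SparkContext,
      vectors: Map[String, Array[Float]],
      words: RDD[String]): RDD[(String, Array[Float])] = {
    val bcVectors = sc.broadcast(vectors)
    words.flatMap(w => bcVectors.value.get(w).map(v => (w, v)))
  }

  // Option 2: turn the map into a pair RDD and use a regular (shuffle) join.
  // The inner join also drops words that are missing from the vocabulary.
  def viaJoin(
      sc: SparkContext,
      vectors: Map[String, Array[Float]],
      words: RDD[String]): RDD[(String, Array[Float])] = {
    val vectorRdd = sc.parallelize(vectors.toSeq)
    words.map(w => (w, ())).join(vectorRdd).mapValues(_._2)
  }
}
```

A call such as `VectorLookup.viaBroadcast(sc, model.getVectors, words)` needs only one call to `getVectors`. Broadcasting tends to suit a map that fits comfortably in executor memory, while the join variant avoids shipping the whole map to every executor.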



