There are several ways I can compute the cosine similarities between a Spark ML
vector to each ML vector in a Spark DataFrame column then sorting for the
highest results. However, I can't come up with a method that is faster than
replacing the `/data/` in a Spark ML Word2Vec model, then using
`.findSynonyms()`. The problem is the Word2Vec model is held entirely in the
driver which can cause memory issues if the data set I want to compare to gets
too big.
1. Is there a more efficient method than the ones I have shown below?
2. Could the data for the Word2Vec model be distributed across the cluster?
3. Could the the `.findSynonyms()` [Scala
code](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L571toL619)
be modified to make a spark sql function that can operate efficiently over a
whole Spark DataFrame?
Methods I have tried:
#1 rdd function:
```
# vecIn = vector of same dimensions as 'vectors' column
def cosSim(row, vecIn):
return (
tuple(( Vectors.dense( Vectors.dense(row.vectors.dot(vecIn)) /
(Vectors.dense(np.sqrt(row.vectors.dot(row.vectors))) *
Vectors.dense(np.sqrt(vecIn.dot(vecIn)
).toArray().tolist()))
df.rdd.map(lambda row: cosSim(row,
vecIn)).toDF(['CosSim']).show(truncate=False)
```
#2 `.toIndexedRowMatrix().columnSimilarities()` then filter the results (not
shown):
```
spark.createDataFrame(
IndexedRowMatrix(df.rdd.map(lambda row: (row.vectors.toArray(
.toBlockMatrix()
.transpose()
.toIndexedRowMatrix()
.columnSimilarities()
.entries)
```
#3 replace Word2Vec model `/data/` with my own, then load 'revised' model and
use `.findSynonyms()`:
```
df_words_vectors.schema
##
StructType(List(StructField(word,StringType,true),StructField(vector,ArrayType(FloatType,true),true)))
df_words_vectors.write.parquet("exiting_Word2Vec_model/data/",
mode='overwrite')
new_Word2Vec_model = Word2VecModel.load("exiting_Word2Vec_model")
## vecIn = vector of same dimensions as 'vector' column in DataFrame saved
over Word2Vec model /data/
new_Word2Vec_model.findSynonyms(vecIn, 20).show()
```
Clay Stevens