GitHub user ezli opened a pull request:
https://github.com/apache/spark/pull/6131
[MLLIB]: Avoid division by a zero norm, cache normalized word vectors to speed up findSynonyms()
1. Check whether norm == 0 before dividing, and add a ScalaTest case for
the divide-by-zero scenario;
2. Cache the normalized wordVectors to speed up repeated findSynonyms()
calls;
3. Load wordVectors and wordVectorsNormalized lazily;
4. Normalize fVector in findSynonyms() so that cosine distances are
comparable across all words.
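The four points above can be illustrated with a minimal, dependency-free sketch. It is not the code from this pull request: the names SynonymSketch, Model, normalize and normalized are hypothetical, and plain Scala collections stand in for the MLlib data structures. It only shows the general idea of guarding the division, caching normalized vectors behind a lazy val, and normalizing the query vector before comparing.

```scala
// Hypothetical sketch; not the Word2VecModel code from the pull request.
object SynonymSketch {
  // Euclidean norm of a vector.
  def norm(v: Array[Float]): Double =
    math.sqrt(v.map(x => x.toDouble * x).sum)

  // Scale v to unit length. A zero vector is returned unchanged (as a copy),
  // so no division by a zero norm ever happens.
  def normalize(v: Array[Float]): Array[Float] = {
    val n = norm(v)
    if (n == 0.0) v.clone() else v.map(x => (x / n).toFloat)
  }

  final class Model(raw: Map[String, Array[Float]]) {
    // Computed on first access, then reused by every findSynonyms call.
    lazy val normalized: Map[String, Array[Float]] =
      raw.map { case (w, v) => w -> normalize(v) }

    // Returns the num words with the highest cosine similarity to query.
    def findSynonyms(query: Array[Float], num: Int): Seq[(String, Double)] = {
      val q = normalize(query) // normalize the query too, so scores are comparable
      normalized.toSeq
        .map { case (w, v) => w -> v.zip(q).map { case (a, b) => a.toDouble * b }.sum }
        .sortBy(-_._2)
        .take(num)
    }
  }
}
```

In this sketch a word whose vector is all zeros scores 0.0 against any query instead of NaN, and the normalization cost is paid once rather than on every call.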
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ezli/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6131.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6131
----
commit 459bd93cb084f37361d92d5ef05508321d98481b
Author: Eric Li <[email protected]>
Date: 2015-05-13T17:47:40Z
1. Cache the normalized wordVectors, speed up multiple findSynonyms()
calls; 2. Do lazy loading for wordVectors and wordVectorsNormalized; 3. Check
if norm == 0 when calculate the division; 4. Normalize fVector in
findSynonyms() to make cosine distances comparable across all words.
commit 91013053aa665f1024a636c7b562324dc5f2de4c
Author: Eric Li <[email protected]>
Date: 2015-05-13T17:52:25Z
Merge remote-tracking branch 'upstream/master'
commit 52c867e005ade5aae794985ad67f6076b50d44e1
Author: Eric Li <[email protected]>
Date: 2015-05-13T20:38:51Z
Add test for Word2VecModel for norm equals to 0
commit 05453206ef774f28cdd75c02ce667767800e5d7b
Author: Eric Li <[email protected]>
Date: 2015-05-13T20:40:58Z
Merge remote-tracking branch 'upstream/master'
----