Author: tommaso
Date: Tue Oct 6 13:15:12 2015
New Revision: 1707048

URL: http://svn.apache.org/viewvc?rev=1707048&view=rev
Log:
randomized test word
Modified:
    labs/yay/trunk/core/src/test/java/org/apache/yay/core/Word2VecTest.java
    labs/yay/trunk/core/src/test/resources/word2vec/sentences.txt

Modified: labs/yay/trunk/core/src/test/java/org/apache/yay/core/Word2VecTest.java
URL: http://svn.apache.org/viewvc/labs/yay/trunk/core/src/test/java/org/apache/yay/core/Word2VecTest.java?rev=1707048&r1=1707047&r2=1707048&view=diff
==============================================================================
--- labs/yay/trunk/core/src/test/java/org/apache/yay/core/Word2VecTest.java (original)
+++ labs/yay/trunk/core/src/test/java/org/apache/yay/core/Word2VecTest.java Tue Oct 6 13:15:12 2015
@@ -70,12 +70,12 @@ public class Word2VecTest {
     FeedForwardStrategy predictionStrategy = new FeedForwardStrategy(new IdentityActivationFunction<Double>());
     BackPropagationLearningStrategy learningStrategy = new BackPropagationLearningStrategy(BackPropagationLearningStrategy.
             DEFAULT_ALPHA, -1, BackPropagationLearningStrategy.DEFAULT_THRESHOLD, predictionStrategy, new LMSCostFunction(),
-            20);
+            5);
     NeuralNetwork neuralNetwork = NeuralNetworkFactory.create(randomWeights, learningStrategy, predictionStrategy);
     neuralNetwork.learn(trainingSet);
-    String word = "paper";
+    String word = vocabulary.get(new Random().nextInt(vocabulary.size()));
 //    final Double[] doubles = ConversionUtils.toValuesCollection(next.getFeatures()).toArray(new Double[next.getFeatures().size()]);
     final Double[] doubles = hotEncode(word, vocabulary);
 //    String word = hotDecode(doubles, vocabulary);
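Note on the change above: the test now probes the network with a word drawn at random from the vocabulary instead of the hard-coded "paper", and then one-hot encodes it via the test's hotEncode helper, which is not shown in this diff. As a rough standalone illustration of that pattern only (the class and method names below are placeholders, not Yay APIs):

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class OneHotSketch {

  // One-hot encode a word against a vocabulary: 1.0 at the word's index, 0.0 elsewhere.
  static Double[] oneHot(String word, List<String> vocabulary) {
    Double[] vector = new Double[vocabulary.size()];
    Arrays.fill(vector, 0d);
    int index = vocabulary.indexOf(word);
    if (index >= 0) {
      vector[index] = 1d;
    }
    return vector;
  }

  public static void main(String[] args) {
    List<String> vocabulary = Arrays.asList("word2vec", "skip", "gram", "paper", "context");
    // Pick a random vocabulary word, as the updated test does, then encode it.
    String word = vocabulary.get(new Random().nextInt(vocabulary.size()));
    System.out.println(word + " -> " + Arrays.toString(oneHot(word, vocabulary)));
  }
}

Randomizing the probe word presumably keeps the test from depending on one specific token in the training corpus.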
Modified: labs/yay/trunk/core/src/test/resources/word2vec/sentences.txt
URL: http://svn.apache.org/viewvc/labs/yay/trunk/core/src/test/resources/word2vec/sentences.txt?rev=1707048&r1=1707047&r2=1707048&view=diff
==============================================================================
--- labs/yay/trunk/core/src/test/resources/word2vec/sentences.txt (original)
+++ labs/yay/trunk/core/src/test/resources/word2vec/sentences.txt Tue Oct 6 13:15:12 2015
@@ -1,8 +1,15 @@
-The word2vec software of Tomas Mikolov and colleagues1 has gained a lot of traction lately and provides state-of-the-art word embeddings
-The learning models behind the software are described in two research papers.
+The word2vec software of Tomas Mikolov and colleagues has gained a lot of traction lately and provides state-of-the-art word embeddings
+The learning models behind the software are described in two research papers
 We found the description of the models in these papers to be somewhat cryptic and hard to follow
 While the motivations and presentation may be obvious to the neural-networks language-modeling crowd we had to struggle quite a bit to figure out the rationale behind the equations
 This note is an attempt to explain the negative sampling equation in “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean
 The departure point of the paper is the skip-gram model
 In this model we are given a corpus of words w and their contexts c
-We consider the conditional probabilities p(c|w) and given a corpus Text, the goal is to set the parameters θ of p(c|w; θ) so as to maximize the corpus probability
\ No newline at end of file
+We consider the conditional probabilities p(c|w) and given a corpus Text, the goal is to set the parameters θ of p(c|w;θ) so as to maximize the corpus probability
+The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships
+In this paper we present several extensions that improve both the quality of the vectors and the training speed
+By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations
+We also describe a simple alternative to the hierarchical softmax called negative sampling
+An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases
+For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”
+Motivated by this example, we present a simple method for finding phrases in text and show that learning good vector representations for millions of phrases is possible
\ No newline at end of file
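For reference, the objective the quoted sentences allude to ("set the parameters θ of p(c|w;θ) so as to maximize the corpus probability") is conventionally written as a product over the observed word-context pairs. In the usual skip-gram presentation, with D denoting the set of pairs extracted from the corpus (a symbol the snippet itself does not introduce), it reads:

  \arg\max_{\theta} \prod_{(w,c) \in D} p(c \mid w; \theta)

or equivalently, taking logarithms,

  \arg\max_{\theta} \sum_{(w,c) \in D} \log p(c \mid w; \theta)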