Digging deeper, I note that for learning the dimension-reduction matrix, the folks in the paper I referred to are using L-BFGS, which is a limited-memory quasi-Newton method -- it approximates the inverse Hessian from a short history of recent gradients rather than computing Newton's step directly...
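Just to make the distinction concrete -- here is a minimal sketch of running L-BFGS on a toy quadratic via SciPy (this is purely illustrative; the objective `f` and its gradient are made up for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Rather than forming the full Hessian as Newton's method would,
# L-BFGS-B builds a low-rank approximation to the inverse Hessian
# from a short history of gradient differences.
def f(x):
    return float(((x - 3.0) ** 2).sum())   # toy quadratic, minimum at x = 3

def grad(x):
    return 2.0 * (x - 3.0)

result = minimize(f, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
```

For a quadratic like this it converges in a handful of iterations; the interesting cases are of course the nonconvex ones arising in NN training.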
I was thinking about recent work by OpenAI using CMA-ES for neural net weight learning (in an RL context), and then I remembered a paper about hybridizing CMA-ES with Newton's Method:

http://www.dem.ist.utl.pt/engopt2010/Book_and_CD/Papers_CD_Final_Version/pdf/08/01534-01.pdf

I suspect this sort of hybridization could be very useful for NN learning, combining the best of evolutionary learning and gradient descent (as has frequently been done in other contexts)... I imagine implementing this in TensorFlow would not be very challenging for someone who knows the framework, since these are all fairly basic operations (matrix operations, radial basis functions, etc.)...

-- Ben

On Thu, May 11, 2017 at 5:40 PM, Ben Goertzel <[email protected]> wrote:
> On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <[email protected]> wrote:
>> There are two hard parts to clustering. One is writing all the code to get
>> the clusters working in the pipeline. I guess I'll have to do that. The
>> other is dealing with words with multiple meanings: "I saw the man with the
>> saw" -- clustering really needs to distinguish saw the verb from saw the
>> noun. I'm not yet clear about the details of this; I've a glimmer of the
>> general idea.
>
> I was thinking of exploring this with (fairly shallow) neural networks...
>
> This paper
>
> https://nlp.stanford.edu/pubs/HuangACL12.pdf
>
> which I've pointed out before, does unsupervised construction of
> word2vec-type vectors for word senses (thus doing sense disambiguation
> mixed in with the dimension-reduction process).
>
> Now, that algorithm takes sentences as inputs, not parse trees. But I
> think you could modify the approach to apply to our context in an
> interesting way...
>
> The following describes one way to do this; I'm sure there are others.
>
> 1) A first step would be to use the OpenCog pattern miner to mine the
> surprising patterns from the set of parse trees produced by MST
> parsing.
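The OpenCog pattern miner is the intended tool for step 1; purely as a toy stand-in to illustrate what "surprising pattern" means here (the function name and the surprise measure below are my own inventions, not OpenCog's), one could take a "pattern" to be a pair of parse links sharing a word, and call it surprising when it co-occurs much more often than independence of its component links would predict:

```python
from collections import Counter
from itertools import combinations

def mine_surprising_patterns(parses, min_surprise=2.0):
    """Toy pattern miner. Each parse is a set of (word, word) links;
    a 'pattern' is a pair of links sharing a word, and its surprise is
    the ratio of its observed count to the count expected if the two
    links occurred independently across parses."""
    link_counts = Counter()
    pair_counts = Counter()
    n = len(parses)
    for links in parses:
        link_counts.update(links)
        for a, b in combinations(sorted(links), 2):
            if set(a) & set(b):              # links share a word
                pair_counts[(a, b)] += 1
    surprising = {}
    for (a, b), observed in pair_counts.items():
        expected = link_counts[a] * link_counts[b] / n
        ratio = observed / expected
        if ratio >= min_surprise:
            surprising[(a, b)] = ratio
    return surprising

parses = [{("saw", "man"), ("saw", "with")}] * 4 + [{("dog", "ran")}] * 4
result = mine_surprising_patterns(parses)
```

The real miner of course handles much richer patterns than link pairs; this just shows the frequency-vs-expectation flavor of "surprisingness".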
>
> 2) Then, one could associate with each word-instance W a set of
> instance-pattern-vectors. Each instance vector is very sparse, and
> contains an entry for each of the patterns (among the surprising
> patterns found in step 1) that W is involved in. Given these
> instance-pattern-vectors, one can also calculate word-pattern-vectors
> or word-sense-pattern-vectors (by averaging the instance-vectors for
> all instances of the word or word-sense).
>
> 3) Their algorithm involves an embedding matrix L that maps a binary
> vector, with a 1 in position i representing the i'th word in the
> dictionary, into a much smaller dense vector. I would suggest instead
> having an embedding matrix L that maps the pattern-vectors
> representing words or senses (constructed in step 2) into a much
> smaller dense vector. This is word2vec-ish, but the data it's drawing
> on is the set of patterns observed in a corpus of parse trees...
>
> 4) Their algorithm uses, in the local score function, a sequence
> [x1, ..., xm], where xi is the embedding vector assigned to word i in
> the sequence being looked at. Instead, we could use a structure like
> the following, where w is the word being predicted and S is the
> sentence containing w:
>
> [ avg. embedding vector of words one link to the left of w in the
> parse tree of S, avg. embedding vector of words one link to the right
> of w in the parse tree of S, avg. embedding vector of words two links
> to the left of w in the parse tree of S, avg. embedding vector of
> words two links to the right of w in the parse tree of S ]
>
> This context-matrix is a way of capturing "the embedding vectors of
> the words constituting the context of w in parsed sentence S" as a
> linear vector... Stopping at "two links away" is arbitrary; probably
> we want to go 4-5 links away (yielding a vector of length 8-10). This
> would have to be experimented with...
>
> ...
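Steps 2-4 can be sketched in a few dozen lines of numpy. This is only a sketch under simplifying assumptions: the embedding matrix L is random here (in the real algorithm it would be learned skip-gram style), patterns are opaque hashable keys, and all function names are my own:

```python
import numpy as np
from collections import deque

def pattern_vectors(instance_patterns, all_patterns):
    """Step 2: one sparse 0/1 row per word-instance, with a 1 for each
    surprising pattern that instance participates in."""
    idx = {p: i for i, p in enumerate(all_patterns)}
    X = np.zeros((len(instance_patterns), len(all_patterns)))
    for r, pats in enumerate(instance_patterns):
        for p in pats:
            X[r, idx[p]] = 1.0
    return X

def embed(X, dim, rng):
    """Step 3: map pattern-vectors through an embedding matrix L into a
    much smaller dense space (L is random here, learned in reality)."""
    L = rng.standard_normal((X.shape[1], dim)) / np.sqrt(X.shape[1])
    return X @ L

def context_vector(w_pos, links, emb, max_dist=2):
    """Step 4: concatenate, for d = 1..max_dist, the average embedding
    of words exactly d parse-tree links to the left of position w_pos,
    then d links to the right (zeros where no such words exist)."""
    adj = {}
    for a, b in links:                       # undirected parse links
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist, queue = {w_pos: 0}, deque([w_pos])
    while queue:                             # BFS for link distances
        u = queue.popleft()
        if dist[u] < max_dist:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    dim, parts = emb.shape[1], []
    for d in range(1, max_dist + 1):
        for keep in ((lambda p: p < w_pos), (lambda p: p > w_pos)):
            vecs = [emb[p] for p, dd in dist.items() if dd == d and keep(p)]
            parts.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.concatenate(parts)

# Toy check: a 4-word chain parse 0-1-2-3, identity embeddings,
# context of position 1 out to two links away.
emb = np.eye(4)
cv = context_vector(1, [(0, 1), (1, 2), (2, 3)], emb, max_dist=2)
```

With `max_dist=5` and embedding dimension d, the context vector has length 10*d, matching the "vector of length 8-10" (in blocks) mentioned above.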
>
> Given these changes, one could apply the algorithm in the paper for
> sense disambiguation and clustering...
>
> Of course, there would also be a lot of other ways to mix up the
> ingredients mentioned above... The two novel ingredients I have
> introduced are:
>
> * creating dense vectors for words or senses from pattern-vectors
>
> * creating context-matrices partly capturing the context of a
> word-instance (or word or sense) based on a corpus of parse trees...
>
> ...and one could play with these in many different ways.
>
> To put it more precisely, there are a lot of ways one could iteratively:
>
> -- cluster word-instances based on their context-matrices (thus
> generating word labels)
>
> -- learn an embedding matrix (starting from pattern-vectors) that
> enables accurate skip-gram prediction based on knowing the labels of
> the words produced by the clustering done in the preceding step
>
> Mimicking the algorithm from the above paper (with the changes I've
> suggested) is one way to do this, but there are lots of other ways one
> could try...
>
> -- Ben
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
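P.S. One round of the cluster-then-embed alternation described above can be sketched like so. To keep it self-contained I substitute plain k-means for the clustering and a least-squares fit for skip-gram training of L -- both are stand-ins, and all names here are hypothetical:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_then_embed(context_vecs, pattern_vecs, k):
    """One round of the alternation: (a) cluster word-instances by
    context vector into k provisional sense labels; (b) fit an embedding
    matrix L so that pattern-vectors linearly predict those labels
    (least squares stands in for skip-gram training). In practice one
    would then rebuild context vectors from pattern_vecs @ L and repeat."""
    labels = kmeans(context_vecs, k)
    targets = np.eye(k)[labels]                       # one-hot sense labels
    L, *_ = np.linalg.lstsq(pattern_vecs, targets, rcond=None)
    return labels, L

# Toy run: two well-separated context clusters whose members also have
# distinct pattern-vectors, so L can recover the labels exactly.
ctx = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
pat = np.vstack([np.tile([1.0, 0.0], (5, 1)), np.tile([0.0, 1.0], (5, 1))])
labels, L = cluster_then_embed(ctx, pat, 2)
```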
