On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <[email protected]> wrote:
> There are two hard parts to clustering. One is writing all the code to
> get the clusters working in the pipeline. I guess I'll have to do that.
> The other is dealing with words with multiple meanings: "I saw the man
> with the saw" and clustering really needs to distinguish saw the verb
> from saw the noun. Not yet clear about the details of this. i've a
> glimmer of the general idea,
I was thinking of exploring this with (fairly shallow) neural networks...

This paper, https://nlp.stanford.edu/pubs/HuangACL12.pdf, which I've pointed out before, does unsupervised construction of word2vec-type vectors for word senses (thus doing sense disambiguation mixed in with the dimension-reduction process). Now, that algorithm takes sentences as inputs, not parse trees. But I think you could modify the approach to apply to our context in an interesting way. The following describes one way to do this; I'm sure there are others.

1) A first step would be to use the OpenCog pattern miner to mine the surprising patterns from the set of parse trees produced by MST parsing.

2) Then, one could associate with each word-instance W a set of instance-pattern-vectors. Each instance vector is very sparse, and contains an entry for each of the patterns (among the surprising patterns found in step 1) that W is involved in. Given these instance-pattern-vectors, one can also calculate word-pattern-vectors or word-sense-pattern-vectors (by averaging the instance-vectors over all instances of the word or word-sense).

3) Their algorithm involves an embedding matrix L that maps a binary vector, with a 1 in position i representing the i'th word in the dictionary, into a much smaller dense vector. I would suggest instead having an embedding matrix L that maps the pattern-vectors representing words or senses (constructed in step 2) into a much smaller dense vector. This is word2vec-ish, but the data it draws on is the set of patterns observed in a corpus of parse trees...

4) Their algorithm involves, in the local score function, a sequence [x1, ..., xm], where xi is the embedding vector assigned to word i in the sequence being looked at. Instead, we could use a structure like the following, where w is the word being predicted and S is the sentence containing w:

[ avg. embedding vector of words one link to the left of w in the parse tree of S,
  avg.
embedding vector of words one link to the right of w in the parse tree of S,
  avg. embedding vector of words two links to the left of w in the parse tree of S,
  avg. embedding vector of words two links to the right of w in the parse tree of S ]

This context-matrix is a way of capturing "the embedding vectors of the words constituting the context of w in parsed sentence S" as a linear vector... Stopping at "two links away" is arbitrary; probably we want to go 4-5 links away (yielding a vector of length 8-10), but this would have to be experimented with...

Given these changes, one could apply the algorithm in the paper for sense disambiguation and clustering...

Of course, there would also be a lot of other ways to mix up the same ingredients mentioned above... the two unique ingredients I have introduced are

* creating dense vectors for words or senses from pattern-vectors
* creating context-matrices partly capturing the context of a word-instance (or word or sense) based on a corpus of parse trees

...and one could play with these in many different ways. To put it more precisely, there are a lot of ways that one could iteratively

-- cluster word-instances based on their context-matrices (thus generating word labels)
-- learn an embedding matrix (starting from pattern-vectors) that enables accurate skip-gram prediction, based on knowing the labels of the words produced by the clustering done in the preceding step

Mimicking the algorithm from the above paper (with the changes I've suggested) is one way to do this, but there are lots of other ways one could try...

-- Ben

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
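To make steps 2 and 3 concrete, here is a minimal toy sketch in Python/numpy. The pattern vocabulary, the word-instances, and the use of a fixed random projection as a stand-in for the learned matrix L are all illustrative assumptions, not OpenCog APIs:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Pretend step 1 mined these surprising patterns (indices 0..4).
patterns = ["det-noun", "subj-verb", "verb-obj", "prep-noun", "adj-noun"]
n_patterns = len(patterns)

# Step 2: each word-instance W gets a sparse instance-pattern-vector,
# counting the mined patterns W participates in.
# Toy data: (word, instance_id) -> pattern indices the instance occurs in.
instance_patterns = {
    ("saw", 0): [1],        # "saw" as verb: subj-verb
    ("saw", 1): [0, 3],     # "saw" as noun: det-noun, prep-noun
    ("man", 0): [0, 2],
}

def instance_vector(pattern_ids):
    v = np.zeros(n_patterns)
    for i in pattern_ids:
        v[i] += 1.0
    return v

inst_vecs = {k: instance_vector(v) for k, v in instance_patterns.items()}

# Word-pattern-vector = average of the instance-vectors for that word.
grouped = defaultdict(list)
for (word, _), vec in inst_vecs.items():
    grouped[word].append(vec)
word_vecs = {w: np.mean(vs, axis=0) for w, vs in grouped.items()}

# Step 3: an embedding matrix L maps pattern-vectors into a much smaller
# dense vector. Here L is a fixed random projection; in the actual scheme
# it would be learned jointly with the skip-gram-style objective.
d = 2
L = rng.normal(size=(d, n_patterns))

def embed(pattern_vec):
    return L @ pattern_vec

dense_saw = embed(word_vecs["saw"])  # small dense vector for "saw"
```

Note that the two senses of "saw" already get different instance-vectors here, which is what the later clustering step would exploit.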
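The context-matrix of step 4 could be sketched like this -- again a toy sketch, where the sentence, its MST links, and the per-word dense embeddings are made-up stand-ins (real embeddings would come from the matrix L of step 3), and link distance is just BFS distance in the undirected parse tree:

```python
import numpy as np
from collections import deque

d = 2
# Toy dense embeddings per word (stand-ins for the output of matrix L).
emb = {
    "I": np.array([1.0, 0.0]), "saw": np.array([0.0, 1.0]),
    "the": np.array([0.5, 0.5]), "man": np.array([1.0, 1.0]),
    "with": np.array([0.2, 0.8]),
}

# Made-up undirected MST-parse links for "I saw the man with the saw",
# given as pairs of word positions.
words = ["I", "saw", "the", "man", "with", "the", "saw"]
links = [(0, 1), (1, 3), (2, 3), (3, 4), (5, 6), (4, 6)]

def link_distances(n, links, src):
    """BFS distance (in parse links) from src to every reachable position."""
    adj = [[] for _ in range(n)]
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def context_vector(w_pos, max_links=2):
    """Concatenate avg. embeddings of words k links away, left and right
    of w, for k = 1..max_links (zeros where a slot is empty)."""
    dist = link_distances(len(words), links, w_pos)
    parts = []
    for k in range(1, max_links + 1):
        left = [emb[words[p]] for p, dk in dist.items() if dk == k and p < w_pos]
        right = [emb[words[p]] for p, dk in dist.items() if dk == k and p > w_pos]
        parts.append(np.mean(left, axis=0) if left else np.zeros(d))
        parts.append(np.mean(right, axis=0) if right else np.zeros(d))
    return np.concatenate(parts)  # length 2 * max_links * d

cv = context_vector(1)  # context of the first "saw" (the verb)
```

Computing `context_vector` for each instance of "saw" and clustering the results is then the instance-clustering half of the iterative loop described above; going to 4-5 links just means raising `max_links`.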
