Here's an interesting thought on how to do pattern mining in a corpus of mathematical proofs, or a corpus of probabilistic-logic proofs (like the ones we'll gather from applying PLN). It's inspired by the email below, which I just sent regarding how to combine "pattern mining on parse trees" with NN learning for finding syntactic categories and doing word sense disambiguation...
Suppose you have a corpus of proofs... Then:

1) Find surprising patterns in the corpus, using e.g. the OpenCog Pattern Miner. Each such pattern is a series of inference-rule applications, with a mix of concrete terms, categories of terms, or variables being subjected to the rules.

2) Associate each node in an inference tree (each inference-rule application) with a large sparse vector, which has an entry for each possible pattern identified in step 1. (So a 1 in entry i indicates that the i'th pattern in the dictionary occurs in the node corresponding to that vector.)

3) Learn a big matrix that compresses these sparse vectors into dense vectors (more on this below).

4) Now you can associate a dense vector with each node in the inference tree. You can then learn a neural net that tries, in "skip-gram" style, to predict a node in an inference tree given its surrounding context (where the surrounding context can be summarized in various ways, including simple ways like the context-matrix I suggested in step 4 of my linguistics algorithm below, and potentially other ways). You can also do this separately for successful and unsuccessful inference chains.

5) Co-learn the matrix in step 3 with the NN in step 4.

6) Using the predictor in step 4, guide forward and/or backward chaining inference (e.g. using the NN trained on successful inferences to help choose which inference-tree nodes are likely to be good).

7) After doing step 6 for a while, extract new patterns, update your pattern library in step 1, and repeat the whole thing.

Note that the learning in step 5 and the inference-step selection in step 6 can both be framed using probabilistic programming. This combines "deep math"-type analysis with information-theoretic pattern mining. Of course, one can also add:

1') Apply probabilistic inference to extrapolate from the patterns mined, to learn new conjectural patterns. Throw those into the mix when going to step 2.
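As a minimal sketch of steps 2-4 and 6: each inference-tree node gets a sparse binary pattern-vector, a compression matrix maps it to a dense vector, and a candidate inference step is scored against the mean embedding of its context nodes. All names and sizes here are hypothetical, and the compression matrix is just a random projection standing in for the co-learned matrix of steps 3 and 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 500 mined patterns, 64-dim dense embeddings.
N_PATTERNS, DENSE_DIM = 500, 64

def sparse_pattern_vector(pattern_ids):
    """Step 2: binary vector with a 1 in entry i iff mined pattern i
    occurs at this inference-tree node."""
    v = np.zeros(N_PATTERNS)
    v[list(pattern_ids)] = 1.0
    return v

# Step 3: the compression matrix. Random here; steps 3 and 5 would
# co-learn it with the context predictor.
L = rng.normal(0, 1 / np.sqrt(N_PATTERNS), size=(DENSE_DIM, N_PATTERNS))

def dense_embedding(pattern_ids):
    """Step 4's input: dense vector for a node."""
    return L @ sparse_pattern_vector(pattern_ids)

def context_score(node_patterns, context_pattern_sets):
    """Skip-gram-style scoring: how well does a candidate node fit the
    mean dense embedding of the surrounding inference-tree nodes?"""
    ctx = np.mean([dense_embedding(p) for p in context_pattern_sets], axis=0)
    return float(dense_embedding(node_patterns) @ ctx)

# Step 6 in miniature: a candidate step sharing patterns with its
# (successful-inference) context should score higher than a disjoint one.
ctx = [{1, 2, 3}, {2, 3, 4}]
good = context_score({2, 3}, ctx)
bad = context_score({400, 401}, ctx)
```

Even an untrained random projection preserves enough overlap structure for the toy comparison to come out the right way; the point of steps 4-5 is to learn `L` so that the score tracks inference quality rather than raw pattern overlap.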
This last step 1' makes the whole thing additionally recursive, because the probabilistic inference used there is guided based on steps 1-7. And so, to use my favorite Aussie expression, Bob's your uncle! Singularity awakens! ... I'll post a full implementation in a couple hours [urrggghh ... I wish...]

-- Ben

---------- Forwarded message ----------
From: Ben Goertzel <[email protected]>
Date: Thu, May 11, 2017 at 5:40 PM
Subject: Re: [Link Grammar] Re: Word similarity database report
To: link-grammar <[email protected]>
Cc: opencog <[email protected]>, Ruiting Lian <[email protected]>

On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <[email protected]> wrote:
> There are two hard parts to clustering. One is writing all the code to get
> the clusters working in the pipeline. I guess I'll have to do that. The
> other is dealing with words with multiple meanings: "I saw the man with the
> saw" and clustering really needs to distinguish saw the verb from saw the
> noun. Not yet clear about the details of this. i've a glimmer of the
> general idea,

I was thinking to explore addressing this with (fairly shallow) neural networks. This paper, which I've pointed out before, does unsupervised construction of word2vec-type vectors for word senses (thus doing sense disambiguation sorta mixed up with the dimension-reduction process):

https://nlp.stanford.edu/pubs/HuangACL12.pdf

Now, that algorithm takes sentences as inputs, not parse trees. But I think you could modify the approach to apply to our context, in an interesting way. The following describes one way to do this; I'm sure there are others.

1) A first step would be to use the OpenCog pattern miner to mine the surprising patterns from the set of parse trees produced by MST parsing.

2) Then, one could associate with each word-instance W a set of instance-pattern-vectors.
Each instance vector is very sparse, containing an entry for each of the surprising patterns found in step 1 that W is involved in. Given these instance-pattern-vectors, one can also calculate word-pattern-vectors or word-sense-pattern-vectors (by averaging the instance-vectors over all instances of the word or word-sense).

3) Their algorithm involves an embedding matrix L that maps a binary vector, with a 1 in position i representing the i'th word in the dictionary, into a much smaller dense vector. I would suggest instead having an embedding matrix L that maps the pattern-vectors representing words or senses (constructed in step 2) into a much smaller dense vector. This is word2vec-ish, but the data it's drawing on is the set of patterns observed in a corpus of parse trees.

4) Their algorithm involves, in the local score function, a sequence [x1, ..., xm], where xi is the embedding vector assigned to word i in the sequence being looked at. Instead, we could use a structure like the following, where w is the word being predicted and S is the sentence containing w:

[ avg. embedding vector of words one link to the left of w in the parse tree of S,
  avg. embedding vector of words one link to the right of w in the parse tree of S,
  avg. embedding vector of words two links to the left of w in the parse tree of S,
  avg. embedding vector of words two links to the right of w in the parse tree of S ]

This context-matrix is a way of capturing "the embedding vectors of the words constituting the context of w in parsed sentence S" as a linear vector. Stopping at "two links away" is arbitrary; probably we want to go 4-5 links away (yielding a vector of length 8-10). This would have to be experimented with.

Given these changes, one could apply the algorithm in the paper for sense disambiguation and clustering. Of course, there would also be a lot of other ways to mix up the same ingredients mentioned in the above.
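The context-matrix of step 4 can be sketched concretely: treat the parse as an undirected graph of links between word positions, measure link-distance by BFS, and average the embeddings of words exactly k links away, split by whether they sit left or right of w in the sentence. The function names and the toy chain-shaped parse are made up for illustration.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

def link_distances(links, n_words, start):
    """BFS over the parse's links: link-distance from `start` to every
    word position reachable from it."""
    adj = {i: [] for i in range(n_words)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def context_matrix(links, embeddings, w, max_links=2):
    """Rows, in order: avg embedding of words k links from w lying to the
    left of w in the sentence, then to the right, for k = 1..max_links.
    Empty groups become zero rows."""
    n, d = embeddings.shape
    dist = link_distances(links, n, w)
    rows = []
    for k in range(1, max_links + 1):
        for keep in (lambda i: i < w, lambda i: i > w):
            group = [embeddings[i] for i in range(n) if keep(i) and dist.get(i) == k]
            rows.append(np.mean(group, axis=0) if group else np.zeros(d))
    return np.stack(rows)  # shape: (2 * max_links, embedding dim)

# Toy example: a 5-word sentence whose parse links form a chain, with
# random 8-dim word embeddings standing in for the learned ones.
links = [(0, 1), (1, 2), (2, 3), (3, 4)]
emb = rng.normal(size=(5, 8))
M = context_matrix(links, emb, w=2)
```

Raising `max_links` to 4 or 5 gives the 8-10-row version suggested above; flattening `M` yields the "linear vector" form used in the local score function.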
The two unique ingredients I have introduced are:

* creating dense vectors for words or senses from pattern-vectors

* creating context-matrices partly capturing the context of a word-instance (or word or sense) based on a corpus of parse trees

...and one could play with these in many different ways. To put it more precisely, there are a lot of ways that one could iteratively:

-- cluster word-instances based on their context-matrices (thus generating word labels)

-- learn an embedding matrix (starting from pattern-vectors) that enables accurate skip-gram prediction based on knowing the labels of the words produced by the clustering done in the preceding step

Mimicking the algorithm from the above paper (with the changes I've suggested) is one way to do this, but there are lots of other ways one could try.

-- Ben

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the boundary, I am the peak." -- Alexander Scriabin
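[Editorial sketch.] The iterative cluster-then-re-embed loop described above can be outlined in a few dozen lines: embed instances from pattern-vectors, build a context representation per instance, cluster to get sense labels, then refit the embedding matrix so same-labeled instances share a vector. Everything here is a hypothetical stand-in: a tiny k-means replaces the paper's clustering, a least-squares refit replaces skip-gram training, and the neighbor-based context function replaces real context-matrices.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=20):
    """Tiny k-means: cluster labels for the rows of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def iterate(pattern_vecs, context_fn, dense_dim=8, k=3, rounds=3):
    """Alternate: (a) embed instances from their pattern-vectors,
    (b) build per-instance context representations, (c) cluster them into
    sense labels, (d) refit the embedding matrix L so that instances
    sharing a label share a (mean) embedding."""
    n, p = pattern_vecs.shape
    L = rng.normal(0, 1 / np.sqrt(p), size=(dense_dim, p))
    for _ in range(rounds):
        emb = pattern_vecs @ L.T                                # (a)
        ctx = np.stack([context_fn(emb, i) for i in range(n)])  # (b)
        labels = kmeans(ctx.reshape(n, -1), k)                  # (c)
        for j in set(labels.tolist()):                          # (d)
            emb[labels == j] = emb[labels == j].mean(axis=0)
        W, *_ = np.linalg.lstsq(pattern_vecs, emb, rcond=None)  # refit L
        L = W.T
    return labels

# Toy run: 12 word-instances, 40 mined patterns, context = the two
# sentence-neighbors' embeddings (a crude stand-in for context-matrices).
pv = (rng.random((12, 40)) < 0.1).astype(float)
neighbor_ctx = lambda e, i: np.stack([e[max(i - 1, 0)], e[min(i + 1, len(e) - 1)]])
labels = iterate(pv, neighbor_ctx)
```

In a real version, step (d) would be the skip-gram objective from the Huang et al. paper rather than a least-squares fit, but the alternation structure is the same.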
