Correction: swap "covariant" and "contravariant" in my description of the
tensor rank split r = u + v below.

On 4/22/17, Jesús López <jesus.lopez.salva...@gmail.com> wrote:
> Hi again, just wanted to drop a pair of thoughts.
>
> What I'm talking about is more of a conceptual exploration, categorically
> and linguistically motivated, while Ben's talk is more neural and
> hands-on. What would be nice is connecting the threads.
>
> Previously Ben said:
>> The semiring could also be a non-Boolean algebra of relations on
>> graphs or hypergraphs
>
> That would require substituting the numbers in the word2vec vectors
> (and Coecke tensors!) with whole relations (relations on hypergraphs
> are much fatter than mere numbers), which I'm not sure you'd even want.
> I don't remember seeing this before. For good or bad, last week
> arXiv:1704.05725 appeared for the categorical quantum mechanics
> setting, where they seem to be doing just that sort of thing,
> replacing the field of complex numbers with an arbitrary C*-algebra. If
> you can think of your algebra of relations as a C*-algebra, that would
> push the idea somewhat further, though I don't really know how far it
> goes semantically, not to speak of learning parameters. One would also
> need the glue to apply that paper's idea to the quantum flavor of
> Coecke semantics.
>
> I can't help on the GAN stuff because I haven't done my homework on
> that. However, I would also look at what Socher did in 2013. Typical
> neural nets are flat many-layer sandwiches of rectangles of weights
> (linear maps), each with a vector of nonlinearities stacked on top, and
> so on. Socher introduced/used *tensor* neural nets, where he used a
> *cube* for a *bi*-linear transformation followed by a nonlinearity. His
> units transform pairs of vectors into single vectors, and his network
> topology is a binary tree (instead of the linear stacking of layers of
> a classical NN). If you have a fragment of English generated by a CFG,
> the parse tree (a true tree) can be binarized [1], and each node would
> be a Socher net unit, with the leaves being distributional (word2vec)
> vectors.
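>
> A minimal NumPy sketch of such a unit (the dimensions, the tanh
> nonlinearity and the simplified cube shape are placeholders; Socher's
> actual formulation applies the tensor to the concatenated children):
>
>     import numpy as np
>
>     def socher_unit(a, b, V, W, bias):
>         # a, b : child vectors of dimension d
>         # V    : "cube" of shape (d, d, d), one d x d slice per output unit
>         # W, bias : the extra classical NN stage mentioned further below
>         bilinear = np.einsum('i,kij,j->k', a, V, b)   # a^T V[k] b for each k
>         linear = W @ np.concatenate([a, b]) + bias
>         return np.tanh(bilinear + linear)
>
>     d = 8
>     rng = np.random.default_rng(0)
>     parent = socher_unit(rng.random(d), rng.random(d),
>                          rng.random((d, d, d)), rng.random((d, 2 * d)),
>                          rng.random(d))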
>
> The difference between this and Coecke is that in the latter there is
> no binarization (instead, multilinear, general tensors), and the net is
> not a tree but a DAG. More importantly, of course, in Socher there are
> extra nonlinear toppings on the nodes and an actual learning algorithm,
> something left more or less for the future in Coecke's view, despite
> some efforts. So basically, if you put a nonlinear topping or hat on
> each of the nodes of what I was calling a tensor network, you should
> arrive at a neural tensor net. Just split the rank r of the tensor as
> r = u + v, with u the number of contravariant (input) indices and v the
> number of covariant (output) indices. Then each node tensor takes u
> *vectors* as inputs (2 in Socher) and produces v output vectors. One
> needs an analogue of the element-wise nonlinearity in this context, but
> I don't know which. As the topology can include "diamond" paths, one
> needs a suitable learning method. I've read about what's called
> backpropagation through structure in tensor neural net papers.
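>
> To make the v = 1 case concrete, here is a sketch of one such node in
> NumPy, contracting u input vectors against a rank-(u + 1) tensor and
> then applying an elementwise nonlinearity (tanh is only a stand-in,
> since as said the right analogue is unclear):
>
>     import numpy as np
>
>     def tensor_node(T, inputs):
>         # T has u + 1 indices: one per input vector, plus one output index.
>         out = T
>         for x in inputs:
>             out = np.tensordot(out, x, axes=([0], [0]))
>         return np.tanh(out)   # placeholder elementwise nonlinearity
>
>     d, u = 8, 3
>     rng = np.random.default_rng(0)
>     y = tensor_node(rng.random((d,) * (u + 1)),
>                     [rng.random(d) for _ in range(u)])
>     print(y.shape)   # (8,)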
>
> Another technical difference, to be accurate, is that Socher added an
> extra additive contribution to the output of his bilinearly-flavored
> units via an extra classical NN stage.
>
> All of the above applies if one has a serious interest in the Coecke
> approach to semantics.
>
> Note that while Coecke's theory is very pleasant categorically, the
> nonlinear toppings have not received any attention from category
> theorists that I know of.
>
> On the purely categorical side of understanding this same problem, and
> forgetting parameter learning for a moment, I had a little realization
> to share. I talked about categories resulting from several monads as
> *targets* of the Coecke semantic functor. Later I remembered that the
> source also has a monad flavour. Sequences of things can be understood
> through the list monad, from the viewpoint of functional programming,
> or the free monoid monad of the purists. One can thus see sentences as
> sequences of words (lexical entities) given by a specific monad. So we
> have monad flavour in both the source and the target of the semantic
> functor, which prompts questions about the character of the functor
> itself.
>
> Those thoughts put me in the functional programmer's mindset, and I
> remembered an old paper by Wadler, which was about understanding
> recursive descent parsers for domain-specific languages given by a
> context-free grammar by monadic means (in functional programming,
> using Moggi's ideas on computing with monads). The topic is called
> monadic parsing, and it is aimed at developers. Interestingly, this
> viewpoint is permeating into linguistics as well, as demonstrated by
> "Monads for natural language semantics" (Shan), which talks of
> semantics as a monad transformer. We are at a point where there is even
> a section called "The CCG monad" in the book with ISBN 9783110251708.
>
> I don't know of work reconciling the monadic viewpoint with Coecke
> stuff, but it is intriguing.
>
> Regards, Jesús.
>
>
> [1] http://images.slideplayer.com/15/4559376/slides/slide_39.jpg
>
>
>
>
> On 4/13/17, Ben Goertzel <b...@goertzel.org> wrote:
>> OK, let me try to rephrase this more clearly...
>>
>> What I am thinking is --
>>
>> In the GAN, the generative network takes in some random noise
>> variables, and outputs a distribution over (link type, word) pairs
>> [or in the plain-vanilla version without dependency parses, it would
>> merely be over words].
>>
>> The GAN would then be generating "statistical contexts" (corresponding to
>> words)
>>
>> The adversarial (discriminator) network is trying to tell the real
>> contexts from the randomly generated fake contexts...
>>
>> The InfoGAN variation would mean the GAN has some latent noise
>> variables that indicate key features of real word contexts.....
>> Presumably these would give a multidimensional parametrization of the
>> scope of word contexts, and hence the scope of words-in-context (i.e.
>> word meanings)
>>
>> So the architecture is nothing like word2vec, but the result is a
>> vector for each word: the vector being the settings of the latent
>> variables of the GAN network that generate the context for that
>> word...
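>>
>> A rough sketch in PyTorch of the kind of architecture I have in mind
>> (the layer sizes and names are made up, and the GAN and
>> mutual-information training losses are omitted):
>>
>>     import torch.nn as nn
>>
>>     VOCAB, NOISE, LATENT = 10000, 64, 32   # assumed sizes
>>
>>     # Generator: noise + latent code -> distribution over context words
>>     # (or over (link type, word) pairs in the dependency-parse version).
>>     generator = nn.Sequential(
>>         nn.Linear(NOISE + LATENT, 256), nn.ReLU(),
>>         nn.Linear(256, VOCAB), nn.Softmax(dim=-1))
>>
>>     # Discriminator: a context -> real/fake score.
>>     discriminator = nn.Sequential(
>>         nn.Linear(VOCAB, 256), nn.ReLU(), nn.Linear(256, 1))
>>
>>     # InfoGAN "Q" head: recover the latent code from generated contexts;
>>     # the recovered codes are the per-word vectors described above.
>>     q_head = nn.Sequential(
>>         nn.Linear(VOCAB, 256), nn.ReLU(), nn.Linear(256, LATENT))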
>>
>> This may still be fuzzy but hopefully is more clearly in a meaningful
>> direction...
>>
>> This is "just" to find a maximally nice way to fill in the
>> clustering-ish step in our unsupervised grammar induction algorithm...
>>
>> ben
>>
>> On Wed, Apr 12, 2017 at 6:50 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>> Having thought a little more... I'll need to think more about what's
>>> the right network architecture to handle the inputs for applying the
>>> InfoGAN methodology to this case...
>>>
>>> On Wed, Apr 12, 2017 at 4:46 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>>> Speculating a little further on this...
>>>>
>>>> In word2vec one trains a neural network to do the following. Given a
>>>> specific word in the middle of a sentence (the input word), one looks
>>>> at the words nearby and picks one at random. The network is going to
>>>> tell us the probability -- for every word in our vocabulary -- of that
>>>> word being the “nearby word” that we chose.
>>>>
>>>> Suppose we try to use word2vec on a vocabulary of 10K words and try to
>>>> project the words into vectors of 300 features.
>>>>
>>>> Then the input layer has 10K neurons (one per word), only one of which
>>>> is active at a time; the hidden layer has 300 neurons, and the output
>>>> layer has 10K neurons... the vector for a word is then given by the
>>>> weights to the hidden layer from that word...
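>>>>
>>>> Roughly, in NumPy (forward pass only, the training loop is omitted):
>>>>
>>>>     import numpy as np
>>>>
>>>>     VOCAB, HIDDEN = 10_000, 300
>>>>     W_in = np.random.randn(VOCAB, HIDDEN) * 0.01    # input -> hidden
>>>>     W_out = np.random.randn(HIDDEN, VOCAB) * 0.01   # hidden -> output
>>>>
>>>>     def context_probs(word_id):
>>>>         h = W_in[word_id]              # one-hot input just selects a row
>>>>         scores = h @ W_out
>>>>         e = np.exp(scores - scores.max())
>>>>         return e / e.sum()             # softmax over the vocabulary
>>>>
>>>>     def word_vector(word_id):          # the 300-dim vector for a word
>>>>         return W_in[word_id]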
>>>>
>>>> (see
>>>> http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
>>>> for simple overview...)
>>>>
>>>> This is cool but not necessarily the best way to do this sort of thing,
>>>> right?
>>>>
>>>> An alternate approach in the spirit of InfoGAN would be to try to
>>>> learn a "generative" network that, given an input word W, outputs the
>>>> distribution of words surrounding W ....   There would also be an
>>>> "adversarial" network that would try to distinguish the distributions
>>>> produced by the generative network, from the distribution produced
>>>> from the actual word....  The generative network could have some
>>>> latent variables that are supposed to be informationally correlated
>>>> with the distribution produced...
>>>>
>>>> One would then expect/hope that the latent variables of the generative
>>>> model would correspond to relevant linguistic features... so one would
>>>> get shorter and more interesting vectors than word2vec gives...
>>>>
>>>> Suppose that in such a network, for "words surrounding W", one used
>>>> "words linked to W in a dependency parse"....  Then the latent
>>>> variables of the generative model mentioned above should be the
>>>> relevant syntactico-semantic aspects of the syntactic relationships
>>>> that W displays in the dependency parse....
>>>>
>>>> Clustering on these vectors of latent variables should give very nice
>>>> clusters which can then be used to define new variables ("parts of
>>>> speech") for the next round of dependency parsing in our language
>>>> learning algorithm...
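>>>>
>>>> The clustering step itself would then be routine, e.g. with
>>>> scikit-learn (the array shapes and the number of clusters here are
>>>> just placeholders):
>>>>
>>>>     import numpy as np
>>>>     from sklearn.cluster import KMeans
>>>>
>>>>     # One row per word; columns are the latent variables inferred for
>>>>     # that word's contexts (random stand-ins here).
>>>>     latent_vectors = np.random.rand(10_000, 32)
>>>>
>>>>     kmeans = KMeans(n_clusters=50, random_state=0).fit(latent_vectors)
>>>>     induced_pos = kmeans.labels_   # cluster id per word ("part of speech")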
>>>>
>>>> -- Ben
>>>>
>>>>
>>>> On Sat, Apr 8, 2017 at 2:24 AM, Jesús López
>>>> <jesus.lopez.salva...@gmail.com> wrote:
>>>>> Hello Ben and Linas,
>>>>>
>>>>> Sorry for the delay, I was reading the papers. About additivity: in
>>>>> Coecke et al.'s program you turn a sentence into a *multilinear* map
>>>>> that goes from the vectors of the words having an elementary syntactic
>>>>> category to a semantic vector space, the sentence meaning space. So
>>>>> yes, there is additivity in each of these arguments (which, by the
>>>>> way, should have a consequence for those beautiful word2vec relations
>>>>> like France - Paris ~= Spain - Madrid, though I haven't seen a
>>>>> description).
>>>>>
>>>>> As I understand it, your goal is to go from plain text to logical
>>>>> forms in a probabilistic logic, and you have two stages: parsing from
>>>>> plain text to a pregroup grammar parse structure (I'm not sure the
>>>>> parse trees I spoke of before are really trees, hence the change to
>>>>> 'parse structure'), and then going from that parse structure (via
>>>>> RelEx and RelEx2Logic, if that's right) to a lambda calculus term
>>>>> bearing the meaning, with a kind of probability and another number
>>>>> attached extrinsically.
>>>>>
>>>>> How does Coecke's program (and from now on that unfairly includes all
>>>>> the et als.) fit into that picture? I think the key observation is
>>>>> Coecke's remark that his framework can be interpreted, as a particular
>>>>> case, as Montague semantics. Though adorned with linguistic
>>>>> considerations, this semantics is well known to be amenable to
>>>>> computation, and a toy version is shown in chapter 10 of the NLTK
>>>>> book, where they show how the lambda calculus represents a logic that
>>>>> has a model theory. That is important, because all those lambda terms
>>>>> have to be actual functions with actual values.
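>>>>>
>>>>> For instance, in the spirit of NLTK ch. 10, a toy model where the
>>>>> lexical lambda terms really are functions with values (the entities
>>>>> and predicates below are invented):
>>>>>
>>>>>     entities = {'john', 'mary', 'fido'}   # the domain of the model
>>>>>     dog = {'fido'}                        # one-place predicates as sets
>>>>>     barks = {'fido', 'john'}
>>>>>
>>>>>     # Determiners as actual higher-order functions.
>>>>>     every = lambda restr: lambda scope: all(x in scope for x in restr)
>>>>>     some = lambda restr: lambda scope: any(x in scope for x in restr)
>>>>>
>>>>>     print(every(dog)(barks))   # True: every dog barks in this model
>>>>>     print(some(dog)(barks))    # True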
>>>>>
>>>>> How exactly does Coecke's framework reduce to Montague semantics?
>>>>> That matters, because if we understand how Montague semantics is a
>>>>> particular case of Coecke's, we can think in the opposite direction
>>>>> and see Coecke's semantics as an extension.
>>>>>
>>>>> As a starting point, we have the fact that Coecke semantics can be
>>>>> summarized as a monoidal functor that sends a morphism of a compact
>>>>> closed category in syntax-land (the pregroup grammar parse structure,
>>>>> resulting from parsing the plain text of a sentence) to a morphism in
>>>>> a compact closed category in semantics-land, the category of real
>>>>> vector spaces, that morphism being a (multi)linear map.
>>>>>
>>>>> The definition of the Coecke semantic functor, however, hardly needs
>>>>> any modification if we use as the target the compact closed category
>>>>> of modules over a fixed semiring. If the semiring is that of booleans,
>>>>> we are talking about the category of relations between sets, with the
>>>>> Peirce relational product (uncle = brother * father) expressed by the
>>>>> same matrix product formula as in linear algebra, and with the
>>>>> cartesian product as the tensor product that makes it monoidal.
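>>>>>
>>>>> A tiny NumPy illustration of that boolean case (the people and the
>>>>> facts are made up):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     people = ['bob', 'tom', 'dave']
>>>>>     brother = np.zeros((3, 3), dtype=bool)
>>>>>     father = np.zeros((3, 3), dtype=bool)
>>>>>     brother[0, 1] = True    # bob is a brother of tom
>>>>>     father[1, 2] = True     # tom is the father of dave
>>>>>
>>>>>     # Peirce relational product via the ordinary matrix product,
>>>>>     # read over the boolean semiring (or = +, and = *):
>>>>>     uncle = (brother.astype(int) @ father.astype(int)).astype(bool)
>>>>>
>>>>>     print(uncle[0, 2])      # True: bob is an uncle of dave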
>>>>>
>>>>> The idea is that when the Coecke semantic functor has the category of
>>>>> relations as its codomain, one obtains Montague semantics. More
>>>>> exactly, when one applies the semantic functor to a pregroup grammar
>>>>> parse structure of a sentence, one obtains the lambda term that
>>>>> Montague would have attached to it. Naturally the question is how
>>>>> exactly to unfold that abstract notion. The folk joke about 'abstract
>>>>> nonsense' forgets that there is a down button in the elevator.
>>>>>
>>>>> Well, this would be lengthy here, but the way I started to come to
>>>>> grips with it is by bringing the CCG linguistic formalism into the
>>>>> equation. A fast and good slide show of how one goes from plain text
>>>>> to CCG derivations, and then from derivations to classic
>>>>> Montague-semantics lambda terms, can be found in [1].
>>>>>
>>>>> One important feature of CCG is that it is lexicalized, i.e., all the
>>>>> linguistic data necessary to do both syntactic and semantic parsing is
>>>>> attached to the words of the dictionary, in contrast with, say, NLTK
>>>>> book ch. 10, where the linguistic data is inside the production rules
>>>>> of an explicit grammar.
>>>>>
>>>>> Looking more closely at the lexicon (dictionary), each word is
>>>>> supplemented with its syntactic category (N/N...) and also with a
>>>>> lambda term, compatible with the syntactic category, that is used in
>>>>> semantic parsing. Those lambda terms are not magical letters: for the
>>>>> lambda terms to have a true model-theoretic semantics they must
>>>>> correspond to specific functions.
>>>>>
>>>>> The good thing is that the work of porting Coecke semantics to CCG
>>>>> (instead of pregroup grammar) has already been done, in [2]. The
>>>>> details are there, but the thing I want to highlight is that in this
>>>>> case, when one is doing Coecke semantics with CCG parsing, the
>>>>> structure of the lexicon changes. One retains the words and their
>>>>> associated syntactic categories. But now, instead of the lambda terms
>>>>> (with their corresponding interpretation as actual
>>>>> relations/functions), one has vectors and tensors for simple and
>>>>> compound syntactic categories (say N vs N/N), respectively. When those
>>>>> tensors/vectors are of booleans one recovers Montague semantics.
>>>>>
>>>>> In the general Coecke case, sentences mean vectors in a real vector
>>>>> space, and the benefits start with its inner product, and hence norm
>>>>> and metric, so you can measure sentence similarity quantitatively
>>>>> (with suitably normalized vectors...).
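>>>>>
>>>>> A toy NumPy version of that comparison (the noun vectors and the
>>>>> adjective matrix are random stand-ins for learned ones):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     d = 4
>>>>>     rng = np.random.default_rng(0)
>>>>>     cat, dog = rng.random(d), rng.random(d)   # noun vectors (category N)
>>>>>     black = rng.random((d, d))                # adjective (category N/N)
>>>>>
>>>>>     def cosine(u, v):
>>>>>         return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
>>>>>
>>>>>     # Compose, then compare in the meaning space.
>>>>>     print(cosine(black @ cat, black @ dog))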
>>>>>
>>>>> CCG is very nice in practical terms. An open SOTA parser
>>>>> implementation is [3], described in [4], to be compared with [5] ("The
>>>>> parser finds the optimal parse for 99.9% of held-out sentences").
>>>>> OpenCCG is older but does both parsing and generation.
>>>>>
>>>>> One thing that I don't understand well about the above is that the
>>>>> category of vector spaces over a fixed field (or even just the finite
>>>>> dimensional ones) is *not* cartesian closed. While in the presentation
>>>>> of Montague semantics in NLTK book ch. 10 the lambda calculus appears
>>>>> to be untyped, more faithful presentations seem to require a (simply)
>>>>> typed or even more complex calculus/logic. In that case the semantic
>>>>> category would perhaps have to be cartesian closed, supporting in
>>>>> particular higher-order maps.
>>>>>
>>>>> That's all in the expository front and now some speculation.
>>>>>
>>>>> Up to now the only tangible enhancement brought by Coecke semantics
>>>>> is the motivation of a metric on sentence meanings. What we really
>>>>> want is a mathematical motivation to probabilize the crisp, hard-facts
>>>>> character of the interpretation of sentences as Montague lambda terms.
>>>>> How to attack the problem?
>>>>>
>>>>> One idea is to experiment with other kinds of semantic category as
>>>>> the target of the Coecke semantic functor. To be terse, this can be
>>>>> explored by means of a monad on a vanilla, unstructured base category
>>>>> such as finite sets. One can make several choices of endofunctor to
>>>>> specify the corresponding monad; the proposed semantic category is
>>>>> then its Kleisli category. These categories are monoidal and have a
>>>>> revealing diagrammatic notation.
>>>>>
>>>>> 1.- Powerset endofunctor. This gives rise to the category of sets and
>>>>> relations, with the cartesian product as monoidal operation. Coecke
>>>>> semantics results in Montagovian hard facts, as described above.
>>>>> Coecke and Kissinger's new book [6] details the particulars of the
>>>>> diagrammatic language.
>>>>> 2.- Vector space monad (over the reals). Since the sets are finite,
>>>>> the Kleisli category is that of finite-dimensional real vector spaces.
>>>>> That is properly Coecke's framework for computing sentence similarity.
>>>>> Circuit diagrams are tensor networks, where boxes are tensors and
>>>>> wires are contractions of specific indices.
>>>>> 3.- A monad in quantum computing is shown in [7], and quantumly
>>>>> motivated semantics is specifically addressed by Coecke. The whole
>>>>> book [8] discusses the connection, though I haven't read it. Circuit
>>>>> diagrams should be quantum circuits representing possibly unitary
>>>>> processes. Quantum amplitudes give rise to classical probabilities
>>>>> through measurement.
>>>>> 4.- The Giry monad, which here results from the functor that produces
>>>>> all formal convex linear combinations of the elements of a given set.
>>>>> The Kleisli category is very interesting, having as maps probabilistic
>>>>> mappings that under the hood are just conditional probabilities. These
>>>>> maps allow a more user-friendly understanding of Markov chains, Markov
>>>>> decision processes, HMMs, POMDPs, naive Bayes classifiers and Kalman
>>>>> filters. Circuit diagrams should correspond to the factor diagram
>>>>> notation of Bayesian networks [9], and the law of total probability
>>>>> generalizes in Bayesian networks to the linear-algebra tensor-network
>>>>> calculations of the corresponding network (this can be checked in
>>>>> actual Bayesian network software; see the sketch after this list).
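>>>>>
>>>>> Concretely, for bullet 4, Kleisli composition of two such maps is
>>>>> just the matrix product of their conditional probability tables,
>>>>> which is exactly the law of total probability (the numbers below are
>>>>> invented):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     # P(B|A) and P(C|B) as row-stochastic matrices (rows sum to 1).
>>>>>     p_b_given_a = np.array([[0.9, 0.1],
>>>>>                             [0.2, 0.8]])
>>>>>     p_c_given_b = np.array([[0.7, 0.3],
>>>>>                             [0.4, 0.6]])
>>>>>
>>>>>     # Kleisli composition: P(C|A) = sum_b P(C|b) P(b|A).
>>>>>     p_c_given_a = p_b_given_a @ p_c_given_b
>>>>>     print(p_c_given_a)                 # still row-stochastic
>>>>>
>>>>>     p_a = np.array([0.5, 0.5])         # a prior on A
>>>>>     print(p_a @ p_c_given_a)           # marginal distribution of C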
>>>>>
>>>>> A quote from mathematician Gian Carlo Rota [10]:
>>>>>
>>>>> "The first lecture by Jack [Schwartz] I listened to was given in the
>>>>> spring of 1954 in a seminar in functional analysis. A brilliant array
>>>>> of lecturers had been expounding throughout the spring term on their
>>>>> pet topics. Jack's lecture dealt with stochastic processes.
>>>>> Probability was still a mysterious subject cultivated by a few
>>>>> scattered mathematicians, and the expression "Markov chain" conveyed
>>>>> more than a hint of mystery. Jack started his lecture with the words,
>>>>> "A Markov chain is a generalization of a function." His perfect
>>>>> motivation of the Markov property put the audience at ease. Graduate
>>>>> students and instructors relaxed and followed his every word to the
>>>>> end."
>>>>>
>>>>> The thing I would research would be to use as the semantic category
>>>>> that of the generalized functions of the quote above and bullet 4. So,
>>>>> basically, you replace word2vec vectors by probability distributions
>>>>> over the words meaning something, build a Bayesian network from the
>>>>> CCG parse, and apply generalized total probability to obtain
>>>>> probabilized booleans, i.e. a number 0 <= x <= 1 (instead of just a
>>>>> boolean, as with Montague semantics). That is, the probability that a
>>>>> sentence holds depends on the distributions of its syntactically
>>>>> elementary constituents meaning something, and those distributions are
>>>>> combined by the factors of a Bayesian net whose conditional
>>>>> independence relations respect and reflect the sentence syntax and
>>>>> have the local Markov property. The factors belong to words of complex
>>>>> syntactic category (such as N/N...), and their attached tensors are
>>>>> multivariate conditional probability distributions.
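>>>>>
>>>>> To fix ideas, a toy computation of that kind (all the distributions
>>>>> and the factor below are invented, and the "parse" is just adjective +
>>>>> noun + verb glued by hand):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     # P(meaning of "car" = m), over two candidate meanings.
>>>>>     p_car = np.array([0.7, 0.3])
>>>>>     # Factor for the N/N word "old": P(meaning of "old car" = n | m).
>>>>>     p_old_given_car = np.array([[0.8, 0.2],
>>>>>                                 [0.1, 0.9]])
>>>>>     # P(the sentence "the old car runs" holds | n).
>>>>>     p_runs = np.array([0.6, 0.2])
>>>>>
>>>>>     # Generalized total probability: sum out the latent meanings.
>>>>>     print(p_car @ p_old_given_car @ p_runs)   # a number in [0, 1]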
>>>>>
>>>>> Hope this helps somehow. Kind regards,
>>>>> Jesus.
>>>>>
>>>>>
>>>>> [1] http://yoavartzi.com/pub/afz-tutorial.acl.2013.pdf
>>>>> [2] http://www.cl.cam.ac.uk/~sc609/pubs/eacl14types.pdf
>>>>> [3] http://homepages.inf.ed.ac.uk/s1049478/easyccg.html
>>>>> [4] http://www.aclweb.org/anthology/D14-1107
>>>>> [5] https://arxiv.org/abs/1607.01432
>>>>> [6] ISBN 1108107710
>>>>> [7] https://bram.westerbaan.name/kleisli.pdf
>>>>> [8] ISBN 9780199646296
>>>>> [9] http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10799.pdf
>>>>> [10] Indiscrete thoughts
>>>>>
>>>>> On 4/2/17, Linas Vepstas <linasveps...@gmail.com> wrote:
>>>>>> Hi Ben,
>>>>>>
>>>>>> On Sun, Apr 2, 2017 at 3:16 PM, Ben Goertzel <b...@goertzel.org>
>>>>>> wrote:
>>>>>>
>>>>>>>  So e.g. if we find X+Y is roughly equal to Z in the domain
>>>>>>> of semantic vectors,
>>>>>>>
>>>>>>
>>>>>> But what Jesus is saying (and what we say in our paper, with all
>>>>>> that fiddle-faddle about categories) is precisely that while the
>>>>>> concept of addition is kind-of-ish OK for meanings, it can be even
>>>>>> better if replaced with the correct categorial generalization.
>>>>>>
>>>>>> That is, addition -- the plus sign -- is a certain specific
>>>>>> morphism, and this morphism, the addition of vectors, has the
>>>>>> unfortunate property of being commutative, whereas we know that
>>>>>> language is non-commutative. The stuff about pregroup grammars is all
>>>>>> about identifying exactly which morphism it is that correctly
>>>>>> generalizes the addition morphism.
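>>>>>>
>>>>>> A tiny illustration of the point (the word vectors and the verb
>>>>>> tensor are random stand-ins):
>>>>>>
>>>>>>     import numpy as np
>>>>>>
>>>>>>     rng = np.random.default_rng(0)
>>>>>>     dog, man = rng.random(4), rng.random(4)
>>>>>>     bites = rng.random((4, 4, 4))   # subject and object on separate indices
>>>>>>
>>>>>>     # Bag-of-words addition cannot tell the two sentences apart ...
>>>>>>     print(np.allclose(dog + man, man + dog))                    # True
>>>>>>
>>>>>>     # ... but the composed meanings differ.
>>>>>>     s1 = np.einsum('i,ijk,k->j', dog, bites, man)   # "dog bites man"
>>>>>>     s2 = np.einsum('i,ijk,k->j', man, bites, dog)   # "man bites dog"
>>>>>>     print(np.allclose(s1, s2))                                  # False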
>>>>>>
>>>>>> That addition is kind-of OK is why word2vec kind-of works. But I
>>>>>> think we can do better.
>>>>>>
>>>>>> Unfortunately, the pressing needs of having to crunch data, and to
>>>>>> write the code to crunch that data, prevent me from devoting enough
>>>>>> time to this issue for at least a few more weeks or a month. I would
>>>>>> very much like to clarify the theoretical situation here, but need to
>>>>>> find a chunk of time that isn't taken up by email and various mundane
>>>>>> tasks.
>>>>>>
>>>>>> --linas
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ben Goertzel, PhD
>>>> http://goertzel.org
>>>>
>>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>>>> boundary, I am the peak." -- Alexander Scriabin
>>>
>>>
>>>
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>>> boundary, I am the peak." -- Alexander Scriabin
>>
>>
>>
>> --
>> Ben Goertzel, PhD
>> http://goertzel.org
>>
>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>> boundary, I am the peak." -- Alexander Scriabin
>>
>>
>

