Hi again, just wanted to drop a couple of thoughts.

What I'm talking about is more of a conceptual exploration,
categorically and linguistically motivated, while Ben's talk is more
neural and hands-on. What would be nice is to connect the threads.

Previously Ben said:
> The semiring could also be a non-Boolean algebra of relations on
> graphs or hypergraphs

That would require substituting whole relations for the numbers in the
word2vec vectors (and Coecke tensors!) -- relations on hypergraphs are
much fatter than mere numbers -- which I'm not sure you'd even want. I
don't remember seeing this done before. For better or worse,
arXiv:1704.05725 appeared last week, in the categorical quantum
mechanics setting, where they seem to be doing just that sort of
thing: replacing the field of complex numbers with an arbitrary
C*-algebra. If you can think of your algebra of relations as a
C*-algebra, that would push the idea somewhat further, though I don't
really know how far it goes semantically, not to mention learning the
parameters. One would also need the glue to apply the idea of that
paper to the quantum flavour of Coecke semantics.
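
Just to make "numbers replaced by whole relations" concrete, here is a
toy Python sketch of my own (nothing from anybody's codebase): matrix
entries are relations, i.e. sets of pairs, with union as the semiring
addition and relational composition as the semiring multiplication, so
the ordinary matrix-product formula still makes sense.

def rel_add(r, s):
    return r | s

def rel_mul(r, s):
    # Peirce-style composition: (a, c) whenever (a, b) in r and (b, c) in s.
    return frozenset((a, c) for (a, b1) in r for (b2, c) in s if b1 == b2)

ZERO = frozenset()  # the empty relation, additive identity

def dot(row, col):
    acc = ZERO
    for r, s in zip(row, col):
        acc = rel_add(acc, rel_mul(r, s))
    return acc

def mat_mul(A, B):
    # The usual matrix-product formula, entries from the relation semiring.
    return [[dot(A[i], [B[k][j] for k in range(len(B))])
             for j in range(len(B[0]))]
            for i in range(len(A))]

# 1x1 matrices recover plain relational composition, e.g. uncle = brother * parent:
brother = [[frozenset({("bob", "alice")})]]
parent = [[frozenset({("alice", "carol")})]]
print(mat_mul(brother, parent))  # [[frozenset({('bob', 'carol')})]]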

I can't help on the GAN stuff because I haven't done my homework on
that. However, I would also look at what Socher did in 2013. Typical
neural nets are many-layer sandwiches of rectangles of weights (linear
maps), each with a vector of nonlinearities stacked on top, and so on.
Socher introduced/used *tensor* neural nets, where a *cube* of weights
implements a *bi*linear transformation followed by a nonlinearity. His
units transform a pair of vectors into a single vector, and his
network topology is a binary tree (instead of the linear stacking of
layers in a classical NN). If you have a fragment of English generated
by a CFG, the parse tree (a true tree) can be binarized [1], and each
node would then be a Socher-net unit, with the leaves being
distributional (word2vec) vectors.
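
To fix ideas, a minimal numpy sketch of such a unit (the dimension,
the random initialization and the name ntn_unit are mine, purely
illustrative; see the Socher et al. 2013 papers for the real thing):

import numpy as np

d = 4                             # embedding dimension, illustrative
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d, d))    # the weight "cube": one d x d slice per output component
V = rng.normal(size=(d, 2 * d))   # classical linear stage on the concatenated pair
b = np.zeros(d)

def ntn_unit(x, y):
    # Bilinear part: out[k] = x^T W[k] y, plus the usual linear term, then a nonlinearity.
    bilinear = np.einsum('i,kij,j->k', x, W, y)
    linear = V @ np.concatenate([x, y]) + b
    return np.tanh(bilinear + linear)

# Leaves are word vectors; a binarized parse tree is folded bottom-up:
the, cat, sleeps = (rng.normal(size=d) for _ in range(3))
np_phrase = ntn_unit(the, cat)           # "the cat"
sentence = ntn_unit(np_phrase, sleeps)   # "(the cat) sleeps"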

The difference between this and Coecke is that in the latter there is
no binarization (instead, general multilinear tensors), and the net is
not a tree but a DAG. More importantly, of course, in Socher there are
the extra nonlinear toppings on the nodes and an actual learning
algorithm, something left more or less for the future in Coecke's
view, despite some efforts. So basically, if you put a nonlinear
topping or hat on each node of what I was calling a tensor network,
you should arrive at a neural tensor net. Just split the rank r of the
tensor as r = u + v, with u the number of contravariant (input)
indices and v the number of covariant (output) indices. Then each node
tensor takes u *vectors* as inputs (2 in Socher's case) and produces v
output vectors. One needs an analogue of the element-wise nonlinearity
in this context, but I don't know which. Since the topology can
include "diamond" paths, one also needs a suitable learning method;
I've read about what's called backpropagation through structure in the
tensor neural net papers.
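
A naive sketch of such a generalized node, under my own assumptions
(contract the u input indices against the input vectors and put an
element-wise tanh, as a placeholder nonlinearity, on the remaining
rank-v tensor whose free indices feed the outgoing wires):

import numpy as np

def tensor_node(T, inputs):
    # T has rank u + v; contract its first u indices against the u input
    # vectors, leaving a rank-v tensor for the v outgoing wires.
    # tanh is only a stand-in for the unknown "right" nonlinearity.
    out = T
    for x in inputs:
        out = np.tensordot(x, out, axes=([0], [0]))
    return np.tanh(out)

# Example: rank 3 = 2 + 1, i.e. a Socher-like unit minus the extra linear stage.
d = 4
rng = np.random.default_rng(1)
T = rng.normal(size=(d, d, d))
x, y = rng.normal(size=d), rng.normal(size=d)
parent = tensor_node(T, [x, y])   # a single output vector of shape (d,)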

Another technical difference, just to be accurate: in Socher the
bilinearly-flavoured units also receive an extra additive contribution
to their output from a classical NN stage.

All of the above, of course, only applies if one has a serious interest in the Coecke approach to semantics.

Note that while Coecke's theory is very pleasant categorically, the
nonlinear toppings have not received any attention from category
theorists that I know of.

On the purely categorical side of understanding this same problem, and
forgetting parameter learning for a moment, I had a little realization
to share. I talked before about categories arising from several monads
as *targets* of the Coecke semantic functor. Later I remembered that
the source also has a monad flavour. Sequences of things can be
understood through the list monad, from the viewpoint of functional
programming, or the free monoid monad for the purists. One can thus
see sentences as sequences of words (lexical entities) given by a
specific monad. So we have monad flavour in both the source and the
target of the semantics functor, which prompts questions about the
character of the functor itself.
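
Just to pin down what I mean by the list monad here, a toy Python
rendering (the monad laws are what matter, not this particular code):

def unit(w):                 # wrap a word into a one-element sequence
    return [w]

def join(xss):               # flatten a sequence of sequences
    return [x for xs in xss for x in xs]

def bind(xs, f):             # the Kleisli extension: map then flatten
    return join([f(x) for x in xs])

sentence = ["colourless", "green", "ideas", "sleep"]   # a value of the monad
print(bind(sentence, unit) == sentence)                # a monad law holds: True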

Those thoughts put me in the functional programmer mindset, and I
remembered an old paper by Wadler about understanding (in functional
programming, using Moggi's ideas on computing with monads) recursive
descent parsers for domain-specific languages given by a context-free
grammar by monadic means. The topic is called monadic parsing, and it
is aimed at developers. Interestingly, this viewpoint is permeating
into linguistics as well, as demonstrated by "Monads for natural
language semantics" (Shan), which talks of semantics as a monad
transformer. We are at a point where there is even a section called
"The CCG monad" in the book with ISBN 9783110251708.

I don't know of work reconciling the monadic viewpoint with Coecke
stuff, but it is intriguing.

Regards, Jesús.


[1] http://images.slideplayer.com/15/4559376/slides/slide_39.jpg




On 4/13/17, Ben Goertzel <b...@goertzel.org> wrote:
> OK, let me try to rephrase this more clearly...
>
> What I am thinking is --
>
> In the GAN, the generative network takes in some random noise
> variables, and outputs a distribution over (link type, word) pairs
> [or in the plain-vanilla version without dependency parses, it would
> merely be over words]
>
> The GAN would then be generating "statistical contexts" (corresponding to
> words)
>
> The adversarial (discriminator) network is trying to tell the real
> contexts from the randomly generated fake contexts...
>
> The InfoGAN variation would mean the GAN has some latent noise
> variables that indicate key features of real word contexts.....
> Presumably these would give a multidimensional parametrization of the
> scope of word contexts, and hence the scope of words-in-context (i.e.
> word meanings)
>
> So the architecture is nothing like word2vec, but the result is a
> vector for each word: the vector being the settings of the latent
> variables of the GAN network that generate the context for that
> word...
>
> This may still be fuzzy but hopefully is more clearly in a meaningful
> direction...
>
> This is "just" to find a maximally nice way to fill in the
> clustering-ish step in our unsupervised grammar induction algorithm...
>
> ben
>
> On Wed, Apr 12, 2017 at 6:50 PM, Ben Goertzel <b...@goertzel.org> wrote:
>> Having thought a little more... I'll need to think more about what's
>> the right network architecture to handle the inputs for applying the
>> InfoGAN methodology to this case...
>>
>> On Wed, Apr 12, 2017 at 4:46 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>> Speculating a little further on this...
>>>
>>> In word2vec one trains a neural network to do the following. Given a
>>> specific word in the middle of a sentence (the input word), one looks
>>> at the words nearby and picks one at random.  The network is going to
>>> tell us the probability -- for every word in our vocabulary -- of that
>>> word being the “nearby word” that we chose.
>>>
>>> Suppose we try to use word2vec on a vocabulary of 10K words and try to
>>> project the words into vectors of 300 features.
>>>
>>> Then the input layer has 10K neurons (one per word), only one of which
>>> is active at a time; the hidden layer has 300 neurons, and the output
>>> layer has 10K neurons... the vector for a word is then given by the
>>> weights to the hidden layer from that word...
>>>
>>> (see
>>> http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
>>> for simple overview...)
>>>
>>> This is cool but not necessarily the best way to do this sort of thing,
>>> right?
>>>
>>> An alternate approach in the spirit of InfoGAN would be to try to
>>> learn a "generative" network that, given an input word W, outputs the
>>> distribution of words surrounding W ....   There would also be an
>>> "adversarial" network that would try to distinguish the distributions
>>> produced by the generative network, from the distribution produced
>>> from the actual word....  The generative network could have some
>>> latent variables that are supposed to be informationally correlated
>>> with the distribution produced...
>>>
>>> One would then expect/hope that the latent variables of the generative
>>> model would correspond to relevant linguistic features... so one would
>>> get shorter and more interesting vectors than word2vec gives...
>>>
>>> Suppose that in such a network, for "words surrounding W", one used
>>> "words linked to W in a dependency parse"....  Then the latent
>>> variables of the generative model mentioned above, should be the
>>> relevant syntactico-semantic aspects of the syntactic relationships
>>> that W displays in the dependency parse....
>>>
>>> Clustering on these vectors of latent variables should give very nice
>>> clusters which can then be used to define new variables ("parts of
>>> speech") for the next round of dependency parsing in our language
>>> learning algorithm...
>>>
>>> -- Ben
>>>
>>>
>>> On Sat, Apr 8, 2017 at 2:24 AM, Jesús López
>>> <jesus.lopez.salva...@gmail.com> wrote:
>>>> Hello Ben and Linas,
>>>>
>>>> Sorry for the delay, I was reading the papers. About additivity: In
>>>> Coecke's et al. program you turn a sentence into a *multilinear* map
>>>> that goes from the vectors of the words having elementary syntactic
>>>> category to a semantic vector space, the sentence meaning space. So
>>>> yes, there is additivity in each of these arguments (something which,
>>>> by the way, should have a consequence for those beautiful word2vec
>>>> relations of France - Paris ~= Spain - Madrid, though I haven't seen a
>>>> description).
>>>>
>>>> As I understand, your goal is to go from plain text to logical forms
>>>> in a probabilistic logic, and you have two stages, parsing from plain
>>>> text to a pregroup grammar parse structure (I'm not sure that the
>>>> parse trees I spoke of before are really trees, hence the change to
>>>> 'parse structure'), and then you go from that parse structure (via
>>>> RelEx and RelEx2Logic if that's ok) to a lambda calculus term bearing
>>>> the meaning and having attached extrinsically a kind of probability
>>>> and another number.
>>>>
>>>> How does Coecke's program (and from now on that unfairly includes all
>>>> the et als.) fit into that picture? I think the key observation is when
>>>> Coecke says that his framework can be interpreted, as a particular
>>>> case, as Montague semantics. Though adorned with linguistic
>>>> considerations, this semantics is well known to be amenable to computation,
>>>> and a toy version is shown in chapter 10 of the NLTK book, where they
>>>> show how lambda calculus represents a logic that has a model theory.
>>>> That is important because all those lambda terms have to be actual
>>>> functions with actual values.
>>>>
>>>> How exactly does Coecke's framework reduce to Montague semantics?
>>>> That matters, because if we understand how Montague semantics is a
>>>> particular case of Coecke's, we can think in the opposite direction
>>>> and see Coecke's semantics as an extension.
>>>>
>>>> As starting point we have the fact that Coecke semantics can be
>>>> summarized as a monoidal functor that sends a morphism from a compact
>>>> closed category in syntax-land (the pregroup grammar parse structure,
>>>> resulting from parsing the plain text of a sentence) to a morphism in
>>>> a compact closed category in semantics-land, the category of real
>>>> vector spaces, that morphism being a (multi)linear map.
>>>>
>>>> The definition of the Coecke semantic functor, however, hardly needs any
>>>> modification if we use as target the compact closed category of
>>>> modules over a fixed semiring. If the semiring is that of booleans, we
>>>> are talking about the category of relations between sets, with Peirce
>>>> relational product (uncle = brother * father) expressed with the same
>>>> matrix product formula of linear algebra, and with cartesian product
>>>> as the tensor product that makes it monoidal.
>>>>
>>>> The idea is that when Coecke semantic functor has as codomain the
>>>> category of relations, one obtains Montague semantics. More exactly,
>>>> when one applies the semantic functor to a pregroup grammar parse
>>>> structure of a sentence, one obtains the lambda term that Montague
>>>> would have attached to it. Naturally the question is how exactly to
>>>> unfold that abstract notion. The folk joke on 'abstract nonsense'
>>>> forgets that there is a down button in the elevator.
>>>>
>>>> Well, this would get lengthy here, but the way I started to come to
>>>> grips with it was by bringing the CCG linguistic formalism into the equation. A
>>>> fast and good slide show of how one goes from plain text to CCG
>>>> derivations, and from derivations then to classic Montague-semantics
>>>> lambda terms, can be found in [1].
>>>>
>>>> One important feature in CCG is that it is lexicalized, i. e., all the
>>>> linguistic data necessary to do both syntactic and semantic parsing is
>>>> attached to the words of the dictionary, in contrast with, say, NLTK
>>>> book ch. 10, where the linguistic data is inside production rules of
>>>> an explicit grammar.
>>>>
>>>> Looking closer at the lexicon (dictionary), one sees that each word is
>>>> supplemented with its syntactic category (N/N...) and also with a
>>>> lambda term compatible with the syntactic category used in semantic
>>>> parsing. Those lambda terms are not magical letters. For the lambda
>>>> terms to have a true model theoretic semantics they must correspond to
>>>> specific functions.
>>>>
>>>> The good thing is that the work of porting Coecke semantics to CCG
>>>> (instead of pregroup grammar) is already done: in [2]. The details are
>>>> there, but the thing that I want to highlight is that in this case,
>>>> when one is doing Coecke semantics with CCG parsing, the structure of
>>>> the lexicon is changed. One retains the words, and their associated
>>>> syntactic category. But now, instead of the lambda terms (with their
>>>> corresponding interpretation as actual relations/functions), one has
>>>> vectors and tensors for simple and compound syntactic categories (say
>>>> N vs N/N) respectively. When those tensors/vectors are of booleans one
>>>> recovers Montague semantics.
>>>>
>>>> In the Coecke general case, sentences mean vectors in a real vector
>>>> space and the benefits start by using its inner product, and hence
>>>> norm and metric, so you can measure quantitatively sentence similarity
>>>> (rather normalized vectors...).
>>>>
>>>> CCG is very nice in practical terms. An open SOTA parser
>>>> implementation is [3] described in [4], to be compared with [5] ("The
>>>> parser finds the optimal parse for 99.9% of held-out sentences").
>>>> openCCG is older but does parsing and generation.
>>>>
>>>> One thing that I don't understand well with the above stuff is that
>>>> the category of vector spaces over a fixed field (or even the finite
>>>> dimensional ones) is *not* cartesian closed. While in the presentation
>>>> of Montague semantics in NLTK book ch. 10 the lambda calculus appears
>>>> to be untyped, more faithful presentations seem to require (simply)
>>>> typed or even a more complex calculus/logic. In that case the semantic
>>>> category would perhaps have to be cartesian closed, supporting in
>>>> particular higher order maps.
>>>>
>>>> That's all in the expository front and now some speculation.
>>>>
>>>> Up to now the only tangible enhancement brought by Coecke semantics is
>>>> the motivation of a metric among sentence meanings. What we really
>>>> want is a mathematical motivation to probabilize the crisp, hard facts
>>>> character of the interpretation of sentences as Montague lambda terms.
>>>> How to attack the problem?
>>>>
>>>> One idea is to experiment with other kinds of semantic category as
>>>> target of the Coecke semantic functor. To be terse, this can be
>>>> explored by means of a monad on a vanilla unstructured base category
>>>> such as finite sets. One can have several choices of endofunctor to
>>>> specify the corresponding monad. Then the semantic category proposed
>>>> is its Kleisli category. These categories are monoidal and have a
>>>> revealing diagrammatic notation.
>>>>
>>>> 1.- Powerset endofunctor. This gives rise to the category of sets,
>>>> relations and cartesian product as monoidal operation. Coecke
>>>> semantics results in Montagovian hard facts as described above.
>>>> Coecke and Kissinger's new book [6] details the diagrammatic language
>>>> particulars.
>>>> 2.- Vector space monad (over the reals). Since the sets are finite,
>>>> the Kleisli category is that of finite dimensional real vector spaces.
>>>> That is properly Coecke's framework for computing sentence similarity.
>>>> Circuit diagrams are tensor networks where boxes are tensors and wires
>>>> are  contractions of specific indices.
>>>> 3.- A monad in quantum computing is shown in [7], and quantumly
>>>> motivated semantics is specifically addressed by Coecke. The whole
>>>> book [8] discusses the connection, though I haven't read it. Circuit
>>>> diagrams should be quantum circuits representing possibly unitary
>>>> processes. Quantum amplitudes through measurement give rise to classical
>>>> probabilities.
>>>> 4.- The Giry monad here results from the functor that produces all
>>>> formal convex linear combinations of the elements of a given set. The
>>>> Kleisli category is very interesting, having as maps probabilistic
>>>> mappings that under the hood are just conditional probabilities. This
>>>> maps allow a more user friendly understanding of Markov Chains, Markov
>>>> Decission Processes, HMMs, POMDPs, Naive Bayes classifiers and Kalman
>>>> filters. Circuit diagrams have to correspond to the factor diagrams
>>>> notation of bayesian networks [9], and the law of total probability
>>>> generalizes in bayesian networks to the linear algebra tensor network
>>>> calculations of the corresponding network (this can be shown in actual
>>>> bayesian network software).
>>>>
>>>> A quote from mathematician Gian Carlo Rota [10]:
>>>>
>>>> "The first lecture by Jack [Schwartz] I listened to was given in the
>>>> spring of 1954 in a seminar in functional analysis. A brilliant array
>>>> of lecturers had been expounding throughout the spring term on their
>>>> pet topics. Jack's lecture dealt with stochastic processes.
>>>> Probability was still a mysterious subject cultivated by a few
>>>> scattered mathematicians, and the expression "Markov chain" conveyed
>>>> more than a hint of mystery. Jack started his lecture with the words,
>>>> "A Markov chain is a generalization of a function." His perfect
>>>> motivation of the Markov property put the audience at ease. Graduate
>>>> students and instructors relaxed and followed his every word to the
>>>> end."
>>>>
>>>> The thing I would research would be to use as semantic category that
>>>> of those generalized functions of the former quote and bullet 4. So
>>>> basically, you replace word2vec vectors by probability distributions of
>>>> the words meaning something, connect a bayesian network from the CCG
>>>> parse and apply generalized total probability to obtain probabilized
>>>> booleans, i.e. a number 0 <= x <= 1 (instead of just a boolean as with
>>>> Montague semantics). That is, the probability that a sentence holds
>>>> depends on the distributions of its syntactically elementary
>>>> constituents meaning something, and those distros are combined by
>>>> factors of a bayesian net with conditional independence relations that
>>>> respect and reflect the sentence syntax and have the local Markov
>>>> property. The factors are for words of complex syntactic category (as
>>>> N/N...) and their attached tensors are multivariate conditional
>>>> probability distributions.
>>>>
>>>> Hope this helps somehow. Kind regards,
>>>> Jesus.
>>>>
>>>>
>>>> [1] http://yoavartzi.com/pub/afz-tutorial.acl.2013.pdf
>>>> [2] http://www.cl.cam.ac.uk/~sc609/pubs/eacl14types.pdf
>>>> [3] http://homepages.inf.ed.ac.uk/s1049478/easyccg.html
>>>> [4] http://www.aclweb.org/anthology/D14-1107
>>>> [5] https://arxiv.org/abs/1607.01432
>>>> [6] ISBN 1108107710
>>>> [7] https://bram.westerbaan.name/kleisli.pdf
>>>> [8] ISBN 9780199646296
>>>> [9] http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10799.pdf
>>>> [10] Indiscrete thoughts
>>>>
>>>> On 4/2/17, Linas Vepstas <linasveps...@gmail.com> wrote:
>>>>> Hi Ben,
>>>>>
>>>>> On Sun, Apr 2, 2017 at 3:16 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>>>>
>>>>>>  So e.g. if we find X+Y is roughly equal to Z in the domain
>>>>>> of semantic vectors,
>>>>>>
>>>>>
>>>>> But what Jesus is saying (and what we say in our paper, with all that
>>>>> fiddle-faddle about categories) is precisely that while the concept of
>>>>> addition is kind-of-ish OK for meanings, it can be even better if
>>>>> replaced with the correct categorial generalization.
>>>>>
>>>>> That is, addition -- the plus sign -- is a certain specific morphism,
>>>>> and this morphism, the addition of vectors, has the unfortunate
>>>>> property of being commutative, whereas we know that language is
>>>>> non-commutative. The stuff about pre-group grammars is all about
>>>>> identifying exactly which morphism it is that correctly generalizes
>>>>> the addition morphism.
>>>>>
>>>>> That addition is kind-of OK is why word2vec kind-of works. But I
>>>>> think we can do better.
>>>>>
>>>>> Unfortunately, the pressing needs of having to crunch data, and to
>>>>> write the code to crunch that data, prevent me from devoting enough
>>>>> time to this issue for at least a few more weeks or a month. I would
>>>>> very much like to clarify the theoretical situation here, but need to
>>>>> find a chunk of time that isn't taken up by email and various mundane
>>>>> tasks.
>>>>>
>>>>> --linas
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>>> boundary, I am the peak." -- Alexander Scriabin
>>
>>
>>
>> --
>> Ben Goertzel, PhD
>> http://goertzel.org
>>
>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>> boundary, I am the peak." -- Alexander Scriabin
>
>
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
>
>
