On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <[email protected]> wrote:
> There are two hard parts to clustering. One is writing all the code to
> get the clusters working in the pipeline. I guess I'll have to do that.
> The other is dealing with words with multiple meanings: "I saw the man
> with the saw" and clustering really needs to distinguish saw the verb
> from saw the noun. Not yet clear about the details of this. i've a
> glimmer of the general idea,
I was thinking of exploring this with (fairly shallow) neural networks...

This paper, https://nlp.stanford.edu/pubs/HuangACL12.pdf, which I've pointed out before, does unsupervised construction of word2vec-type vectors for word senses (thus doing sense disambiguation mixed in with the dimension-reduction process). Now, that algorithm takes sentences as inputs, not parse trees. But I think you could modify the approach to apply to our context in an interesting way. The following describes one way to do this; I'm sure there are others.

1) A first step would be to use the OpenCog pattern miner to mine the surprising patterns from the set of parse trees produced by MST parsing.

2) Then, one could associate with each word-instance W a set of instance-pattern-vectors. Each instance vector is very sparse, and contains an entry for each of the patterns (among the surprising patterns found in step 1) that W is involved in. Given these instance-pattern-vectors, one can also calculate word-pattern-vectors or word-sense-pattern-vectors (by averaging the instance-vectors over all instances of the word or word-sense).

3) Their algorithm involves an embedding matrix L that maps a binary vector, with a 1 in position i representing the i'th word in the dictionary, into a much smaller dense vector. I would suggest instead having an embedding matrix L that maps the pattern-vectors representing words or senses (constructed in step 2) into a much smaller dense vector. This is word2vec-ish, but the data it draws on is the set of patterns observed in a corpus of parse trees...

4) Their algorithm involves, in the local score function, a sequence [x1, ..., xm], where xi is the embedding vector assigned to word i in the sequence being looked at. Instead, we could use a structure like the following, where w is the word being predicted and S is the sentence containing w:

[ avg. embedding vector of words one link to the left of w in the parse tree of S,
  avg.
embedding vector of words one link to the right of w in the parse tree of S,
  avg. embedding vector of words two links to the left of w in the parse tree of S,
  avg. embedding vector of words two links to the right of w in the parse tree of S ]

This context-matrix is a way of capturing "the embedding vectors of the words constituting the context of w in parsed sentence S" as a linear vector... Stopping at "two links away" is arbitrary; probably we want to go 4-5 links away (yielding a vector of length 8-10), but this would have to be experimented with...

Given these changes, one could apply the algorithm in the paper for sense disambiguation and clustering...

Of course, there would also be a lot of other ways to mix up the same ingredients mentioned above... the two unique ingredients I have introduced are

* creating dense vectors for words or senses from pattern-vectors
* creating context-matrices partly capturing the context of a word-instance (or word or sense) based on a corpus of parse trees

...and one could play with these in many different ways. To put it more precisely, there are a lot of ways that one could iteratively

-- cluster word-instances based on their context-matrices (thus generating word labels)
-- learn an embedding matrix (starting from pattern-vectors) that enables accurate skip-gram prediction, based on knowing the labels of the words produced by the clustering done in the preceding step

Mimicking the algorithm from the above paper (with the changes I've suggested) is one way to do this, but there are lots of other ways one could try...

-- Ben

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
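To make steps 2 and 3 concrete, here is a minimal toy sketch in Python/numpy. The pattern vocabulary, the word-instances, and the use of a fixed random projection as a stand-in for the learned matrix L are all illustrative assumptions, not OpenCog APIs:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Pretend step 1 mined these surprising patterns (indices 0..4).
patterns = ["det-noun", "subj-verb", "verb-obj", "prep-noun", "adj-noun"]
n_patterns = len(patterns)

# Step 2: each word-instance W gets a sparse instance-pattern-vector,
# counting the mined patterns W participates in.
# Toy data: (word, instance_id) -> pattern indices the instance occurs in.
instance_patterns = {
    ("saw", 0): [1],        # "saw" as verb: subj-verb
    ("saw", 1): [0, 3],     # "saw" as noun: det-noun, prep-noun
    ("man", 0): [0, 2],
}

def instance_vector(pattern_ids):
    v = np.zeros(n_patterns)
    for i in pattern_ids:
        v[i] += 1.0
    return v

inst_vecs = {k: instance_vector(v) for k, v in instance_patterns.items()}

# Word-pattern-vector = average of the instance-vectors for that word.
grouped = defaultdict(list)
for (word, _), vec in inst_vecs.items():
    grouped[word].append(vec)
word_vecs = {w: np.mean(vs, axis=0) for w, vs in grouped.items()}

# Step 3: an embedding matrix L maps pattern-vectors into a much smaller
# dense vector. Here L is a fixed random projection; in the actual scheme
# it would be learned jointly with the skip-gram-style objective.
d = 2
L = rng.normal(size=(d, n_patterns))

def embed(pattern_vec):
    return L @ pattern_vec

dense_saw = embed(word_vecs["saw"])  # small dense vector for "saw"
```

Note that the two senses of "saw" already get different instance-vectors here, which is what the later clustering step would exploit.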
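The context-matrix of step 4 could be sketched like this -- again a toy sketch, where the sentence, its MST links, and the per-word dense embeddings are made-up stand-ins (real embeddings would come from the matrix L of step 3), and link distance is just BFS distance in the undirected parse tree:

```python
import numpy as np
from collections import deque

d = 2
# Toy dense embeddings per word (stand-ins for the output of matrix L).
emb = {
    "I": np.array([1.0, 0.0]), "saw": np.array([0.0, 1.0]),
    "the": np.array([0.5, 0.5]), "man": np.array([1.0, 1.0]),
    "with": np.array([0.2, 0.8]),
}

# Made-up undirected MST-parse links for "I saw the man with the saw",
# given as pairs of word positions.
words = ["I", "saw", "the", "man", "with", "the", "saw"]
links = [(0, 1), (1, 3), (2, 3), (3, 4), (5, 6), (4, 6)]

def link_distances(n, links, src):
    """BFS distance (in parse links) from src to every reachable position."""
    adj = [[] for _ in range(n)]
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def context_vector(w_pos, max_links=2):
    """Concatenate avg. embeddings of words k links away, left and right
    of w, for k = 1..max_links (zeros where a slot is empty)."""
    dist = link_distances(len(words), links, w_pos)
    parts = []
    for k in range(1, max_links + 1):
        left = [emb[words[p]] for p, dk in dist.items() if dk == k and p < w_pos]
        right = [emb[words[p]] for p, dk in dist.items() if dk == k and p > w_pos]
        parts.append(np.mean(left, axis=0) if left else np.zeros(d))
        parts.append(np.mean(right, axis=0) if right else np.zeros(d))
    return np.concatenate(parts)  # length 2 * max_links * d

cv = context_vector(1)  # context of the first "saw" (the verb)
```

Computing `context_vector` for each instance of "saw" and clustering the results is then the instance-clustering half of the iterative loop described above; going to 4-5 links just means raising `max_links`.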
