Hi Anton! On Thu, Jan 10, 2019 at 11:45 PM Anton Kolonin @ Aigents <[email protected]> wrote:
> Hi Linas, while digging into MST-Parser code, we have found that some
> of the NLP Scheme code resides in singnet/opencog and some is in
> singnet/atomspace.

My goal is that all developers agree that there is a development branch and a stable branch, know which one is which, work so that all development goes into the development branch, and keep the stable branch stable, by industry-standard definitions of stability. I am concerned that there continues to be confusion about this, and that it will lead to wasted time, bad design, and buggy, inoperable code. I spend vast amounts of my time being "the janitor" who cleans up messes; it is a thankless job, I don't enjoy it, and I get concerned whenever I read something that suggests a big cleanup job is waiting for me in the future.

> I wonder if the idea of having Scheme code in the AtomSpace
> layer has some conceptual justification, or if it is just a historical matter?

There has always been Scheme code in the AtomSpace. The AtomSpace provides all of the core infrastructure for the Scheme bindings.

> For instance, we were unraveling the uses and implementation of
> add-symmetric-mi-compute.

That function is provided as a part of the "matrix" package. That package provides a way of looking at subsets of the AtomSpace as if they were (sparse) matrices. Please recall that a matrix (a 2-tensor) is an N x N grid of values. There are many, many things one can do with a matrix. Almost all of the code in the matrix directory is focused on treating the matrix as a probability P(X,Y) of two random processes. Whenever one has a probability like that, one is typically interested in the marginals (P(X), which is P(X,Y) summed over all Y), the conditional probability P(X|Y), the entropy H(X,Y) and the mutual information MI(X,Y). Another very important quantity is the product P(X,Y) P^T(Y,Z), where ^T denotes the matrix transpose.
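To make those quantities concrete, here is a toy Python sketch. This is not the actual Scheme matrix API; the word-pair counts are invented for illustration. It computes the joint probability, its marginals, the pairwise mutual information, and the row product (P P^T)[x,z] from which a cosine similarity between rows can be read off:

```python
from collections import defaultdict
from math import log2, sqrt

# Invented word-pair counts standing in for sparse AtomSpace pair data.
counts = {("dog", "barks"): 40, ("dog", "runs"): 10,
          ("cat", "runs"): 30, ("cat", "meows"): 20}

total = sum(counts.values())
p_xy = {pair: n / total for pair, n in counts.items()}

# Marginals: P(x) = sum_y P(x,y) and P(y) = sum_x P(x,y)
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# Pairwise mutual information: MI(x,y) = log2 [ P(x,y) / (P(x) P(y)) ]
def mi(x, y):
    return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# The product P P^T: (P P^T)[x,z] = sum_y P(x,y) P(z,y).
# In sparse form, only the columns y shared by rows x and z contribute.
rows = defaultdict(dict)
for (x, y), p in p_xy.items():
    rows[x][y] = p

def row_product(x, z):
    return sum(p * rows[z].get(y, 0.0) for y, p in rows[x].items())

# Cosine similarity between rows x and z falls out of that product:
def cosine(x, z):
    return row_product(x, z) / sqrt(row_product(x, x) * row_product(z, z))
```

The symmetric-MI is a variant of the same row product, with sums over logarithms inserted in strategic places, as described below.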
This product can be used to build the cosine distance between X and Z; it can also be used to build the symmetric-MI, which is like the cosine distance, but has sums over logarithms in strategic places.

You may wonder "why not use an ordinary linear algebra package?" or "why not use Gnu R?" (or SPSS or SciPy, or whatever). There are three reasons for this:

1) The AtomSpace matrices are extremely sparse: for the NLP data, only one in a million entries is non-zero.

2) The NxN matrix has N=100K to 1 million for NLP data, which is more RAM than computers can easily provide. The matrix package has to be optimized for sparse data. Genomic data might have even larger N.

3) It would be marvelous if someone wrote an R wrapper for this stuff. It's not hard. Someone needs to do this. I have been urging the agi-bio guys to do this, because their genomic/proteomic data is also extremely sparse, and because they like to use R for data analysis.

The general justification is that every atom is like a tensor index, and the value attached to that atom is the value of that tensor at that index. Since a collection of atoms is conceptually the same thing as a set of sparse tensors, let's acknowledge that fact, and provide an API that allows ordinary users to access the tensor data as tensors. By "ordinary users" I mean anyone who has ever done statistical analysis, or more generally, any user who uses SciPy or Gnu R to mangle their data. The AtomSpace is not for everyone: it's only for those people who have very sparse data with a Zipfian distribution. But if that is what they have, let them access it in a "normal" data-analytics kind of way, like how you'd do data analytics in other packages.

> If we keep extending the MST-Parser code

The MST parser code is in a different directory; it is not a part of the matrix code. It is much more experimental. The MST parser is meant to be a part of a generic parsing and theorem-proving infrastructure.
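The core idea behind MST parsing can be sketched in a few lines of Python. The MI scores below are invented, and this toy ignores the planarity constraints a real parser would enforce: link word pairs greedily, highest-MI first, keeping only links that leave the parse a tree (a maximum spanning tree).

```python
# Toy maximum-spanning-tree parse over invented word-pair MI scores.
words = ["the", "dog", "barks"]
mi_score = {("the", "dog"): 2.1, ("dog", "barks"): 3.5, ("the", "barks"): 0.2}

parent = {w: w for w in words}   # union-find forest for cycle detection

def find(w):
    while parent[w] != w:
        w = parent[w]
    return w

links = []
for (a, b), s in sorted(mi_score.items(), key=lambda kv: -kv[1]):
    ra, rb = find(a), find(b)
    if ra != rb:                 # this link does not create a cycle
        parent[ra] = rb
        links.append((a, b))
# links: [("dog", "barks"), ("the", "dog")] -- a spanning tree
```

In a real corpus run, the MI scores come from the pair counts accumulated by the matrix package, rather than being supplied by hand.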
Among both the academics and the readers of this mailing list, there is some general understanding that theorem proving, natural deduction, Hilbert-style deduction, sequent calculus, parsing and constraint-solving are all kind-of-ish "the same thing". The goal here is to try to make them actually be "the same thing", by providing something that accomplishes all of the above with the same code base.

To be more explicit: we have the URE, which performs forward and backward chaining. If you look at the chaining algorithm, you promptly realize that it is a certain kind of parsing algorithm. This insight, that parsing and theorem proving are "the same thing", is what prompted the URE to be created. It has been used for the proving side, i.e. for PLN, but it has not been used for parsing, yet. No one has ever attempted to import link-grammar into the URE. (There was also the intent that open-psi would run on top of the URE, but the current URE does not support that mode of operation, and so open-psi exists as a distinct, separate code base.)

If the URE had been sufficiently powerful and robust, we would have been able to import the full English link-grammar dictionary into the URE, run it, and get ordinary LG parses coming out. This is not possible with the current URE design. Given that the current URE is unable to support open-psi, and is unable to support LG, it seemed like it was time to redesign it from the ground up. Thus, the code in the "sheaf" directory is an attempt to re-imagine how theorem proving and parsing can be accomplished in a fashion that is much faster, easier and more usable than the current URE, with a simpler API and a stronger toolset. The paper on sheaves was an attempt to explain how this could be done.

> for account for word, link and
> disjunct frequency

Please understand that disjuncts are a general concept.
They occur not only in natural language; they also occur in biology, and in theorem proving.

> and provide integration with DNN-s

The paper on skip-grams is an attempt to explain how theorem proving is just like deep learning in neural nets. It attempts to explain how these two different systems are really variations on the same theme. Ideally, the API provided by the code in the "sheaf" directory will be able to provide a common API to deep-learning systems, as well as to parsing systems, as well as to PLN, as well as to open-psi, so that one could choose between different algorithms and implementations for processing your data. Currently, this dream is pre-pre-alpha, and it contains only a generic MST parser.

> and add
> incremental/iterative learning capabilities to it,

The goal of tracking disjunct statistics is that this *is* the learning system. Yes, there are other ways of learning. Again, the paper on skip-grams attempts to explain all the different ways in which learning can be accomplished.

> should the changes be
> done both to singnet/opencog and singnet/atomspace following the same
> pattern?

See comments at top about stable and development branches.

> Or, should we better pull all NLP code out from singnet/atomspace to
> singnet/opencog, or even place it in a separate project?

The MST parsing code is intended to be a part of a generic learning system that can be applied to NLP or genetics or to logical induction or to robotic motion control. It is not specific to natural language.

-- Linas

> Ben, Man Hin, Amen - any insights on this?
>
> Thanks,
>
> --
> -Anton Kolonin
> skype: akolonin
> cell: +79139250058

--
cassette tapes - analog TV - film cameras - you
