Hi Linas, given your assertion that the AtomSpace is fundamentally better than the other graph databases out there, wouldn't it be strategic, from an ecosystem point of view, to reconceptualize the functional distinction between the atomspace repo and the opencog repo as a graph database and (proto-)AGI infrastructure, respectively?
On Fri, Jan 11, 2019 at 9:31 PM Linas Vepstas <[email protected]> wrote:

> Hi Anton!
>
> On Thu, Jan 10, 2019 at 11:45 PM Anton Kolonin @ Aigents <
> [email protected]> wrote:
>
>> Hi Linas, while digging into the MST-Parser code, we have found that
>> some of the NLP Scheme code resides in singnet/opencog and some is in
>> singnet/atomspace.
>
> My goal is that all developers agree that there is a development branch
> and a stable branch, that everyone knows which one is which, that all
> development goes into the development branch, and that the stable branch
> is stable according to industry-standard definitions of stability.
>
> I am concerned that there continues to be confusion about this. I am
> concerned that this will just lead to wasted time, bad design, and bad
> code that is buggy and inoperable.
>
> I spend vast amounts of my time being "the janitor" who cleans up
> messes. It is a thankless job, I don't enjoy it, and I get concerned
> whenever I read something that suggests I have a big cleanup job waiting
> for me in the future.
>
>> I wonder if the idea of having Scheme code in the AtomSpace
>> layer has some conceptual justification, or whether it is just a
>> historical matter?
>
> There has always been Scheme code in the atomspace. The atomspace
> provides all of the core infrastructure for the Scheme bindings.
>
>> For instance, we were unraveling the uses and implementation of
>> add-symmetric-mi-compute.
>
> That function is provided as part of the "matrix" package. That package
> provides a way of looking at subsets of the atomspace as if they were
> (sparse) matrices. Please recall that a matrix (a 2-tensor) is an N x N
> grid of values. There are many, many things one can do with a matrix.
> Almost all of the code in the matrix directory is focused on treating
> the matrix as a probability P(X,Y) of two random processes.
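The bookkeeping the matrix package does can be sketched in plain Python. This is an illustrative sketch only, with made-up counts and names; it is not the Scheme API of `add-symmetric-mi-compute`, just the underlying arithmetic: sparse joint counts, marginals, and pointwise mutual information.

```python
from collections import defaultdict
from math import log2

# Sparse joint counts N(x, y): only observed pairs are stored, mirroring
# how the matrix package treats a subset of the atomspace as a sparse
# matrix. The pairs and counts below are invented for illustration.
counts = {("dog", "barks"): 8, ("dog", "runs"): 2,
          ("cat", "runs"): 4, ("cat", "meows"): 6}

total = sum(counts.values())
p_xy = {pair: n / total for pair, n in counts.items()}

# Marginals: P(x) = sum over y of P(x, y), and P(y) = sum over x.
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

def mi(x, y):
    """Pointwise mutual information MI(x, y) = log2( P(x,y) / (P(x) P(y)) )."""
    return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))
```

Only the non-zero entries are ever touched, which is the whole point once one in a million entries is non-zero.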
> Whenever one has a probability like that, one is typically interested
> in the marginals (P(X), which is P(X,Y) summed over all Y), the
> conditional probability P(X|Y), the entropy H(X,Y), and the mutual
> information MI(X,Y). Another very important quantity is the product
> P(X,Y) P^T(Y,Z), where ^T denotes the matrix transpose. This product can
> be used to build the cosine distance between X and Z; it can also be
> used to build the symmetric MI, which is like the cosine distance, but
> has sums over logarithms in strategic places.
>
> You may wonder "why not use an ordinary linear-algebra package?" or
> "why not use GNU R?" (or SPSS, or SciPy, or whatever). There are three
> reasons for this:
>
> 1) The atomspace matrices are extremely sparse: for the NLP data, only
> one in a million entries is non-zero.
>
> 2) The N x N matrix has N = 100K to 1 million for NLP data, which is
> more RAM than computers can easily provide. The matrix package has to be
> optimized for sparse data. Genomic data might have even larger N.
>
> 3) It would be marvelous if someone wrote an R wrapper for this stuff.
> It's not hard; someone needs to do it. I have been urging the agi-bio
> guys to do this, because their genomic/proteomic data is also extremely
> sparse, and because they like to use R for data analysis.
>
> The general justification is that every atom is like a tensor index, and
> the value attached to that atom is the value of the tensor at that
> index. Since a collection of atoms is conceptually the same thing as a
> set of sparse tensors, let's acknowledge that fact and provide an API
> that allows ordinary users to access the tensor data as tensors. By
> "ordinary users" I mean anyone who has ever done statistical analysis,
> or, more generally, anyone who uses SciPy or GNU R to mangle their data.
>
> The atomspace is not for everyone: it's only for those people who have
> very sparse data with a Zipfian distribution. But if that is what they
> have, let them access it in a "normal" data-analytics kind of way, like
> how you'd do data analytics in other packages.
>
>> If we keep extending the MST-Parser code
>
> The MST parser code is in a different directory; it is not part of the
> matrix code. It is much more experimental. The MST parser is meant to be
> part of a generic parsing and theorem-proving infrastructure. Among both
> academics and readers of this mailing list, there is some general
> understanding that theorem proving, natural deduction, Hilbert-style
> deduction, sequent calculus, parsing, and constraint solving are all
> kind-of-ish "the same thing". The goal here is to try to make them
> actually be "the same thing" by providing something that accomplishes
> all of the above with the same code base.
>
> To be more explicit: we have the URE, which performs forward and
> backward chaining. If you look at the chaining algorithm, you promptly
> realize that it is a certain kind of parsing algorithm. This insight,
> that parsing and theorem proving are "the same thing", is what prompted
> the URE to be created. It has been used for the proving side, i.e. for
> PLN, but it has not been used for parsing yet. No one has ever attempted
> to import link-grammar into the URE. (There was also the intent that
> open-psi would run on top of the URE, but the current URE does not
> support that mode of operation, and so open-psi exists as a distinct,
> separate code base.)
>
> If the URE had been sufficiently powerful and robust, we would have been
> able to import the full English link-grammar dictionary into the URE,
> run it, and get ordinary LG parses coming out. This is not possible with
> the current URE design.
>
> Given that the current URE is unable to support open-psi, and is unable
> to support LG, it seemed like it was time to redesign it from the
> ground up.
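The core idea of an MST parse can be sketched in a few lines of Python. This is a hedged illustration, not the project's Scheme implementation: given a mutual-information score for each word pair, keep the highest-scoring set of links that connects every word without forming a cycle (Kruskal's algorithm with a union-find); the words and scores below are invented.

```python
# Minimal maximum-spanning-tree parse: link the sentence's words with
# the highest-MI set of links that forms a tree (no cycles).
def mst_parse(words, mi_score):
    parent = {w: w for w in words}          # union-find over words

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]   # path compression
            w = parent[w]
        return w

    pairs = [(i, j) for i in range(len(words))
                    for j in range(i + 1, len(words))]
    # Consider candidate links in order of decreasing MI.
    pairs.sort(key=lambda ij: mi_score(words[ij[0]], words[ij[1]]),
               reverse=True)
    links = []
    for i, j in pairs:
        ri, rj = find(words[i]), find(words[j])
        if ri != rj:                        # link creates no cycle: keep it
            parent[ri] = rj
            links.append((words[i], words[j]))
    return links

# Toy usage with invented scores (keys are sorted word pairs):
scores = {("cat", "runs"): 5.0, ("cat", "the"): 3.0, ("runs", "the"): 1.0}
links = mst_parse(["the", "cat", "runs"],
                  lambda a, b: scores[tuple(sorted((a, b)))])
# The two highest-MI links that form a tree are kept.
```

A real MST parser also has to respect word order (no crossing links) and derive disjuncts from the resulting tree; this sketch shows only the spanning-tree selection.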
> Thus, the code in the "sheaf" directory is an attempt to re-imagine how
> theorem proving and parsing can be accomplished in a fashion that is
> much faster, easier, and more usable than the current URE, with a
> simpler API and a stronger toolset. The paper on sheaves was an attempt
> to explain how this could be done.
>
>> to account for word, link and
>> disjunct frequency
>
> Please understand that disjuncts are a general concept. They occur not
> only in natural language; they also occur in biology, and they also
> occur in theorem proving.
>
>> and provide integration with DNNs
>
> The paper on skip-grams is an attempt to explain how theorem proving is
> just like deep learning in neural nets. It attempts to explain how
> these two different systems are really variations on the same theme.
>
> Ideally, the API provided by the code in the "sheaf" directory will be
> able to provide a common API to deep-learning systems, as well as to
> parsing systems, as well as to PLN, as well as to open-psi, so that one
> could choose between different algorithms and implementations to
> process your data.
>
> Currently, this dream is pre-pre-alpha, and it contains only a generic
> MST parser.
>
>> and add
>> incremental/iterative learning capabilities to it,
>
> The goal of tracking disjunct statistics is that this *is* the learning
> system. Yes, there are other ways of learning. Again, the paper on
> skip-grams attempts to explain all the different ways in which learning
> can be accomplished.
>
>> should the changes be
>> done both to singnet/opencog and singnet/atomspace, following the same
>> pattern?
>
> See comments at the top about stable and development branches.
>
>> Or, should we pull all of the NLP code out of singnet/atomspace into
>> singnet/opencog, or even place it in a separate project?
> The MST parsing code is intended to be part of a generic learning
> system that can be applied to NLP, or genetics, or logical induction,
> or robotic motion control. It is not specific to natural language.
>
> -- Linas
>
>> Ben, Man Hin, Amen - any insights on this?
>>
>> Thanks,
>>
>> --
>> -Anton Kolonin
>> skype: akolonin
>> cell: +79139250058
>
> --
> cassette tapes - analog TV - film cameras - you

To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAOoNsN-6OLLgX80SsQmpSNs9uEXsquaD0%3Dpj2h865spQG1OkNQ%40mail.gmail.com.
