Hi Anton! On Thu, Jan 10, 2019 at 11:45 PM Anton Kolonin @ Aigents <[email protected]> wrote:
> Hi Linas, while digging into MST-Parser code, we have found that some
> of the NLP Scheme code resides in singnet/opencog and some is in
> singnet/atomspace.

My goal is that all developers agree that there is a development branch and a stable branch, know which one is which, work so that all development goes into the development branch, and keep the stable branch stable, by industry-standard definitions of stability. I am concerned that there continues to be confusion about this, and that it will lead to wasted time, bad design, and buggy, inoperable code. I spend vast amounts of my time being "the janitor" who cleans up messes; it is a thankless job, I don't enjoy it, and I get concerned whenever I read something that suggests a big cleanup job is waiting for me in the future.

> I wonder if the idea of having Scheme code in the AtomSpace
> layer has some conceptual justification, or if it is just a historical matter?

There has always been Scheme code in the AtomSpace. The AtomSpace provides all of the core infrastructure for the Scheme bindings.

> For instance, we were unraveling the uses and implementation of
> add-symmetric-mi-compute.

That function is provided as a part of the "matrix" package. That package provides a way of looking at subsets of the AtomSpace as if they were (sparse) matrices. Please recall that a matrix (a 2-tensor) is an N x N grid of values. There are many, many things one can do with a matrix. Almost all of the code in the matrix directory is focused on treating the matrix as a probability P(X,Y) of two random processes. Whenever one has a probability like that, one is typically interested in the marginals (P(X), which is P(X,Y) summed over all Y), the conditional probability P(X|Y), the entropy H(X,Y) and the mutual information MI(X,Y). Another very important quantity is the product P(X,Y) P^T(Y,Z), where ^T denotes the matrix transpose.
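To make those quantities concrete, here is a toy Python sketch. This is not the actual Scheme matrix API; the word-pair counts are invented for illustration. It computes the joint probability, its marginals, the pairwise mutual information, and the row product (P P^T)[x,z] from which a cosine similarity between rows can be read off:

```python
from collections import defaultdict
from math import log2, sqrt

# Invented word-pair counts standing in for sparse AtomSpace pair data.
counts = {("dog", "barks"): 40, ("dog", "runs"): 10,
          ("cat", "runs"): 30, ("cat", "meows"): 20}

total = sum(counts.values())
p_xy = {pair: n / total for pair, n in counts.items()}

# Marginals: P(x) = sum_y P(x,y) and P(y) = sum_x P(x,y)
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# Pairwise mutual information: MI(x,y) = log2 [ P(x,y) / (P(x) P(y)) ]
def mi(x, y):
    return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# The product P P^T: (P P^T)[x,z] = sum_y P(x,y) P(z,y).
# In sparse form, only the columns y shared by rows x and z contribute.
rows = defaultdict(dict)
for (x, y), p in p_xy.items():
    rows[x][y] = p

def row_product(x, z):
    return sum(p * rows[z].get(y, 0.0) for y, p in rows[x].items())

# Cosine similarity between rows x and z falls out of that product:
def cosine(x, z):
    return row_product(x, z) / sqrt(row_product(x, x) * row_product(z, z))
```

The symmetric-MI is a variant of the same row product, with sums over logarithms inserted in strategic places, as described below.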
This product can be used to build the cosine distance between X and Z; it can also be used to build the symmetric-MI, which is like the cosine distance, but has sums over logarithms in strategic places.

You may wonder "why not use an ordinary linear algebra package?" or "why not use Gnu R?" (or SPSS or SciPy, or whatever). There are three reasons for this:

1) The AtomSpace matrices are extremely sparse: for the NLP data, only one in a million entries is non-zero.

2) The NxN matrix has N=100K to 1 million for NLP data, which is more RAM than computers can easily provide. The matrix package has to be optimized for sparse data. Genomic data might have even larger N.

3) It would be marvelous if someone wrote an R wrapper for this stuff. It's not hard. Someone needs to do this. I have been urging the agi-bio guys to do this, because their genomic/proteomic data is also extremely sparse, and because they like to use R for data analysis.

The general justification is that every atom is like a tensor index, and the value attached to that atom is the value of that tensor at that index. Since a collection of atoms is conceptually the same thing as a set of sparse tensors, let's acknowledge that fact, and provide an API that allows ordinary users to access the tensor data as tensors. By "ordinary users" I mean anyone who has ever done statistical analysis, or more generally, any user who uses SciPy or Gnu R to mangle their data. The AtomSpace is not for everyone: it's only for those people who have very sparse data with a Zipfian distribution. But if that is what they have, let them access it in a "normal" data-analytics kind of way, like how you'd do data analytics in other packages.

> If we keep extending the MST-Parser code

The MST parser code is in a different directory; it is not a part of the matrix code. It is much more experimental. The MST parser is meant to be a part of a generic parsing and theorem-proving infrastructure.
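The core idea behind MST parsing can be sketched in a few lines of Python. The MI scores below are invented, and this toy ignores the planarity constraints a real parser would enforce: link word pairs greedily, highest-MI first, keeping only links that leave the parse a tree (a maximum spanning tree).

```python
# Toy maximum-spanning-tree parse over invented word-pair MI scores.
words = ["the", "dog", "barks"]
mi_score = {("the", "dog"): 2.1, ("dog", "barks"): 3.5, ("the", "barks"): 0.2}

parent = {w: w for w in words}   # union-find forest for cycle detection

def find(w):
    while parent[w] != w:
        w = parent[w]
    return w

links = []
for (a, b), s in sorted(mi_score.items(), key=lambda kv: -kv[1]):
    ra, rb = find(a), find(b)
    if ra != rb:                 # this link does not create a cycle
        parent[ra] = rb
        links.append((a, b))
# links: [("dog", "barks"), ("the", "dog")] -- a spanning tree
```

In a real corpus run, the MI scores come from the pair counts accumulated by the matrix package, rather than being supplied by hand.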
Among both the academics and the readers of this mailing list, there is some general understanding that theorem proving, natural deduction, Hilbert-style deduction, sequent calculus, parsing and constraint-solving are all kind-of-ish "the same thing". The goal here is to try to make them actually be "the same thing", by providing something that accomplishes all of the above with the same code base.

To be more explicit: we have the URE, which performs forward and backward chaining. If you look at the chaining algorithm, you promptly realize that it is a certain kind of parsing algorithm. This insight, that parsing and theorem proving are "the same thing", is what prompted the URE to be created. It has been used for the proving side, i.e. for PLN, but it has not been used for parsing, yet. No one has ever attempted to import link-grammar into the URE. (There was also the intent that open-psi would run on top of the URE, but the current URE does not support that mode of operation, and so open-psi exists as a distinct, separate code base.)

If the URE had been sufficiently powerful and robust, we would have been able to import the full English link-grammar dictionary into the URE, run it, and get ordinary LG parses coming out. This is not possible with the current URE design. Given that the current URE is unable to support open-psi, and is unable to support LG, it seemed like it was time to redesign it from the ground up. Thus, the code in the "sheaf" directory is an attempt to re-imagine how theorem proving and parsing can be accomplished in a fashion that is much faster, easier and more usable than the current URE, with a simpler API and a stronger toolset. The paper on sheaves was an attempt to explain how this could be done.

> for account for word, link and
> disjunct frequency

Please understand that disjuncts are a general concept.
They occur not only in natural language; they also occur in biology, and in theorem proving.

> and provide integration with DNN-s

The paper on skip-grams is an attempt to explain how theorem proving is just like deep learning in neural nets. It attempts to explain how these two different systems are really variations on the same theme. Ideally, the API provided by the code in the "sheaf" directory will be able to provide a common API to deep-learning systems, as well as to parsing systems, as well as to PLN, as well as to open-psi, so that one could choose between different algorithms and implementations for processing your data. Currently, this dream is pre-pre-alpha, and it contains only a generic MST parser.

> and add
> incremental/iterative learning capabilities to it,

The goal of tracking disjunct statistics is that this *is* the learning system. Yes, there are other ways of learning. Again, the paper on skip-grams attempts to explain all the different ways in which learning can be accomplished.

> should the changes be
> done both to singnet/opencog and singnet/atomspace following the same
> pattern?

See comments at top about stable and development branches.

> Or, should we better pull all NLP code out from singnet/atomspace to
> singnet/opencog, or even place it in a separate project?

The MST parsing code is intended to be a part of a generic learning system that can be applied to NLP or genetics or to logical induction or to robotic motion control. It is not specific to natural language.

-- Linas

> Ben, Man Hin, Amen - any insights on this?
>
> Thanks,
>
> --
> -Anton Kolonin
> skype: akolonin
> cell: +79139250058

--
cassette tapes - analog TV - film cameras - you
