To expand on this idea a bit further and formulate something more concrete, how about we start a GitHub repository under the OpenCog organisation? I'm not sure what the right term for them is, but there are already separate repositories for the atomspace, relex, the minecraft code, etc., all under the overall umbrella of the OpenCog organisation on GitHub.
This repository would host the various statements and other files (short stories, etc.), as well as perhaps a few minimal scripts in Python or similar to do maintenance on the corpus: things like looking for duplicate sentences. There might also be scripts to feed these sentences to something like relex, but those are probably best left to the other repositories, since they are more specialised. The idea is just to have one place, under the control of the OpenCog organisation, where all this corpus and training data can be aggregated.

Making our own corpus has obvious advantages: it can be carefully vetted to make sure the grammar and punctuation are correct and that it doesn't contain nonsense or poorly formatted text, and it can be arranged in the way best suited to its purpose of being a learning space and/or a source of test data for the various reasoning portions of the OpenCog system. Obviously the AI will eventually need to learn to deal with language that is not grammatically clean and tidy, but it will be far easier to start out with a clean corpus and then grow its understanding over time.

Also, by hosting it on GitHub, we can take advantage of things like pull requests, so anyone can easily suggest changes or additions to the text and play around with it, but nothing goes into the "stable" master branch of the repository until it meets whatever standards we decide upon. That way, at any given time, you know that running tests against the master branch of the corpus should produce a similar response, just like with stable releases of a software program.

The way I envision such a system working is by having individual files, each containing a bunch of short sentences on a single topic: something like "breakfast". In there you would have things like "Cereal is a common breakfast food", "Breakfast is a meal eaten at the start of the day", etc.
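As a rough sketch of what one of those maintenance scripts might look like, here is a minimal duplicate-sentence checker. It assumes a flat directory of plain-text files with one sentence per line; that layout, and everything else about the script, is just an illustration on my part, not a settled design:

```python
# Hypothetical maintenance script: report sentences that appear more than
# once anywhere in the corpus. Assumes each corpus file is plain text with
# one sentence per line (an assumption, not a settled format).
from collections import Counter
from pathlib import Path


def find_duplicates(corpus_dir):
    """Return the sentences that occur more than once across all files."""
    counts = Counter()
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            # Normalise whitespace and case so trivial variants match.
            sentence = " ".join(line.split()).lower()
            if sentence:
                counts[sentence] += 1
    return [s for s, n in counts.items() if n > 1]


if __name__ == "__main__":
    import sys
    for dup in find_duplicates(sys.argv[1]):
        print(dup)
```

Other checks (spell-checking, punctuation linting, sentence-length limits) could follow the same pattern of walking the corpus files line by line.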
The idea is that each file would contain lots of overlapping and redundant statements about some small aspect of human experience. Redundancy is a key idea here: unlike something like ConceptNet, where each idea is contained in a single sentence, we might have ten different sentence variations on the same idea. Each of these files would be something akin to what a kindergartner might learn in one afternoon at school.

Then we would have some "meta" files containing lists of these individual files, grouping them into something more akin to a chapter in a grade-school textbook. One such list might contain the files on breakfast, lunch, dinner, common food items, restaurants, soup kitchens, etc., basically grouping lots of little blocks of knowledge into something broader. The idea is not to make this a "formal logic" kind of grouping, in the sense of subsets and predicate logic, but rather to group the files in the sense of "if you were learning about subject X, what are the kinds of things you would read about?" The list might contain some subjects which are only tangentially related to the core topic, but all should have at least some overlap in context.

By making lots of little self-contained files and then grouping them like this, we make it easy to test the system's ability to integrate more and more complex ideas. If you simply want to test a parser to see if it tags parts of speech correctly, you can run through individual files without being bombarded with too much information. Then, when you want to test higher-level things like word-sense disambiguation, you can easily run through a few "chapters" to see how the same words might be used in different contexts.

I think I might start putting together some small examples of such a system, but it would be good to have feedback from the devs on this to avoid making something that is not useful.
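To make the "chapter" idea concrete, here is a minimal sketch of how such a meta file might be consumed. The format I've assumed (one topic-file name per line, with '#' comment lines) is purely illustrative and would need to be agreed upon:

```python
# Hypothetical "chapter" loader: a meta file lists one topic-file name per
# line (blank lines and '#' comments ignored), and loading a chapter just
# concatenates the sentences of every topic file it names. The meta-file
# format here is an illustrative assumption, not a fixed proposal.
from pathlib import Path


def load_chapter(corpus_dir, chapter_file):
    """Return all sentences from the topic files listed in a chapter."""
    corpus = Path(corpus_dir)
    sentences = []
    for name in (corpus / chapter_file).read_text(encoding="utf-8").splitlines():
        name = name.strip()
        if not name or name.startswith("#"):
            continue  # skip blank lines and comments
        sentences.extend(
            line.strip()
            for line in (corpus / name).read_text(encoding="utf-8").splitlines()
            if line.strip()
        )
    return sentences
```

A test harness could then feed either a single topic file (for part-of-speech checks) or a whole chapter (for word-sense disambiguation across contexts) to the parser.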
Ultimately you are the ones who would be using this kind of data, so I want to structure it in the way that makes it easiest for you to feed into the various things you are testing.

-AndrewBuck

On Tuesday, April 4, 2017 at 8:40:36 AM UTC-5, Andrew Buck wrote:
>
> Linas,
>
> You mentioned that a corpus of short factual statements is something that
> could be useful in the self learning approach you are using. In another
> thread I asked about how volunteers like me who lack the coding knowledge
> to contribute directly might be able to help out. Working on a corpus of
> short, grammatically correct sentences is something I had in mind. Does
> such a corpus exist and if not, can you give us a bit of guidance on what
> your "ideal" corpus might look like? I know there are things like
> ConceptNet, and I also know there are a lot of problems with them that make
> them difficult to use for something like OpenCog.
>
> I have lots of time available to work on something like this, I just need
> to know what to actually work on so that my efforts are not a waste.
>
> -AndrewBuck

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/ce2e4e90-4322-47be-8086-f887ab260f2c%40googlegroups.com.
