To expand on this idea a bit further and formulate something a bit more 
concrete, how about we start a GitHub repository under the opencog 
space on GitHub?  I am not sure what the term for them is, but there are 
already repositories for the atomspace, relex, the minecraft code, etc., 
all under the overall umbrella of the opencog organisation.

This repository would host the various statements and other files like 
short stories, as well as perhaps a few minimal scripts in Python or 
similar to do maintenance on the corpus: things like looking for duplicate 
sentences.  Additionally there might be some scripts to feed these 
sentences to something like relex, but those are probably best left to 
the other repositories, as they are more specialised.  The idea is just to 
have one place where all this corpus and training data can be aggregated 
that is under the control of the opencog organisation.
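To make the maintenance-script idea concrete, here is a sketch of the kind of minimal tool I have in mind (the function name, the flat-directory layout, and the one-sentence-per-line convention are all just assumptions for illustration): it scans every topic file and flags sentences that appear more than once.

```python
import sys
from pathlib import Path


def find_duplicates(corpus_dir):
    """Report sentences that appear more than once across the corpus.

    Assumes topic files are *.txt with one sentence per line; comparison
    ignores case and extra whitespace.
    """
    seen = {}        # normalized sentence -> (file, line) of first occurrence
    duplicates = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            sentence = " ".join(line.split()).lower()
            if not sentence:
                continue
            if sentence in seen:
                duplicates.append((path.name, lineno, seen[sentence]))
            else:
                seen[sentence] = (path.name, lineno)
    return duplicates


if __name__ == "__main__" and len(sys.argv) > 1:
    for name, lineno, (first_file, first_line) in find_duplicates(sys.argv[1]):
        print(f"{name}:{lineno} duplicates {first_file}:{first_line}")
```

Deliberate restatements of an idea in different words would pass this check untouched; only verbatim copies get flagged, which fits the "redundant but not repetitive" goal below.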

Making our own corpus has obvious advantages: it can be carefully 
vetted to make sure the grammar and punctuation are correct, it doesn't 
contain nonsense or poorly formatted text, and it can be arranged in the way 
best suited to its purpose as a learning space and/or 
a source of test data for the various reasoning portions of the OpenCog 
system.  Obviously the AI will eventually need to learn to deal with 
language that is not grammatically clean and tidy, but it will be 
far easier to start out with a corpus that is clean and then grow its 
understanding over time.  Also, by hosting it on GitHub, we can take 
advantage of things like pull requests, so that anyone can easily suggest 
changes or additions to the text and play around with it, but nothing gets 
merged into the "stable" master branch of the repository until it meets 
whatever standards we decide upon.  That way, at any given time, you know 
that running tests against the master branch of the corpus should give a 
similar response, just like with stable releases of a software program.
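Some of those standards could even be checked automatically on each pull request. As a purely illustrative sketch (the function name and the specific rules are my own invention, not an agreed standard), a lint pass over submitted sentences might look like:

```python
import re


def lint_sentence(sentence):
    """Return a list of style problems for one corpus sentence.

    These three rules are only examples of the kind of mechanical check
    a pull request could run before a human review.
    """
    problems = []
    if not sentence[:1].isupper():
        problems.append("does not start with a capital letter")
    if not sentence.rstrip().endswith((".", "!", "?")):
        problems.append("missing end punctuation")
    if re.search(r"\s{2,}", sentence):
        problems.append("doubled whitespace")
    return problems
```

For example, `lint_sentence("Cereal is a common breakfast food.")` returns an empty list, while an all-lowercase fragment with no final period would be reported twice. Anything subtler (actual grammar, factual nonsense) would still go to human review.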

The way I envision such a system working is by having individual files with 
a bunch of short sentences on a single topic: something like "breakfast".  
In here you would have things like "Cereal is a common breakfast food", 
"Breakfast is a meal eaten at the start of the day", etc.  The idea is that 
each file would contain lots of overlapping and redundant statements about 
some small aspect of human experience.  Redundancy will be a key idea here; 
unlike something like ConceptNet where each idea is contained in a single 
sentence, we might have 10 different sentence variations on the same idea.  
Each of these files would be something akin to what a kindergartner might 
learn in one afternoon at school.  Then we would have some "meta" files 
which contain lists of these individual files that group them into 
something more akin to a chapter in a grade-school textbook.  One such 
example list would be the files on breakfast, lunch, dinner, common food 
items, restaurants, soup kitchens, etc.  Basically, this groups lots of 
little blocks of knowledge into something broader.
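A very simple way to realise these "meta" files would be plain text: one topic filename per line. Assuming that deliberately minimal format (the `.chapter` extension and the loader below are hypothetical, just to show the shape of the idea), a script could flatten a whole chapter into one sentence list:

```python
from pathlib import Path


def load_chapter(corpus_dir, chapter_file):
    """Gather every sentence from the topic files named in a chapter file.

    Assumed format: the chapter file lists one topic filename per line
    (blank lines and '#' comments skipped); each topic file holds one
    sentence per line.
    """
    corpus = Path(corpus_dir)
    sentences = []
    for name in (corpus / chapter_file).read_text().splitlines():
        name = name.strip()
        if not name or name.startswith("#"):
            continue
        for line in (corpus / name).read_text().splitlines():
            if line.strip():
                sentences.append(line.strip())
    return sentences
```

So a `meals.chapter` file listing `breakfast.txt`, `lunch.txt`, and so on would yield all of their sentences in order, ready to feed to whatever tool is being tested.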

The idea is not to make this grouping a "formal logic" kind of grouping in 
the sense of subsets and predicate logic, but rather to group files in the 
sense of "if you were learning about subject X, what are the kinds of 
things you would read about?"  The list might contain some subjects that 
are only tangentially related to the core topic, but all should have at 
least some overlap in context.  By making lots of little self-contained 
files and then grouping them like this, we make it easy to test the 
system's ability to integrate more and more complex ideas.  If you want to 
simply test a parser to see if it tags parts of speech correctly, you 
can run through individual files without being bombarded with too much 
information.  Then, when you want to test higher-level things like word-sense 
disambiguation, you can easily run through a few "chapters" to see 
how the same words might be used in different contexts.

I think I might start putting together some small examples of such a 
system, but it would be good to have feedback from the devs on this to 
avoid making something that is not useful.  Ultimately you are the ones who 
would be using this kind of data, so I want to structure it in the way that 
makes it easiest for you to feed into the various things you are testing.

-AndrewBuck

On Tuesday, April 4, 2017 at 8:40:36 AM UTC-5, Andrew Buck wrote:
>
> Linas,
>
> You mentioned that a corpus of short factual statements is something that 
> could be useful in the self learning approach you are using.  In another 
> thread I asked about how volunteers like me who lack the coding knowledge 
> to contribute directly might be able to help out.  Working on a corpus of 
> short, grammatically correct sentences is something I had in mind.  Does 
> such a corpus exist, and if not, can you give us a bit of guidance on what 
> your "ideal" corpus might look like?  I know there are things like 
> ConceptNet, and I also know there are a lot of problems with them that make 
> them difficult to use for something like OpenCog.
>
> I have lots of time available to work on something like this; I just need 
> to know what to actually work on so that my efforts are not a waste.
>
> -AndrewBuck
>
