[opencog-dev] Training data

Andrew Buck Thu, 30 Mar 2017 15:14:46 -0700

I would like to contribute to the OpenCog project and have tried to do so
numerous times in the past on various pieces.  However, in every case I
got bogged down by the enormous complexity of both the codebase itself and
the overall design and how all the pieces fit together.  I am sure I am
not the only person in this position of wanting to help out but not having
the time/ability to really dig into the core pieces of the project.

I do, however, have the time to help create and curate training data and
corpora for the various efforts being undertaken by this project. I would
like to start a discussion around this domain.
Basically, what kind of data is needed, what have we got so far, and how
could what we have be extended. One of the ideas I had (and this is just a
"brainstorming" kind of idea which may not be useful) would be to go
through something like the link-parser word lists and put them into
categories. For example the "colors" category would have all the words
pertaining to color (red, green, etc). This category information could
then be used to provide context for OpenCog when it is parsing sentences.
Obviously this is just a simple example, but it illustrates the kind of
"grunt work" that people like me could easily take on. That would let
people who actually can code focus their time on coding, and whenever they
want to test something they would have a nice library of clean, easily
parsed data available to see how their theories work.
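
To make the category idea concrete, here is a minimal sketch in Python of
what such curated data might look like and how it could be looked up. The
category names, the example words, and the helper names (build_index,
tag_sentence) are all made up for illustration; nothing here is taken from
the actual link-parser word lists or from any OpenCog API.

```python
from collections import defaultdict

# Hypothetical hand-curated categories. The words are placeholders and
# are NOT taken from the real link-parser word lists.
CATEGORIES = {
    "colors": {"red", "green", "blue", "orange"},
    "fruits": {"apple", "orange"},
}


def build_index(categories):
    """Invert category -> words into word -> sorted list of categories."""
    index = defaultdict(set)
    for category, words in categories.items():
        for word in words:
            index[word.lower()].add(category)
    return {word: sorted(cats) for word, cats in index.items()}


def tag_sentence(sentence, index):
    """Return (word, categories) pairs for each whitespace-separated token."""
    tagged = []
    for token in sentence.split():
        # Strip surrounding punctuation so "red." matches "red".
        word = token.strip(".,!?;:\"'").lower()
        tagged.append((word, index.get(word, [])))
    return tagged


INDEX = build_index(CATEGORIES)
```

A word such as "orange" can belong to more than one category, which is why
the index maps each word to a list rather than a single label. Calling
tag_sentence("The ball is red.", INDEX) would attach the "colors" category
to "red" and leave the other words untagged.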

Another possible thing that could be created would be a large dataset
consisting of simple sentences like "John threw the ball." and a small bit
of Atomese representing the pieces of information that can be learned from
the sentence. In this example you could learn several things: the obvious
one is the action of the ball being thrown and who threw it; another is
that John no longer has the ball; a third is that John is likely human,
since few other entities can throw a ball; and so on. Basically you would
have a little block of text, one or a few sentences, and then a bunch of
Atomese to go along with it. Then, with a large library of such examples,
you could use PLN or MOSES-type learning to try to map something like
RelEx output into the Atomese in the training data. Again, this is just a
suggestion and may not be that useful, but it illustrates the kinds of
things volunteers like me could work on.
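
As a rough sketch of what one record in such a dataset might look like,
the Python below pairs a sentence with a list of Atomese expressions and
stores records one per line as JSON. The record layout, the field names,
and the helper functions are assumptions made for illustration, and the
Atomese strings are only a guess at a plausible encoding, not an agreed
format. The parenthesis check is a simple sanity test that curators could
run; it does not validate the expressions against the AtomSpace.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TrainingRecord:
    """One hypothetical training example: raw text plus Atomese strings."""
    text: str
    atomese: list = field(default_factory=list)


def parens_balanced(expr):
    """Cheap sanity check: parentheses balance outside of quoted strings."""
    depth = 0
    in_string = False
    for ch in expr:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False
    return depth == 0 and not in_string


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def from_jsonl(data):
    """Parse JSON Lines text back into TrainingRecord objects."""
    return [TrainingRecord(**json.loads(line))
            for line in data.splitlines() if line.strip()]


# Illustrative only: one possible encoding of two facts from the sentence.
EXAMPLE = TrainingRecord(
    text="John threw the ball.",
    atomese=[
        '(EvaluationLink (PredicateNode "throw") '
        '(ListLink (ConceptNode "John") (ConceptNode "ball")))',
        '(InheritanceLink (ConceptNode "John") (ConceptNode "human"))',
    ],
)
```

Keeping each record on its own line would make it easy for volunteers to
add, review, and diff entries in plain text files, while the exact Atomese
conventions are left for the developers to specify.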

For any of these projects we would need some guidance and examples to get
us started, but once the general format of what you would like has been
worked out we should be able to largely carry it forward on our own. I
think there are probably a lot of people in this community who would like
to help out on these kinds of efforts; we just need to know where to start.

-AndrewBuck

